Closed SeeknnDestroy closed 7 months ago
Hi @SeeknnDestroy, thanks for raising an issue!
There are three parts to the issue being raised.
Regarding the error message: this occurs because the deprecated max_length argument is set in that checkpoint's generation config (cc @sanchit-gandhi).
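A local workaround, sketched below on the assumption the warning is triggered at generation time: pass max_new_tokens via generate_kwargs so the deprecated max_length in the checkpoint's generation config is never relied on. The cap of 128 tokens and the file name are illustrative placeholders, not values from this thread:

```python
def transcribe(audio_path: str, max_new_tokens: int = 128) -> str:
    """Transcribe one file, overriding the deprecated max_length.

    max_new_tokens=128 is an arbitrary illustrative cap, not a value
    recommended in this thread.
    """
    # Imported lazily so the sketch can be read without the heavy
    # transformers dependency installed.
    from transformers import pipeline

    asr = pipeline(
        "automatic-speech-recognition",
        model="distil-whisper/distil-large-v2",
    )
    result = asr(audio_path, generate_kwargs={"max_new_tokens": max_new_tokens})
    return result["text"]


# Usage (downloads the model on first run):
#   print(transcribe("sample.wav"))  # "sample.wav" is a placeholder
```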
The second is about the transcription behaviour. @sanchit-gandhi is best placed to answer about the recommended way to treat long audio files.
The final point is how to configure the model behaviour using inference endpoints, which I'll defer to @philschmid :)
Whisper has a receptive field of 30s. For long-form transcription (>30s audio), we need to enable "chunking" to transcribe chunks of 30s audios incrementally, and then "stitch" the resulting transcriptions together at the boundaries. You can see how to run this in Python here: https://huggingface.co/distil-whisper/distil-large-v2#long-form-transcription
Enabling this is as simple as passing one extra argument to the pipeline: chunk_length_s=15
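Putting the above together, a minimal sketch of chunked long-form transcription with the pipeline (the batch size and audio file name are assumptions, not from this thread):

```python
def build_long_form_asr():
    """Build an ASR pipeline with chunked long-form transcription enabled.

    chunk_length_s=15 splits the audio into 15 s windows that are
    transcribed incrementally and stitched together at the boundaries.
    """
    from transformers import pipeline  # lazy import: heavy dependency

    return pipeline(
        "automatic-speech-recognition",
        model="distil-whisper/distil-large-v2",
        chunk_length_s=15,
        batch_size=16,  # assumption: batch chunks for throughput; tune to your hardware
    )


# Usage (downloads the model on first run):
#   asr = build_long_form_asr()
#   print(asr("long_audio.wav")["text"])  # "long_audio.wav" is a placeholder
```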
I'll leave @philschmid to advise on how to integrate this into your endpoint!
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
System Info
Inference Endpoints
Who can help?
@sanchit-gandhi @Narsil
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
1. Deploy the distil-whisper/distil-large-v2 model via Inference Endpoints with the system configuration above.
2. Run the provided reference code:
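The reference code itself is omitted above; for context, a request to a deployed endpoint typically follows the pattern sketched here (the endpoint URL, token, and content type are placeholder assumptions, not values from this issue):

```python
def query_endpoint(audio_path: str, endpoint_url: str, hf_token: str) -> dict:
    """POST raw audio bytes to an Inference Endpoint and return the JSON reply."""
    import requests  # lazy import so the sketch can be read stand-alone

    with open(audio_path, "rb") as f:
        audio_bytes = f.read()
    response = requests.post(
        endpoint_url,  # placeholder: your endpoint's URL
        headers={
            "Authorization": f"Bearer {hf_token}",  # placeholder token
            "Content-Type": "audio/wav",  # assumption: WAV input
        },
        data=audio_bytes,
    )
    response.raise_for_status()
    return response.json()
```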
Expected behavior
Ideally, the model should transcribe the full content of longer audio inputs without being constrained by the max_length parameter, especially given the warning about its upcoming deprecation. The warning I am getting is shown above (full warning message).
Additional context: we have a Hugging Face enterprise account as @safevideo. Using distil-whisper/distil-large-v2 for ASR, we encounter a UserWarning regarding max_length, which may limit our ability to transcribe longer audio files. We are seeking advice on handling this and, ideally, a way to get full transcriptions of longer audio on Inference Endpoints.