huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

UserWarning: Using `max_length`'s default (448) at Inference Endpoint deployment #28001

Closed SeeknnDestroy closed 7 months ago

SeeknnDestroy commented 9 months ago

System Info

Inference Endpoints

Who can help?

@sanchit-gandhi @Narsil

Information

Tasks

Reproduction

1. Deploy the distil-whisper/distil-large-v2 model via Inference Endpoints with the system configuration above.
2. Run the reference code it provides:

import requests

API_URL = "https://ovibb90ga7zdc5qa.us-east-1.aws.endpoints.huggingface.cloud"
headers = {
    "Authorization": "Bearer XXXXXX",
    "Content-Type": "audio/flac"
}

def query(filename):
    # Read the raw audio bytes and POST them to the endpoint as audio/flac
    with open(filename, "rb") as f:
        data = f.read()
    response = requests.post(API_URL, headers=headers, data=data)
    return response.json()

output = query("sample1.flac")

Expected behavior

Ideally, the model should transcribe the full content of longer audio inputs without being constrained by the max_length parameter, especially given the warning about its upcoming deprecation. Below is the warning I am getting:

Full warning message

2023/12/13 14:22:36 ~ /opt/conda/lib/python3.9/site-packages/transformers/generation/utils.py:1369: UserWarning: Using `max_length`'s default (448) to control the generation length. This behaviour is deprecated and will be removed from the config in v5 of Transformers -- we recommend using `max_new_tokens` to control the maximum length of the generation.

Additional Context: We have a Hugging Face enterprise account as @safevideo. Using distil-whisper/distil-large-v2 for ASR, we are facing a UserWarning regarding max_length, which potentially affects our ability to transcribe longer audio files. We are seeking advice on handling this and, ideally, a way to get full transcriptions of longer audio on Inference Endpoints.

amyeroberts commented 9 months ago

Hi @SeeknnDestroy, thanks for raising an issue!

There are three parts to the issue being raised.

With regards to the warning message: this occurs because the deprecated argument max_length is set in that checkpoint's generation config. @sanchit-gandhi
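For local use of the checkpoint, a minimal sketch of following the warning's advice and setting max_new_tokens explicitly on the generation config (the value 128 is illustrative, not a recommendation from this thread):

from transformers import AutoModelForSpeechSeq2Seq

model = AutoModelForSpeechSeq2Seq.from_pretrained("distil-whisper/distil-large-v2")

# Control generation length via max_new_tokens rather than relying on the
# deprecated max_length default from the checkpoint's generation config.
model.generation_config.max_new_tokens = 128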

The second is about the transcription behaviour. @sanchit-gandhi is best placed to answer about the recommended way to treat long audio files.

The final point is how to configure the model behaviour using inference endpoints, which I'll defer to @philschmid :)

sanchit-gandhi commented 9 months ago

Whisper has a receptive field of 30s. For long-form transcription (>30s audio), we need to enable "chunking": transcribe 30s chunks of audio incrementally, then "stitch" the resulting transcriptions together at the boundaries. You can see how to run this in Python here: https://huggingface.co/distil-whisper/distil-large-v2#long-form-transcription It's as simple as passing one extra argument to the pipeline, chunk_length_s=15; see the sketch below.
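A condensed sketch of that example, based on the model card linked above (device/dtype handling simplified; "sample1.flac" reuses the filename from the reproduction):

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "distil-whisper/distil-large-v2"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=15,  # enables chunked long-form transcription
    batch_size=16,
    torch_dtype=torch_dtype,
    device=device,
)

result = pipe("sample1.flac")
print(result["text"])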

I'll leave @philschmid to advise on how to integrate this into your endpoint!
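In the meantime, one possible route is a custom handler: Inference Endpoints allow shipping a handler.py with an EndpointHandler class alongside the model, which could construct the pipeline with chunking enabled. This is a hypothetical sketch only, not @philschmid's answer; it assumes the toolkit delivers the request body under data["inputs"]:

from typing import Any, Dict

import torch
from transformers import pipeline

class EndpointHandler:
    def __init__(self, path=""):
        # Load the model from the endpoint's repository with chunking enabled
        self.pipe = pipeline(
            "automatic-speech-recognition",
            model=path,
            chunk_length_s=15,
            torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
            device=0 if torch.cuda.is_available() else -1,
        )

    def __call__(self, data: Dict[str, Any]) -> Dict[str, str]:
        # Assumption: the raw audio arrives under "inputs"
        audio = data["inputs"]
        result = self.pipe(audio)
        return {"text": result["text"]}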

github-actions[bot] commented 8 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.