Closed SeeknnDestroy closed 7 months ago
Hi @SeeknnDestroy, thanks for raising an issue!
There are three parts to the issue being raised.
Regarding the error message: this occurs because the deprecated max_length argument is set in that checkpoint's generation config (cc @sanchit-gandhi).
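A local workaround, sketched below on the assumption the warning is triggered at generation time: pass max_new_tokens via generate_kwargs so the deprecated max_length in the checkpoint's generation config is never relied on. The cap of 128 tokens and the file name are illustrative placeholders, not values from this thread:

```python
def transcribe(audio_path: str, max_new_tokens: int = 128) -> str:
    """Transcribe one file, overriding the deprecated max_length.

    max_new_tokens=128 is an arbitrary illustrative cap, not a value
    recommended in this thread.
    """
    # Imported lazily so the sketch can be read without the heavy
    # transformers dependency installed.
    from transformers import pipeline

    asr = pipeline(
        "automatic-speech-recognition",
        model="distil-whisper/distil-large-v2",
    )
    result = asr(audio_path, generate_kwargs={"max_new_tokens": max_new_tokens})
    return result["text"]


# Usage (downloads the model on first run):
#   print(transcribe("sample.wav"))  # "sample.wav" is a placeholder
```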
The second is about the transcription behaviour. @sanchit-gandhi is best placed to answer about the recommended way to treat long audio files.
The final point is how to configure the model behaviour using inference endpoints, which I'll defer to @philschmid :)
Whisper has a receptive field of 30s. For long-form transcription (>30s audio), we need to enable "chunking" to transcribe chunks of 30s audios incrementally, and then "stitch" the resulting transcriptions together at the boundaries. You can see how to run this in Python here: https://huggingface.co/distil-whisper/distil-large-v2#long-form-transcription
Enabling this is as simple as passing one extra argument to the pipeline: chunk_length_s=15
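Putting the above together, a minimal sketch of chunked long-form transcription with the pipeline (the batch size and audio file name are assumptions, not from this thread):

```python
def build_long_form_asr():
    """Build an ASR pipeline with chunked long-form transcription enabled.

    chunk_length_s=15 splits the audio into 15 s windows that are
    transcribed incrementally and stitched together at the boundaries.
    """
    from transformers import pipeline  # lazy import: heavy dependency

    return pipeline(
        "automatic-speech-recognition",
        model="distil-whisper/distil-large-v2",
        chunk_length_s=15,
        batch_size=16,  # assumption: batch chunks for throughput; tune to your hardware
    )


# Usage (downloads the model on first run):
#   asr = build_long_form_asr()
#   print(asr("long_audio.wav")["text"])  # "long_audio.wav" is a placeholder
```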
I'll leave @philschmid to advise on how to integrate this into your endpoint!
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
System Info
Inference Endpoints
Who can help?
@sanchit-gandhi @Narsil
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
1. Deploy the distil-whisper/distil-large-v2 model via Inference Endpoints with the system configuration above.
2. Run the provided reference code:
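The reference code itself is omitted above; for context, a request to a deployed endpoint typically follows the pattern sketched here (the endpoint URL, token, and content type are placeholder assumptions, not values from this issue):

```python
def query_endpoint(audio_path: str, endpoint_url: str, hf_token: str) -> dict:
    """POST raw audio bytes to an Inference Endpoint and return the JSON reply."""
    import requests  # lazy import so the sketch can be read stand-alone

    with open(audio_path, "rb") as f:
        audio_bytes = f.read()
    response = requests.post(
        endpoint_url,  # placeholder: your endpoint's URL
        headers={
            "Authorization": f"Bearer {hf_token}",  # placeholder token
            "Content-Type": "audio/wav",  # assumption: WAV input
        },
        data=audio_bytes,
    )
    response.raise_for_status()
    return response.json()
```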
Expected behavior
Ideally, the model should transcribe the full content of longer audio inputs without being constrained by the max_length parameter, especially given the warning about its upcoming deprecation. The warning I am getting is shown above (full warning message).
Additional context: we have a Hugging Face enterprise account as @safevideo. Using distil-whisper/distil-large-v2 for ASR, we encounter a UserWarning regarding max_length, which may limit our ability to transcribe longer audio files. We are seeking advice on handling this and, ideally, a way to get full transcriptions of longer audio on Inference Endpoints.