huggingface / distil-whisper

Distilled variant of Whisper for speech recognition. 6x faster, 50% smaller, within 1% word error rate.

Long form transcription with model.generate #25

Closed: ranipakeyur closed this issue 7 months ago

ranipakeyur commented 7 months ago

For long-form transcription, how do I specify the following parameters when using the model.generate function: chunk_length_s and batch_size?

sanchit-gandhi commented 7 months ago

Hey @ranipakeyur - that's a great question. In Transformers, we make a distinction between model.generate and pipeline:

  1. model.generate is a low-level way to interact with the model. It takes log-mel inputs and returns the predicted token ids, so it is left to the user to implement their own long-form transcription algorithm. In this regard, there is no notion of chunk_length_s. The batch size corresponds to the number of audio inputs you pass in one go (1 audio in -> batch size 1, 2 audios in -> batch size 2).
  2. pipeline assumes you are working with arbitrary-length audio. The audio is chunked into 30-second (or shorter) segments, and each segment is passed to model.generate to get the corresponding predictions. In this regard, it can be viewed as a "wrapper" around model.generate, one which handles long-form audio. Minimal sketches of both paths are below.
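Not from the thread itself, but here is a minimal sketch of the low-level model.generate path, assuming the distil-whisper/distil-large-v2 checkpoint and two short clips from a small test dataset (both names are illustrative; any Whisper-family checkpoint and any <=30 s audio work the same way):

```python
from datasets import load_dataset
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

model_id = "distil-whisper/distil-large-v2"  # illustrative checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id)

# Two short (<= 30 s) clips passed together -> batch size 2
dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
audios = [dataset[0]["audio"]["array"], dataset[1]["audio"]["array"]]

# The processor converts the raw waveforms to padded log-mel spectrograms
inputs = processor(audios, sampling_rate=16_000, return_tensors="pt")

# model.generate maps log-mel inputs to predicted token ids; no chunking happens here
predicted_ids = model.generate(inputs.input_features)
print(processor.batch_decode(predicted_ids, skip_special_tokens=True))
```

And a sketch of the pipeline path, which is where chunk_length_s and batch_size live (the parameter values and the audio path below are illustrative, not prescriptive):

```python
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="distil-whisper/distil-large-v2",
    chunk_length_s=15,  # split long audio into 15 s windows before inference
    batch_size=16,      # number of chunks forwarded through the model at once
)

result = pipe("long_audio.wav")  # hypothetical path to an arbitrarily long file
print(result["text"])
```

Note that batch_size means slightly different things in the two paths: for model.generate it is the number of separate audios, while for the chunked pipeline it is the number of chunks (possibly all from one long audio) processed per forward pass.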
ranipakeyur commented 7 months ago

Thank you @sanchit-gandhi for the detailed answer. This helps.