Closed: ranipakeyur closed this issue 7 months ago.
Hey @ranipakeyur - that's a great question. In Transformers, we make a distinction between `model.generate` and `pipeline`:

- `model.generate` is a low-level way to interact with the model. It takes log-mel inputs and returns the predicted token ids, so it is left to the user to implement their own long-form transcription algorithm. In this regard, there is no notion of `chunk_length_s`. The batch size corresponds to the number of audio inputs you pass in one go (1 audio in -> batch size 1, 2 audios in -> batch size 2).
- `pipeline` assumes you are working with arbitrary-length audio. The audio is chunked into 30-second (or shorter) segments, and each one is passed to `model.generate` to get the corresponding predictions. In this regard, it can be viewed as a "wrapper" around `model.generate`, one which handles long-form audio.

Thank you @sanchit-gandhi for the detailed answers. This helps.
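To make the chunk-and-batch idea concrete, here is a minimal sketch in plain Python (not the actual `pipeline` internals; the function names and defaults are illustrative assumptions) of how arbitrary-length audio can be split into 30-second segments and grouped into batches, where each batch would correspond to one `model.generate` call:

```python
def chunk_audio(samples, sampling_rate=16_000, chunk_length_s=30):
    """Split a 1-D sequence of audio samples into fixed-length chunks.

    The last chunk may be shorter than chunk_length_s, mirroring the
    "30-second (or shorter) segments" described above.
    """
    chunk_size = int(chunk_length_s * sampling_rate)
    return [samples[i:i + chunk_size] for i in range(0, len(samples), chunk_size)]


def make_batches(chunks, batch_size=8):
    """Group chunks into batches; each batch would be one model.generate call."""
    return [chunks[i:i + batch_size] for i in range(0, len(chunks), batch_size)]


# 95 seconds of dummy audio at 16 kHz -> four chunks: 30 s, 30 s, 30 s, 5 s
audio = [0.0] * (95 * 16_000)
chunks = chunk_audio(audio)
batches = make_batches(chunks, batch_size=2)  # -> 2 batches of chunks
```

With the real `pipeline`, you would pass `chunk_length_s` and `batch_size` directly as arguments instead of implementing this yourself.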
For long-form transcription, how do I specify the following parameters when using the `model.generate` function: `chunk_length_s` and `batch_size`?