Hey @gau-nernst! I've tried this before and it works very well (even with the pre-trained model). Sharing a code snippet that shows how I did this: https://github.com/sanchit-gandhi/codesnippets/blob/main/whisper-reduce-context.ipynb
However, I'm not sure it's a feature that warrants adding to the transformers library: IMO it's quite a niche application of the Whisper model that diverges somewhat from the official implementation, and it would add extra complexity for a feature that isn't widely used.
I think we can reserve this as a feature that more advanced users can achieve by following the code snippet shared above? WDYT?
Also cc @ArthurZucker for interest!
Thank you @sanchit-gandhi for the reply. Sadly, for my use case, the input size is not always fixed (I perform padding to the longest in a batch) so I need a "dynamic" way to accept shorter audio. I understand the concerns regarding diverging from the official implementation, so it's ok to not implement this feature.
Hmm I see! That's indeed a bit more tricky - in this case, slicing the embeddings on the fly is probably your best bet, as you identified above!
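For readers landing here later, a minimal sketch of that on-the-fly slicing, assuming a recent transformers version (the checkpoint name and the 15-second example are illustrative; some versions also validate the input length against config.max_source_positions, which the sketch updates as well):

```python
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")
encoder = model.model.encoder

# Keep the full 1500-position table around so the 30 s context can be restored.
full_pos_embeds = encoder.embed_positions.weight.data.clone()

def set_encoder_context(num_mel_frames: int):
    # The encoder's two convolutions downsample time by 2x,
    # so 3000 mel frames correspond to 1500 positions.
    num_positions = num_mel_frames // 2
    encoder.embed_positions.weight.data = full_pos_embeds[:num_positions]
    # Some versions of the encoder check the input length against the config.
    model.config.max_source_positions = num_positions

set_encoder_context(1500)  # accept 15 s inputs: mel-specs of shape (B, 80, 1500)
```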
Do you think this feature should be implemented in transformers? I understand if you don't want to. I will close this issue in that case.
I think not as it lends itself to silent errors if the audio is not correctly pre-processed. But more than happy to help you with implementing yourself locally!
Sorry for replying to the old issue.
I agree that always doing the slicing is error-prone. In the issue linked above I proposed passing the embeddings via keyword arguments, so users who want this behaviour have to explicitly construct their own embeddings and pass them in. That way, incorrectly processed inputs still raise an error unless the user opts in.
I used your approach, but I would like the whole process to be as thread-safe as possible, and setting the encoder weights on the fly isn't. Do you have any ideas on how to do it in a safer way? In the end, I would like to load the model once and use it for different tasks, which might need different embedding lengths.
A temporary solution for me is to edit the Whisper modelling file directly. A better solution would be to subclass WhisperEncoder and override the forward() method to your liking.
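A minimal sketch of that subclassing approach, with the forward() body simplified from the real WhisperEncoder (attention masks, head masks, layerdrop, and optional outputs are omitted, and the internals vary a little across transformers versions):

```python
import torch.nn as nn
from transformers.modeling_outputs import BaseModelOutput
from transformers.models.whisper.modeling_whisper import WhisperEncoder

class VariableLengthWhisperEncoder(WhisperEncoder):
    def forward(self, input_features, **kwargs):
        # Same conv front-end as the original forward().
        inputs_embeds = nn.functional.gelu(self.conv1(input_features))
        inputs_embeds = nn.functional.gelu(self.conv2(inputs_embeds))
        inputs_embeds = inputs_embeds.permute(0, 2, 1)  # (B, T // 2, d_model)

        # The actual change: take only as many positions as the input has
        # frames, instead of assuming the full 1500-position (30 s) table.
        embed_pos = self.embed_positions.weight[: inputs_embeds.shape[1]]

        hidden_states = nn.functional.dropout(
            inputs_embeds + embed_pos, p=self.dropout, training=self.training
        )
        for encoder_layer in self.layers:
            hidden_states = encoder_layer(hidden_states, None, None)[0]
        hidden_states = self.layer_norm(hidden_states)
        return BaseModelOutput(last_hidden_state=hidden_states)
```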
Is there a nice way to load and save the models with a subclassed encoder without subclassing the model itself as well?
There are hacks you can do. For example, you can replace an object's class with your own (https://stackoverflow.com/questions/15404256/changing-the-class-of-a-python-object-casting). You can also monkey-patch an object's method (https://www.pythonforthelab.com/blog/monkey-patching-and-its-consequences/). To make it work with any model that contains a WhisperEncoder, you can iterate over its modules with .modules(), check isinstance(module, WhisperEncoder), and patch it accordingly. Model weights shouldn't be a problem as long as you don't change the shape of the weights.
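For example, a sketch of the class-replacement trick, reusing the VariableLengthWhisperEncoder subclass sketched above (the checkpoint name is just an example):

```python
from transformers import WhisperForConditionalGeneration
from transformers.models.whisper.modeling_whisper import WhisperEncoder

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

# Reassigning __class__ swaps the methods but keeps every weight and buffer,
# which is safe here because the subclass adds no parameters of its own.
for module in model.modules():
    if isinstance(module, WhisperEncoder):
        module.__class__ = VariableLengthWhisperEncoder
```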
Yeah, hacks work of course. That's why I asked for a nice method haha.
But thanks anyways!
@sanchit-gandhi will you be open to allowing this fix given extra precautions (for example an environment variable guard)?
I believe your argument was that this is a niche issue, but multiple people have asked for this over the last year, so I'm wondering if you still think that way. I've seen Whisper being overlooked in some applications simply because of this artificial constraint.
Hey @farzadab - sorry to circle back late here! Could you detail an instance where it's not possible to pad the audio to 30 seconds and use the full context window? Overall, we're striving to maintain WER equivalence with the original codebase, where we've seen that using the full context window is necessary.
Feature request
Right now Whisper requires audio data to be padded or truncated to exactly 30s (i.e. a mel-spectrogram with shape (B, 80, 3000)). In the encoder, the only operation that prevents WhisperEncoder from working on shorter audio is the positional embeddings: https://github.com/huggingface/transformers/blob/41aef33758ae166291d72bc381477f2db84159cf/src/transformers/models/whisper/modeling_whisper.py#L902
Simply truncating the positional embeddings will make WhisperEncoder work for shorter audio.
@sanchit-gandhi Knowing you are the audio guy at HF, I would like to hear your input on this.
Motivation
I do understand that the original OpenAI code forces all input to the model to be (B, 80, 3000). However, for encoder-only tasks, such as audio classification, padding with zeros is wasteful, and can even be detrimental, since the zeros may "dilute" the prediction scores.

Your contribution
The fix is simple, like I mentioned above. I don't think it will affect the ASR pipeline as long as the FeatureExtractor still enforces padding to 3000.