Hey @gau-nernst! I've tried this before and it works very well (even with the pre-trained model). Sharing a code snippet that shows how I did this: https://github.com/sanchit-gandhi/codesnippets/blob/main/whisper-reduce-context.ipynb
However, I'm not sure it's a feature that warrants adding to the transformers library: IMO it's quite a niche application of the Whisper model that diverges somewhat from the official implementation, and it would add extra complexity for a feature that isn't widely used.
I think we can reserve this as a feature that more advanced users can achieve by following the code snippet shared above? WDYT?
Also cc @ArthurZucker for interest!
Thank you @sanchit-gandhi for the reply. Sadly, for my use case, the input size is not always fixed (I perform padding to the longest in a batch) so I need a "dynamic" way to accept shorter audio. I understand the concerns regarding diverging from the official implementation, so it's ok to not implement this feature.
Hmm I see! That's indeed a bit more tricky - in this case, slicing the embeddings on the fly is probably your best bet, as you identified above!
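For readers landing here later, a minimal sketch of that on-the-fly slicing, assuming a recent transformers version (the checkpoint name and the 15-second example are illustrative; some versions also validate the input length against config.max_source_positions, which the sketch updates as well):

```python
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")
encoder = model.model.encoder

# Keep the full 1500-position table around so the 30 s context can be restored.
full_pos_embeds = encoder.embed_positions.weight.data.clone()

def set_encoder_context(num_mel_frames: int):
    # The encoder's two convolutions downsample time by 2x,
    # so 3000 mel frames correspond to 1500 positions.
    num_positions = num_mel_frames // 2
    encoder.embed_positions.weight.data = full_pos_embeds[:num_positions]
    # Some versions of the encoder check the input length against the config.
    model.config.max_source_positions = num_positions

set_encoder_context(1500)  # accept 15 s inputs: mel-specs of shape (B, 80, 1500)
```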
Do you think this feature should be implemented in transformers? I understand if you don't want to. I will close this issue in that case.
I think not as it lends itself to silent errors if the audio is not correctly pre-processed. But more than happy to help you with implementing yourself locally!
Sorry for replying to the old issue.
I agree that always doing the slicing is error-prone. In the issue linked above I proposed passing the embeddings via keyword arguments, so users who want this behaviour have to explicitly construct their own embeddings and pass them in. That way, incorrectly processed inputs still raise an error unless the user opts in.
I used your approach, but I would like the whole process to be as thread-safe as possible, and setting the encoder weights on the fly isn't. Do you have any ideas on how to do it in a safer way? In the end, I would like to load the model once and use it for different tasks, which might need different embedding lengths.
A temporary solution for me is to edit the Whisper modelling file directly. A better solution would be to subclass WhisperEncoder and override the forward() method to your liking.
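A minimal sketch of that subclassing approach, with the forward() body simplified from the real WhisperEncoder (attention masks, head masks, layerdrop, and optional outputs are omitted, and the internals vary a little across transformers versions):

```python
import torch.nn as nn
from transformers.modeling_outputs import BaseModelOutput
from transformers.models.whisper.modeling_whisper import WhisperEncoder

class VariableLengthWhisperEncoder(WhisperEncoder):
    def forward(self, input_features, **kwargs):
        # Same conv front-end as the original forward().
        inputs_embeds = nn.functional.gelu(self.conv1(input_features))
        inputs_embeds = nn.functional.gelu(self.conv2(inputs_embeds))
        inputs_embeds = inputs_embeds.permute(0, 2, 1)  # (B, T // 2, d_model)

        # The actual change: take only as many positions as the input has
        # frames, instead of assuming the full 1500-position (30 s) table.
        embed_pos = self.embed_positions.weight[: inputs_embeds.shape[1]]

        hidden_states = nn.functional.dropout(
            inputs_embeds + embed_pos, p=self.dropout, training=self.training
        )
        for encoder_layer in self.layers:
            hidden_states = encoder_layer(hidden_states, None, None)[0]
        hidden_states = self.layer_norm(hidden_states)
        return BaseModelOutput(last_hidden_state=hidden_states)
```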
Is there a nice way to load and save the models with a subclassed encoder without subclassing the model itself as well?
There are hacks you can do. For example, you can replace an object's class with your own (https://stackoverflow.com/questions/15404256/changing-the-class-of-a-python-object-casting). You can also monkey-patch an object's method (https://www.pythonforthelab.com/blog/monkey-patching-and-its-consequences/). To make it work with any model that contains a WhisperEncoder, you can iterate over its modules with .modules(), check isinstance(module, WhisperEncoder), and patch it accordingly. Model weights shouldn't be a problem as long as you don't change the shape of the weights.
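For example, a sketch of the class-replacement trick, reusing the VariableLengthWhisperEncoder subclass sketched above (the checkpoint name is just an example):

```python
from transformers import WhisperForConditionalGeneration
from transformers.models.whisper.modeling_whisper import WhisperEncoder

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

# Reassigning __class__ swaps the methods but keeps every weight and buffer,
# which is safe here because the subclass adds no parameters of its own.
for module in model.modules():
    if isinstance(module, WhisperEncoder):
        module.__class__ = VariableLengthWhisperEncoder
```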
Yeah, hacks work of course. That's why I asked for a nice method haha.
But thanks anyways!
@sanchit-gandhi will you be open to allowing this fix given extra precautions (for example an environment variable guard)?
I believe your argument was that this is a niche issue, but multiple people have asked for this over the last year, so I'm wondering if you still think that way. I've seen Whisper being overlooked in some applications simply because of this artificial constraint.
Hey @farzadab - sorry to circle back late here! Could you detail an instance where it's not possible to pad the audio to 30 seconds and use the full context window? Overall, we're striving to maintain WER equivalence with the original codebase, where we've seen that using the full context window is necessary.
Feature request
Right now Whisper requires audio data to be padded or truncated to exactly 30s (i.e. a mel-spectrogram with shape (B, 80, 3000)). In the encoder, the only operation that prevents WhisperEncoder from working on shorter audio is the positional embeddings: https://github.com/huggingface/transformers/blob/41aef33758ae166291d72bc381477f2db84159cf/src/transformers/models/whisper/modeling_whisper.py#L902
Simply truncating the positional embeddings will make WhisperEncoder work for shorter audio.
@sanchit-gandhi Knowing you are the audio guy at HF, I would like to hear your input on this.
Motivation
I do understand that the original OpenAI code forces all input to the model to be (B, 80, 3000). However, for encoder-only tasks, such as audio classification, padding with zeros is wasteful, and can even be detrimental, since the zeros may "dilute" the prediction scores.

Your contribution
The fix is simple, like I mentioned above. I don't think it will affect the ASR pipeline as long as the FeatureExtractor still enforces padding to 3000.