huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0
129.55k stars · 25.72k forks

Adding a Wav2Vec2ForSpeechClassification class #12730

Closed ehcalabres closed 2 years ago

ehcalabres commented 3 years ago

Adding a Wav2Vec2ForSpeechClassification class 🚀

Right now, using any of the Wav2Vec 2.0 models available on the 🤗hub to run a fine-tuning process for a speech classification task means creating a new class that inherits its behaviour from the Wav2Vec2PreTrainedModel class. Although creating these types of models can be done with a bit of research, I find it too complicated to simply use a fine-tuned model once it is shared on the 🤗hub, because you need access to the code of the model class in order to instantiate it and retrieve the model with the from_pretrained() method (and that code may or may not be available at that time).

I think that adding a class like Wav2Vec2ForSpeechClassification to the 🤗transformers library (i.e. working the same way as BertForSequenceClassification and similar models) would be a very nice feature: it would not only make it possible to fine-tune Wav2Vec 2.0 for classification tasks, but would also simplify and accelerate the way a shared model can be used.
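For context, the workaround described above usually boils down to adding a small classification head on top of the encoder's per-frame hidden states: mean-pooling over the time axis, then a linear projection to per-class logits. A minimal pure-Python sketch of that pooling-plus-projection step (all shapes, names, and weights here are illustrative, not the library's actual implementation, which would use torch tensors):

```python
# Illustrative sketch of the classification head such a subclass adds:
# mean-pool the encoder's per-frame hidden states over time, then apply
# a linear layer to get one logit per class. All values are made up.

def mean_pool(hidden_states):
    """Average a (time, hidden_dim) list of frame vectors over time."""
    num_frames = len(hidden_states)
    hidden_dim = len(hidden_states[0])
    return [sum(frame[d] for frame in hidden_states) / num_frames
            for d in range(hidden_dim)]

def linear(vector, weights, bias):
    """Project a pooled vector to per-class logits: logits = W @ v + b."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]

# Toy example: 3 frames of 2-dim hidden states, 2 classes.
hidden_states = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
pooled = mean_pool(hidden_states)       # [3.0, 4.0]
weights = [[1.0, 0.0], [0.0, 1.0]]      # toy weight matrix
bias = [0.5, -0.5]
logits = linear(pooled, weights, bias)  # [3.5, 3.5]
```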

Motivation

Speech has always been a fascinating field of research, both in the way a user interacts with a physical system and vice versa. Taking this into account, and with the great news of having the new Wav2Vec 2.0 model integrated into the 🤗transformers library 🎉, I started a research project on Speech Emotion Recognition (SER) with the idea of fine-tuning a Wav2Vec 2.0 model on this type of emotional dataset. The results I've obtained are very promising and the model seems to work extremely well, so I decided to put the fine-tuned model on the 🤗hub (wip). Additionally, I saw a topic on the 🤗 discussion forums about this same SER task, with its corresponding model on the 🤗hub, which has the same issue when importing it.

With all this, I think that the number of use cases of the Wav2Vec2 model for speech classification tasks is huge, and having a feature like this implemented would greatly simplify the way other developers and researchers can work with these pretrained models.

Your contribution

I can start working on a new PR to address this by implementing the Wav2Vec2ForSpeechClassification class I mentioned above. I already have the code working, and in fact it's pretty similar to the other NLP models that include the SequenceClassification feature.

The idea behind this is to have a much more simplified and generalized way to use and train these models, with the end result being this snippet for straightforward use:

```python
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForSpeechClassification

processor = Wav2Vec2FeatureExtractor.from_pretrained("ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition")
model = Wav2Vec2ForSpeechClassification.from_pretrained("ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition")
```
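Once such a model returns per-class logits, turning them into emotion probabilities is just a softmax over the class dimension. A small stdlib sketch of that final step (the logit values and label names are illustrative, not from any real model):

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for an SER model with illustrative labels.
labels = ["angry", "happy", "neutral", "sad"]
probs = softmax([2.0, 0.5, 1.0, -1.0])
prediction = labels[probs.index(max(probs))]  # "angry"
```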

Let me know if this feature fits the needs of the library in terms of simplicity and integration, and I will start a new PR with these changes. Also let me know whether it is useful and covers an adequate number of use cases, making it worth implementing.

Thank you all for your amazing work 🥇

patrickvonplaten commented 2 years ago

Hey @ehcalabres,

I'm only seeing your issue now sadly :-/ Super sorry to not have answered sooner. @anton-l is working on an official Wav2Vec2- and HubertForSequenceClassification at the moment, here: https://github.com/huggingface/transformers/pull/13153 which should serve your needs then :-)

It would be great if you could take a look at https://github.com/huggingface/transformers/pull/13153 to see whether this design/architecture fits your needs.

ehcalabres commented 2 years ago

Hey @patrickvonplaten, @anton-l,

Thanks a lot for your answer! From what I'm seeing in PR #13153, it's pretty much the same as what I was proposing here, so I think it'll do the job for this kind of audio classification task. I'll try it when it comes out, but it looks fine for the moment. Great!

Just one thing: I've worked mostly in PyTorch, but as I was checking the code I noticed that there's no TensorFlow version of these models (neither for Hubert nor for Wav2Vec2). Do you think it's relevant to implement them? If so, maybe I can help with that, but I don't know if it's something critical.

Anyway, is there anything else I can do to help you with this? Just let me know.

Thanks again!