huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

How to train the new wav2vec unsupervised model using Hugging Face? #12144

Open ImtiazKhanDS opened 3 years ago

ImtiazKhanDS commented 3 years ago

🚀 Feature request

How can we train the new wav2vec unsupervised model using Hugging Face? The paper link is: https://ai.facebook.com/research/publications/unsupervised-speech-recognition

Motivation

Your contribution

patrickvonplaten commented 3 years ago

The pretraining of wav2vec2-u is a pretty complex training pipeline. It'll probably still take a while until we have this merged.

shiv6146 commented 2 years ago

@patrickvonplaten @patil-suraj any updates on this yet?

patrickvonplaten commented 2 years ago

I won't have time in the near future to work on this - feel free to give it a try though. It's a very cool paper :-)

neonbjb commented 2 years ago

Hey HF team, I see you have an example up of how to perform this pre-training. First of all - thank you very much for this work!

I'm trying to use this code to train a wav2vec-style model for music. As the above link indicated was likely, I ran into some training stability issues.

One thing that particularly helped me with this was reducing the codebook size. The wav2vec paper does an ablation study on the number of groups and vectors (G and V) and finds that small codebooks work very well. I have been experimenting with G=8 and V=8, which seems more likely to produce a stable training run for my dataset. It might be worth looking into for LibriSpeech if you find the time (or if someone else sees this and is struggling).
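For concreteness, here is a minimal sketch of how I set those codebook sizes via `Wav2Vec2Config` (assuming `num_codevector_groups` maps to G and `num_codevectors_per_group` maps to V; the G=8, V=8 values are just what worked for my music data, not a recommendation for LibriSpeech):

```python
from transformers import Wav2Vec2Config, Wav2Vec2ForPreTraining

# Shrink the quantizer codebook: num_codevector_groups ~ G,
# num_codevectors_per_group ~ V (the paper's defaults are G=2, V=320).
config = Wav2Vec2Config(
    num_codevector_groups=8,      # G
    num_codevectors_per_group=8,  # V
)
model = Wav2Vec2ForPreTraining(config)
print(config.num_codevector_groups, config.num_codevectors_per_group)
```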

I also had one other question: What was the reasoning behind this initialization choice? https://github.com/huggingface/transformers/blob/main/src/transformers/models/wav2vec2/modeling_wav2vec2.py#L1055

The mean and variance of the Linear weights after this initialization are very close to the same statistics under PyTorch's default initialization (which uses kaiming_uniform_). The difference with your initialization is that it doesn't automatically scale with fan_in and it draws from a normal distribution. I didn't see anything in the paper about either of these details and was just wondering why this was done.
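To illustrate the comparison, here is a rough sketch with illustrative numbers (assuming a 768-wide layer and a 0.02 std for the normal init; these are guesses for the example, not the exact values in the modeling file):

```python
import torch
import torch.nn as nn

# PyTorch's default Linear init: kaiming_uniform_ with a=sqrt(5), so the
# weight range automatically shrinks with 1/sqrt(fan_in).
default_linear = nn.Linear(768, 768)

# A fixed-std normal init of the kind discussed above; the std here is an
# illustrative value, not necessarily the one used in modeling_wav2vec2.py.
normal_weight = torch.empty(768, 768).normal_(mean=0.0, std=0.02)

print("default (kaiming_uniform_) std:", default_linear.weight.std().item())
print("fixed normal std:              ", normal_weight.std().item())
```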

Thanks again for this! It's great work!

patrickvonplaten commented 2 years ago

Hey @neonbjb,

I think the init here was just copy-pasted from what we had for other models. I think fairseq actually uses the default init values for the attention layers: https://github.com/facebookresearch/fairseq/blob/b5a039c292facba9c73f59ff34621ec131d82341/fairseq/modules/multihead_attention.py#L64 . So maybe we should use that here as well. Does the kaiming_uniform_ init work better for you? Definitely open to a PR here to change it.
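Roughly, the change being discussed would look something like the sketch below, i.e. leave `nn.Linear` with PyTorch's default init instead of re-drawing the weights from a normal distribution (this is only an illustration of the idea, not the actual patch):

```python
import torch.nn as nn

def _init_weights(self, module):
    """Hypothetical variant of the init: keep PyTorch defaults for Linear."""
    if isinstance(module, nn.Linear):
        # nn.Linear.reset_parameters() already applied kaiming_uniform_
        # (scaled by fan_in) at construction time, so do nothing here.
        pass
    elif isinstance(module, nn.LayerNorm):
        module.bias.data.zero_()
        module.weight.data.fill_(1.0)
```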

neonbjb commented 2 years ago

I don't think the choice between uniform and normal distributions in the init made an appreciable difference; I was just trying to understand the choice. Reducing the size of V (and increasing G) made the biggest difference in stability.

patrickvonplaten commented 2 years ago

BTW, if I understood correctly, the Data2Vec authors stated that Data2Vec performs better than Wav2Vec2 mainly because it makes no assumption about the number of sound units a spoken language has (= the number of codebook vectors). That number of codebook vectors is a somewhat arbitrary choice and can vary strongly depending on the language. A big gain of Data2Vec is that there is no codebook hyper-parameter at all, which makes the model generalize better.

@alexeib please correct me if I'm wrong here :sweat_smile: