jonatasgrosman / huggingsound

HuggingSound: A toolkit for speech-related tasks based on Hugging Face's tools
MIT License
432 stars · 43 forks

Inference issue after finetuning for spanish #39

Open ogarciasierra opened 2 years ago

ogarciasierra commented 2 years ago

Hi!

First of all, thank you for your code and your models! Really really useful!

I've used the fine-tuning script to try to fine-tune it for Spanish with the Common Voice dataset. However, at inference time, given any audio from the Common Voice test set, the model generates an empty string. Have you faced this issue before? I didn't change anything in your script, so I don't know where the problem could be.
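For reference, my inference call is basically the stock huggingsound usage from the README (a minimal sketch; the checkpoint path and audio file names below are placeholders):

```python
from huggingsound import SpeechRecognitionModel

# Load the fine-tuned checkpoint (placeholder path).
model = SpeechRecognitionModel("path/to/finetuned-spanish-model")

# Transcribe a few Common Voice test clips (placeholder paths).
audio_paths = ["cv_test_0001.mp3", "cv_test_0002.mp3"]
transcriptions = model.transcribe(audio_paths)

for item in transcriptions:
    # Each result is expected to carry the decoded text under "transcription";
    # after fine-tuning, this always comes back as an empty string here.
    print(repr(item["transcription"]))
```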

Thanks again! (ps. your jonatasgrosman/wav2vec2-xls-r-1b-spanish model is amazing, congrats!!)

siasio commented 2 years ago

Hello!

I'm using the code in Google Colab for a different language, and I am facing the same issue. Yesterday I ran the training for almost two hours, but during inference I always get an empty string.

I looked a bit into what happens during inference and saw that torch.argmax(logits, dim=-1) returns an array of padding tokens. For comparison, torch.argmin returns mostly start-of-sentence and end-of-sentence tokens.
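Roughly how I inspected it (a sketch using the underlying transformers model directly; the checkpoint directory and audio file are placeholders):

```python
import torch
import librosa
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Placeholder: directory produced by the fine-tuning run.
checkpoint = "path/to/finetuned-checkpoint"
processor = Wav2Vec2Processor.from_pretrained(checkpoint)
model = Wav2Vec2ForCTC.from_pretrained(checkpoint)

# Load one evaluation clip, resampled to the 16 kHz the model expects.
speech, _ = librosa.load("sample.wav", sr=16_000)
inputs = processor(speech, sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits   # shape: (batch, time, vocab)

predicted_ids = torch.argmax(logits, dim=-1)
print(predicted_ids)                              # here: almost only the padding/blank id
print(processor.batch_decode(predicted_ids))      # which the CTC decoder turns into ""
```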

ogarciasierra commented 2 years ago

Hi! Do you see any solution? Thanks!

jonatasgrosman commented 2 years ago

Hi @siasio and @ogarciasierra !

I don't know what's going on :( Could you send me a colab that reproduces that issue?

siasio commented 2 years ago

I tried to google the issue and found the following advice on a forum:

"The usual finetuning training behavior looks sth like:

  • Beginning: Output random chars
  • Early: Output nothing - empty strings - looks like you are here?
  • After a while: Starts to spit out more relevant chars"

I decided to give it a try and ran the training for longer. However, I often run into out-of-memory errors, so it goes slowly and I haven't obtained good results yet.

I am running the training on a set of mp4 videos, each shorter than 20 s (44 minutes of recordings altogether). I use batch size = 2 (due to the memory issues). So far I've run a few epochs of training (564 training steps altogether) and I am getting empty strings in the transcriptions, with wer=1.0 and cer=1.0 in evaluation. The videos I use have various sampling rates, but as far as I understand that doesn't matter, because huggingsound resamples them to 16 kHz.

EDIT: What is your experience with training, @jonatasgrosman? Do 564 steps at batch size = 2 sound like way too few? If so, I will just keep training the model for more epochs. Otherwise, it might be useful to look into the code / at the data.
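For completeness, the data I pass to huggingsound looks roughly like this (a sketch assuming the {"path", "transcription"} list format from the README, with placeholder file names and labels):

```python
# Each entry points at one recording; huggingsound decodes and resamples the
# audio internally (as mentioned above), so the original sampling rate of the
# mp4 files should not matter.
train_data = [
    {"path": "clips/recording_0001.mp4", "transcription": "placeholder label one"},
    {"path": "clips/recording_0002.mp4", "transcription": "placeholder label two"},
]

# This list is then passed to model.finetune(...) together with a token set.
```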

jonatasgrosman commented 2 years ago

Hi @siasio

In my experience, 564 steps with a batch size of 2 samples is too little to see good results when fine-tuning those wav2vec2-based models, unless you're using an already fine-tuned model and are just trying to adapt it to a new domain.

What model are you fine-tuning?

siasio commented 2 years ago

I'm fine-tuning wav2vec2-large-xlsr-53 on recordings of Russian. I didn't take an already fine-tuned Russian model because I'm trying to predict not transcriptions but lexical-accent information for every syllable (EDIT: so I need a different token set).
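Concretely, I build the token set from my own label alphabet instead of the usual transcription characters, roughly like this (a sketch; the accent-label tokens below are made-up placeholders):

```python
from huggingsound import TokenSet

# Hypothetical label alphabet: vowel symbols plus a marker for whether the
# syllable carries the lexical accent (stress), instead of ordinary letters.
tokens = ["a", "e", "i", "o", "u", "+", " "]
token_set = TokenSet(tokens)

# token_set is then passed to model.finetune(...), which is why a model
# already fine-tuned for Russian transcription (with its own vocabulary)
# would not fit here.
```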

After how many steps should I expect to start getting sensible results?

jonatasgrosman commented 2 years ago

There's no rule of thumb here 'cause it depends on your hyperparameters and dataset. But I've generally seen promising outputs only after 1k steps. However, I generally use a batch size larger than yours, with at least 24 samples of < 20-sec audio.

If you don't have enough memory to increase the batch size, you can use the gradient_accumulation_steps parameter to emulate that. With gradient_accumulation_steps=12 and batch_size=2 you get an effective batch size of 24 samples. However, to reach a fine-tuning equivalent to 1k steps, you'll also need to multiply max_steps by gradient_accumulation_steps, so your max_steps parameter needs to be 12000.
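In huggingsound that would look roughly like this (a sketch; I'm assuming the TrainingArguments fields keep the Hugging Face-style names shown below and that finetune() accepts them via training_args - check the library's trainer for the exact spelling):

```python
from huggingsound import SpeechRecognitionModel, TokenSet, TrainingArguments

model = SpeechRecognitionModel("facebook/wav2vec2-large-xlsr-53")
token_set = TokenSet(list("abcdefghijklmnopqrstuvwxyz' "))   # placeholder alphabet
train_data = [{"path": "clip_0001.wav", "transcription": "placeholder text"}]

# Effective batch = per_device_train_batch_size * gradient_accumulation_steps
training_args = TrainingArguments(
    per_device_train_batch_size=2,   # what fits in GPU memory
    gradient_accumulation_steps=12,  # 2 * 12 = 24 samples per update
    max_steps=12_000,                # scaled as described above
)

model.finetune(
    "output_dir",
    train_data=train_data,
    token_set=token_set,
    training_args=training_args,
)
```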

I hope I've helped you :)

siasio commented 2 years ago

Thanks a lot for the advice!

qinyuenlp commented 2 years ago

> @siasio: I decided to give it a try and ran the training for longer. [...] So far I've run a few epochs of training (564 training steps altogether) and I am getting empty strings in the transcriptions, with wer=1.0 and cer=1.0 in evaluation.

Have you solved it? I'm facing the same problem. I fine-tuned a Chinese wav2vec2.0 (large) model with mini_batch_size = 4 and gradient_accumulation_steps = 32 (an effective batch_size of 128) for almost 30,000 steps, but the model still hasn't converged.

I found the same theory you quoted above, but it really confuses me.

qinyuenlp commented 2 years ago

Also, this problem seems to be a property of the CTC loss.

siasio commented 2 years ago

> Have you solved it? I'm facing the same problem. [...]

Hello! For me, Jonatas' advice proved to be sufficient. I just needed to wait longer. At one point I started getting strings consisting of the same character repeated many times, and soon other characters started appearing too. It took even fewer steps than Jonatas said, but maybe the task I was solving was just a bit easier than ordinary transcription (I only needed to distinguish between vowels; I didn't care about consonants).

It seems you have already been waiting quite a long time, so I don't know what is wrong. Perhaps your task is a bit harder for the network - could it be that transcribing Chinese is more difficult than English? Not sure. Perhaps the problem is somewhere else.

qinyuenlp commented 2 years ago

Thank you! I found I had made a stupid mistake when setting the learning rate. After continuing to fine-tune last week's model with a smaller lr, the model has now converged. CTC loss does require more patience.

olek-pg commented 2 years ago

@qinyuenlp what learning rate did you use in the end?

qinyuenlp commented 2 years ago

> @qinyuenlp what learning rate did you use in the end?

5e-4