khanld / ASR-Wav2vec-Finetune

:zap: Finetune Wav2vec 2.0 For Speech Recognition

Can I use an English dataset for this repo? #7

Open Shaobo-Z opened 1 year ago

Shaobo-Z commented 1 year ago

In the source code, you used Vietnamese for training and validation. If I want to fine-tune a model in English with an English dataset, is there anything I should change?

khanld commented 1 year ago

No, you just have to prepare the English dataset

Shaobo-Z commented 1 year ago

This is what my dataset looks like ⬇ [image]

And this is what I got ⬇. There are changes in train_loss, train_lr, etc. However, the train_wer is always 1.0000. [image]

Checked:

  1. Sample rate: using librosa.get_samplerate, I got 16000.
  2. The transcripts are correct.
  3. I only modified the file_path and iterations in the config file.
  4. The pre-trained model is facebook/wav2vec2-base.

I tried multiple things, but the result remains the same. Any ideas, please?
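For context, a train_wer pinned at exactly 1.0000 usually means the model is emitting empty or entirely wrong hypotheses for every utterance. A minimal word-error-rate sketch (not the repo's implementation) shows why an empty prediction always scores 1.0:

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between first i ref words and first j hyp words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(r)][len(h)] / max(len(r), 1)

print(wer("hello world", ""))            # 1.0 (blank prediction -> every word deleted)
print(wer("hello world", "hello word"))  # 0.5 (one substitution out of two words)
```

So if the CTC head only ever emits blanks early in training (common with a cold classification head), WER stays at exactly 1.0 until the model starts producing real tokens.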

khanld commented 1 year ago

I can see that your dataset is relatively small, so the number of update steps per epoch is only 5. Have you tried a longer run to check whether the behavior remains? Also take a look at the vocab.json file to verify it contains the correct English characters.
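Since the vocab is typically built from the training transcripts, a vocab left over from a Vietnamese run can silently lack letters English needs. A quick sanity-check sketch (the `viet_vocab` dict below is an illustrative stand-in, not the repo's actual file):

```python
import string

def missing_english_chars(vocab: dict) -> list:
    """Return the lowercase a-z characters absent from a CTC vocab mapping."""
    return [c for c in string.ascii_lowercase if c not in vocab]

# Example: a vocab built from Vietnamese-only text lacks f, j, w, z
# (the Vietnamese alphabet does not use those letters).
viet_vocab = {c: i for i, c in enumerate("abcdeghiklmnopqrstuvxyq")}
print(missing_english_chars(viet_vocab))  # ['f', 'j', 'w', 'z']
```

In practice you would `json.load` the generated vocab.json and run the same check; any missing letter means those characters can never be predicted, which keeps WER high no matter how long you train.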

ghosthunterk commented 1 year ago

Encountered the same problem even with a larger dataset (91 steps and 20 epochs).

khanld commented 1 year ago

I have not tried other-language datasets yet. Can you share more information about your dataset, config, tensorboard, etc.?

ghosthunterk commented 1 year ago

Python 3.8. I pip-installed everything in requirements.txt, except torch 1.7.1, which I had to install via `conda install pytorch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2 cudatoolkit=11.0 -c pytorch` because I have CUDA 11.4. I tried both the VIVOS dataset and the Common Voice dataset, stored in .txt files (read with pandas, separated by "|") with two columns: path (path on the server) and transcript (UTF-8 encoded). When I tried to print the pred and label, I got these: [image]
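The "|"-separated two-column layout described above can be read with pandas (`pd.read_csv(path, sep="|")`) or, equivalently, the stdlib csv module. A self-contained sketch with hypothetical rows:

```python
import csv
import io

# A two-column "path|transcript" manifest, as described above (hypothetical rows).
manifest = """path|transcript
clips/0001.wav|hello world
clips/0002.wav|speech recognition
"""

# DictReader with delimiter="|" mirrors pd.read_csv(..., sep="|")
rows = list(csv.DictReader(io.StringIO(manifest), delimiter="|"))
print(rows[0]["path"], "->", rows[0]["transcript"])  # clips/0001.wav -> hello world
```

One thing worth checking with this format: a stray "|" inside a transcript will silently shift columns, so it can be worth asserting that every row has exactly two fields before training.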

ghosthunterk commented 1 year ago

The audio files are already pre-processed to a 16000 Hz sampling rate and .wav format. [image]
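For anyone else debugging this, a stdlib-only way to confirm each clip really is 16 kHz, mono, 16-bit .wav (the synthetic clip below just makes the check self-contained; in practice you would loop over your real files):

```python
import math
import struct
import tempfile
import wave

def check_wav(path: str, expect_sr: int = 16000):
    """Return (sample_rate, channels, sample_width, ok) for a .wav file."""
    with wave.open(path, "rb") as w:
        sr, ch, sw = w.getframerate(), w.getnchannels(), w.getsampwidth()
    ok = sr == expect_sr and ch == 1 and sw == 2  # 16 kHz, mono, 16-bit PCM
    return sr, ch, sw, ok

# Build a synthetic 0.1 s, 16 kHz sine clip so the check is runnable as-is.
path = tempfile.mktemp(suffix=".wav")
with wave.open(path, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)       # 2 bytes = 16-bit samples
    w.setframerate(16000)
    w.writeframes(b"".join(
        struct.pack("<h", int(10000 * math.sin(2 * math.pi * 440 * i / 16000)))
        for i in range(1600)))

print(check_wav(path))  # (16000, 1, 2, True)
```

A file that is 16 kHz but stereo, or 16 kHz but 8-bit, would pass a bare `librosa.get_samplerate` check yet still confuse the feature extractor, so checking all three properties is cheap insurance.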

khanld commented 1 year ago

I can see that your model has not converged yet; the train loss is still high. Try increasing the learning rate for faster training.

khanld commented 1 year ago

Ping me at khanhld218@gmail.com for better debugging, since I rarely check the GitHub notifications.

ghosthunterk commented 1 year ago

> Ping me at khanhld218@gmail.com for better debugging, since I rarely check the GitHub notifications.

Already done, thanks.

Shaobo-Z commented 1 year ago

Is it possible to get an update on this question? What is the minimum size of the dataset? I want to train the model with a 20-minute dataset. Do you think that is possible?


khanld commented 1 year ago

I will take a look at my code, run some experiments on English datasets, and respond to you soon @Shaobo-Z

ghosthunterk commented 1 year ago

[image] So after experimenting for a while, I found that increasing the learning rate (to above 1e-5) and setting the scheduler's max learning rate to >= 1e-4 helped the model actually learn; just be patient.
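The exact scheduler the repo uses may differ, but a piecewise-linear one-cycle-style schedule with the max learning rate set to 1e-4, as suggested above, could be sketched like this (warmup fraction and step counts are illustrative):

```python
def one_cycle_lr(step: int, total_steps: int,
                 max_lr: float = 1e-4, pct_warmup: float = 0.1) -> float:
    """Piecewise-linear one-cycle-style schedule: ramp up, then decay to 0."""
    warm = int(total_steps * pct_warmup)
    if step < warm:
        # linear warmup from 0 to max_lr
        return max_lr * step / max(warm, 1)
    # linear decay from max_lr back to 0
    return max_lr * (1 - (step - warm) / max(total_steps - warm, 1))

# With 91 steps/epoch and 10 epochs (910 total), the peak lands at step 91:
print(one_cycle_lr(0, 910))    # 0.0   (start of warmup)
print(one_cycle_lr(91, 910))   # 0.0001 (peak = max_lr)
print(one_cycle_lr(910, 910))  # 0.0   (end of decay)
```

The warmup phase matters here: it gives the randomly initialized CTC head time to stop emitting pure blanks before the full learning rate hits, which is consistent with the "be patient" observation above.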