Note: I no longer maintain this repo. If you encounter any problems, please look at similar reported issues in the fairseq repo.
This is a wrapper around the wav2vec 2.0 framework, which aims to build accurate speech recognition models with a small amount of transcribed data (e.g. 1 hour).
Transfer learning is still the main technique:
The more transcribed data you have, the better the model. Prepare at least 1 hour if you have a large amount of unlabeled audio; otherwise, at least 50 hours is recommended.
This should include both well-written text and conversational text, which can easily be collected from news and forum websites. At least 1 GB of text is recommended.
This is optional but very important. A good amount of unlabeled audio (e.g. 500 hours) will significantly reduce the amount of labeled data needed and also boost model performance. YouTube and podcasts are great places to collect data for your own language.
Please follow this instruction
Collect unlabeled audio files and put them all together in a single directory. Audio format requirements:\
Format: wav, PCM 16 bit, single channel\
Sampling rate: 16000 Hz\
Length: 5 to 30 seconds\
Content: silence should be removed from the audio, and each audio file should contain only one speaker.\
Please look at the examples/unlabel_audio directory for reference.
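The format requirements above can be checked automatically before pre-training. The following is a minimal sketch using only Python's standard `wave` module; the function name `check_wav` and its parameters are hypothetical and not part of this repo. It cannot detect silence or multiple speakers, so those still need to be checked by other means.

```python
import wave


def check_wav(path, min_sec=5.0, max_sec=30.0):
    """Return a list of problems found in one wav file (empty list means OK).

    Hypothetical helper, not part of this repo. It only checks what the
    wav header exposes: channels, sample width, sampling rate and length.
    """
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 1:
            problems.append("not single channel")
        if wav.getsampwidth() != 2:  # 2 bytes per sample = 16 bit
            problems.append("not 16-bit PCM")
        if wav.getframerate() != 16000:
            problems.append("sampling rate is not 16000")
        duration = wav.getnframes() / wav.getframerate()
        if not min_sec <= duration <= max_sec:
            problems.append(
                f"duration {duration:.1f}s outside {min_sec}-{max_sec}s"
            )
    return problems
```

Running `check_wav` over every file in the audio directory and printing the non-empty results gives a quick list of files to fix or drop.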
Instead of training from scratch, we download and use the English wav2vec model for weight initialization. This practice can be applied to all languages.
wget https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_small.pt
python3 pretrain.py --fairseq_path path/to/libs/fairseq --audio_path path/to/audio_directory --init_model path/to/wav2vec_small.pt
Where:
Logs and checkpoints will be stored in the outputs directory.\
Log file path: outputs/date_time/exp_id/hydra_train.log. Check the loss value to decide when to stop training.\
Best checkpoint path: outputs/date_time/exp_id/checkpoints/checkpoint_best.pt\
In my case, it took about 4 days for the model to converge when training on 100 hours of data using 2 NVIDIA Tesla V100 GPUs.
-- Transcript file ---\
One training sample per line, in the format "audio_absolute_path \tab transcript"\
Example of a transcript file:
/path/to/1.wav AND IT WAS A MATTER OF COURSE THAT IN THE MIDDLE AGES WHEN THE CRAFTSMEN
/path/to/2.wav AND WAS IN FACT THE KIND OF LETTER USED IN THE MANY SPLENDID MISSALS PSALTERS PRODUCED BY PRINTING IN THE FIFTEENTH CENTURY
/path/to/3.wav JOHN OF SPIRES AND HIS BROTHER VINDELIN FOLLOWED BY NICHOLAS JENSON BEGAN TO PRINT IN THAT CITY
/path/to/4.wav BEING THIN TOUGH AND OPAQUE
Some notes on transcript file:
-- Labeled audio file ---\
Format: wav, PCM 16 bit, single channel, sampling rate: 16000 Hz.\
Silence should be removed from the audio.\
Each audio file should contain only one speaker.
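Since a malformed transcript line is easy to miss by eye, it can help to validate the file before generating the dictionary. Below is a minimal sketch, assuming the path and transcript are separated by exactly one tab character as described above; `check_transcript_line` is a hypothetical helper, not part of this repo.

```python
import os


def check_transcript_line(line):
    """Split one transcript line into (audio_path, transcript).

    Hypothetical helper, not part of this repo. Raises ValueError when the
    line does not look like "audio_absolute_path<TAB>transcript".
    """
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 2:
        raise ValueError("expected exactly one tab between path and transcript")
    audio_path, transcript = parts
    if not os.path.isabs(audio_path):
        raise ValueError("audio path must be absolute: " + audio_path)
    if not transcript.strip():
        raise ValueError("empty transcript for " + audio_path)
    return audio_path, transcript
```

Calling this on every line of the transcript file, and reporting the line number on failure, catches space-separated or relative-path entries early.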
python3 gen_dict.py --transcript_file path/to/transcript.txt --save_dir path/to/save_dir
The dictionary file will be stored at save_dir/dict.ltr.txt. Use the file for fine-tuning and inference.
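For intuition about what a letter dictionary holds: fairseq's letter dictionaries list one symbol and its count per line, with `|` standing for the word boundary. The sketch below counts symbols that way. It is an illustration under that assumption, not the code of gen_dict.py, and `letter_counts` is a hypothetical name.

```python
from collections import Counter


def letter_counts(transcripts):
    """Count letters across transcripts, using "|" as the word boundary.

    Illustrative only; this is an assumption about the dictionary contents,
    not the implementation used by gen_dict.py.
    """
    counts = Counter()
    for text in transcripts:
        for word in text.split():
            counts.update(word)  # one count per character in the word
            counts["|"] += 1     # one boundary symbol after each word
    return counts.most_common()


for symbol, count in letter_counts(["BEING THIN TOUGH AND OPAQUE"]):
    print(symbol, count)
```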
python3 finetune.py --transcript_file path/to/transcript.txt --pretrain_model path/to/pretrain_checkpoint_best.pt --dict_file path/to/dict.ltr.txt
Where:
Logs and checkpoints will be stored in the outputs directory.\
Log file path: outputs/date_time/exp_id/hydra_train.log. Check the loss value to decide when to stop training.\
Best checkpoint path: outputs/date_time/exp_id/checkpoints/checkpoint_best.pt\
In my case, it took about 12 hours for the model to converge when training on 100 hours of data using 2 NVIDIA Tesla V100 GPUs.
Collect all texts and put them together in a single file.\
Text file format:
Example of a text corpus file for English:
AND IT WAS A MATTER OF COURSE THAT IN THE MIDDLE AGES WHEN THE CRAFTSMEN
AND WAS IN FACT THE KIND OF LETTER USED IN THE MANY SPLENDID MISSALS PSALTERS PRODUCED BY PRINTING IN THE FIFTEENTH CENTURY
JOHN OF SPIRES AND HIS BROTHER VINDELIN FOLLOWED BY NICHOLAS JENSON BEGAN TO PRINT IN THAT CITY
BEING THIN TOUGH AND OPAQUE
...
Example of a text corpus file for Chinese:
每 个 人 都 有 他 的 作 战 策 略 直 到 脸 上 中 了 一 拳
这 是 我 年 轻 时 候 住 的 房 子 。
这 首 歌 使 我 想 起 了 我 年 轻 的 时 候 。
...
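In the Chinese example above, every character is separated by a space. A minimal sketch of producing that layout from raw text follows; `space_separate` is a hypothetical helper, not part of this repo, and it is only meant for character-based scripts (English text should stay word-separated as in the first example).

```python
def space_separate(text):
    """Put a single space between every non-whitespace character.

    Hypothetical helper, not part of this repo.
    """
    return " ".join(ch for ch in text if not ch.isspace())


print(space_separate("这是我年轻时候住的房子。"))
```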
python3 train_lm.py --kenlm_path path/to/libs/kenlm --transcript_file path/to/transcript.txt --additional_file path/to/text_corpus.txt --ngram 3 --output_path ./lm
Where:
The language model and the lexicon file will be stored in output_path.
from stt import Transcriber
transcriber = Transcriber(
    pretrain_model='path/to/pretrain.pt',
    finetune_model='path/to/finetune.pt',
    dictionary='path/to/dict.ltr.txt',
    lm_type='kenlm',
    lm_lexicon='path/to/lm/lexicon.txt',
    lm_model='path/to/lm/lm.bin',
    lm_weight=1.5, word_score=-1, beam_size=50)
hypos = transcriber.transcribe(['path/to/wavs/0_1.wav','path/to/wavs/0_2.wav'])
print(hypos)
Where:
Note: If you are running inference in a Jupyter notebook, add these lines above the inference script:
import sys
sys.argv = ['']
https://github.com/mailong25/self-supervised-speech-recognition/tree/vietnamese
Paper: wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations: https://arxiv.org/abs/2006.11477\
Source code: https://github.com/pytorch/fairseq/blob/master/examples/wav2vec/README.md
MIT