mailong25 / self-supervised-speech-recognition

speech to text with self-supervised learning based on wav2vec 2.0 framework

-- #36

YYUUUY opened this issue 3 years ago

YYUUUY commented 3 years ago

--

mailong25 commented 3 years ago

Currently working on that. If you want to decode using a transformer LM for English, please do the following:

mkdir trans_LM ; cd trans_LM
wget https://raw.githubusercontent.com/mailong25/self-supervised-speech-recognition/master/examples/lm_librispeech_word_transformer.dict
wget https://dl.fbaipublicfiles.com/wav2letter/sota/2019/lm/lm_librispeech_word_transformer.pt
wget https://raw.githubusercontent.com/mailong25/self-supervised-speech-recognition/master/examples/dict.txt
cd ..
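As a quick sanity check after the downloads, a small helper (my own sketch, not part of the repository) can confirm that all three LM files actually landed in `trans_LM`:

```python
import os

# Hypothetical helper, not part of the repository: the three file names
# come from the wget commands above.
EXPECTED_FILES = [
    'lm_librispeech_word_transformer.dict',
    'lm_librispeech_word_transformer.pt',
    'dict.txt',
]

def missing_lm_files(lm_dir):
    """Return the expected LM file names that are absent from lm_dir."""
    return [name for name in EXPECTED_FILES
            if not os.path.isfile(os.path.join(lm_dir, name))]

missing = missing_lm_files('trans_LM')
if missing:
    print('Missing LM files:', missing)
```

If anything is listed as missing, re-run the corresponding wget command before attempting decoding.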

Then run the inference as follows:

from stt import Transcriber
transcriber = Transcriber(pretrain_model = 'path/to/pretrain.pt', finetune_model = 'path/to/finetune.pt', 
                          dictionary = 'path/to/dict.ltr.txt',
                          lm_type = 'fairseqlm',
                          lm_lexicon = 'path/to/trans_LM/lm_librispeech_word_transformer.dict',
                          lm_model = 'path/to/trans_LM/lm_librispeech_word_transformer.pt',
                          lm_weight = 1.5, word_score = -1, beam_size = 50)
hypos = transcriber.transcribe(['path/to/wavs/0_1.wav','path/to/wavs/0_2.wav'])
print(hypos)
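`transcribe` takes a list of wav paths, so if you have a whole folder of recordings, a small glob helper (a sketch of mine, not part of the stt API) can build that list:

```python
import glob
import os

def collect_wavs(folder):
    """Return sorted paths of all .wav files in folder.

    Hypothetical convenience helper; the stt module itself only
    expects a plain list of file paths.
    """
    return sorted(glob.glob(os.path.join(folder, '*.wav')))

# Usage with the transcriber above:
# hypos = transcriber.transcribe(collect_wavs('path/to/wavs'))
```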

mailong25 commented 3 years ago

The pre-trained model should be the one with no fine-tuning on labeled data: https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_vox_new.pt

YYUUUY commented 3 years ago

@mailong25 Thank you