microsoft / SpeechT5

Unified-Modal Speech-Text Pre-Training for Spoken Language Processing
MIT License

SpeechLM: How to train 'Phone-unit tokenizer for speech' using kaldi? #23

Closed YWMditto closed 1 year ago

YWMditto commented 1 year ago

Hello, congratulations on your success in this paper!

I would like to ask whether there are any training scripts for the 'Phone-unit tokenizer for speech' part, which uses a Kaldi recipe to "train a hybrid GMM-HMM ASR model on 100 hours labeled LibriSpeech data".

I'm new to speech processing, especially to the traditional HMM models used by Kaldi, so I would be very grateful for an answer.

Thanks a lot!

YWMditto commented 1 year ago

By the way, I noticed a typo in your README.md (see the attached image).

YWMditto commented 1 year ago

Hello again. Your work is great, and it shows the strong understanding ability of pretrained models on both audio and text. To further increase the impact of this work, we want to transfer it to Chinese, so we would like to ask about more training details, mostly concerning the data preprocessing part. We would really appreciate answers to the following questions!

  1. How can we obtain the dict files that are used when preparing the pre-training data?

    # text
    SpeechLM/dataset/LibriLM/phone_unit/bin-idx/dict.ltr.txt
    SpeechLM/dataset/LibriLM/phone_unit/bin-idx/dict.phn.txt
    SpeechLM/dataset/LibriLM/phone_unit/align_lexicon.txt (we noticed that it is downloaded directly, so we wonder how we could generate this file ourselves. Is it derived when training acoustic models with Kaldi?)
    SpeechLM/speechlm/data_process/phoneme_tokenizer/mean5_and_std25_sil14_spn32.dict

    # audio
    SpeechLM/dataset/LibriSpeech/phone_unit/dict.phn.txt



Are these dict files generated when training ASR models with Kaldi, or by some other scripts? Could these scripts be shared?

Thank you very much for your answer! It means a lot to us!

zz12375 commented 1 year ago

Thanks, YWMditto, we have fixed the typo.

1) Regarding "train a hybrid GMM-HMM ASR model on 100 hours labeled LibriSpeech data", I recommend following https://github.com/kaldi-asr/kaldi/blob/master/egs/librispeech/s5/run.sh.

2) Regarding the following files:

Sorry for the late response. I hope this information helps!
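For readers who want to experiment with their own data, here is a minimal sketch of how files of this kind could be produced. It is not the authors' script. It assumes that `align_lexicon.txt` follows the Kaldi layout `<word> <word> <phone> <phone> ...` and that the `dict.*.txt` files use the fairseq `<symbol> <count>` layout; all function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List


def parse_align_lexicon(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each word to its first listed pronunciation.

    Assumed line layout (Kaldi align_lexicon.txt): <word> <word> <phone> ...
    """
    lexicon: Dict[str, List[str]] = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        word, phones = fields[0], fields[2:]
        lexicon.setdefault(word, phones)  # keep only the first pronunciation
    return lexicon


def words_to_phones(sentence: str, lexicon: Dict[str, List[str]]) -> List[str]:
    """Replace each word by its phones; out-of-lexicon words are dropped."""
    phones: List[str] = []
    for word in sentence.split():
        phones.extend(lexicon.get(word, []))
    return phones


def build_dict_lines(sequences: Iterable[List[str]]) -> List[str]:
    """Return '<symbol> <count>' lines, most frequent symbol first."""
    counts = Counter(tok for seq in sequences for tok in seq)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [f"{symbol} {count}" for symbol, count in ordered]


if __name__ == "__main__":
    # Toy lexicon and transcripts, for illustration only.
    lexicon = parse_align_lexicon([
        "HELLO HELLO HH AH L OW",
        "WORLD WORLD W ER L D",
    ])
    transcripts = ["HELLO WORLD", "HELLO"]
    phone_seqs = [words_to_phones(t, lexicon) for t in transcripts]
    print("\n".join(build_dict_lines(phone_seqs)))
```

Writing the returned lines to a file would give a candidate `dict.phn.txt`. Special symbols such as `<unk>` are left out on purpose, since fairseq adds its own when it loads a dictionary.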

YWMditto commented 1 year ago

OK, thanks a lot. I think this will help a great deal. I will follow these suggestions, so please do not close this issue yet in case I have further questions.

Thanks again for your reply!

zz12375 commented 1 year ago

OK, you can also contact me by email if you want. See my GitHub profile.

YWMditto commented 1 year ago

Ok! Thanks very much!

YWMditto commented 1 year ago

Thank you for your patience in answering so many questions. I am going to close this issue now. Best wishes!