Closed: YWMditto closed this issue 1 year ago.
By the way, I have seen one typo in your README.md.
Hello again! Your work is great and shows the strong understanding power of pretrained models in both audio and text. To further increase the impact of this work, we would like to transfer it to the Chinese language. So we would like to ask about more training details, mostly the data preprocessing part. We would greatly appreciate your answers to these questions!
# text
SpeechLM/dataset/LibriLM/phone_unit/bin-idx/dict.ltr.txt
SpeechLM/dataset/LibriLM/phone_unit/bin-idx/dict.phn.txt
SpeechLM/dataset/LibriLM/phone_unit/align_lexicon.txt (we notice it is directly downloaded, so we wonder how we could generate this file ourselves. Is it derived when training acoustic models with kaldi?)
SpeechLM/speechlm/data_process/phoneme_tokenizer/mean5_and_std25_sil14_spn32.dict
SpeechLM/dataset/LibriSpeech/phone_unit/dict.phn.txt
Are these dict files generated when training ASR models with kaldi, or by some other scripts? Could those scripts be shared?
Thank you very much for your answers! It means a lot to us!
Thanks, YWMditto, we have fixed the typo.

1) Regarding "train a hybrid GMM-HMM ASR model on 100 hours labeled LibriSpeech data": I recommend you follow https://github.com/kaldi-asr/kaldi/blob/master/egs/librispeech/s5/run.sh.
2) Regarding the following files:

`dict.ltr.txt` / `dict.phn.txt`: these are just all the symbols that appear in your data. For English, `ltr` means letters, and you can simply sort all the symbols of your data to obtain the file. Alternatively, you can use fairseq-preprocess to generate the dict files.
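As a concrete illustration of the "sort all the symbols" suggestion, a fairseq-style dict file (one `<symbol> <count>` per line, most frequent first) can be built from transcripts with a few lines of Python. This is only a sketch; the file names, the letter-level example, and the use of `|` as a word boundary (as in fairseq's wav2vec/HuBERT recipes) are assumptions, not the authors' actual script:

```python
from collections import Counter

def build_dict(transcripts, out_path):
    """Count every symbol (letter or phone) in the transcripts and
    write a fairseq-style dictionary: one '<symbol> <count>' per
    line, most frequent symbol first."""
    counts = Counter()
    for line in transcripts:
        counts.update(line.split())
    with open(out_path, "w", encoding="utf-8") as f:
        for sym, n in counts.most_common():
            f.write(f"{sym} {n}\n")

# Toy letter-level transcripts ("|" marks word boundaries).
lines = ["H E L L O | W O R L D |", "H I |"]
build_dict(lines, "dict.ltr.txt")
```

For phoneme dictionaries the same function applies unchanged; only the input transcripts (phone sequences instead of letter sequences) differ.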
`align_lexicon.txt`: yes, it is obtained with kaldi, but it is generated at the language-preparation stage, not the acoustic-model training stage (also see run.sh). For convenience we just provided it.
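For context, kaldi's language-preparation step (`utils/prepare_lang.sh`) writes this file as `data/lang/phones/align_lexicon.txt`, where, as far as I know, each line repeats the word twice and then lists its phone sequence. Assuming that line format, a hypothetical helper that derives such a file from a plain pronunciation lexicon could look like this (the file names are placeholders):

```python
def lexicon_to_align_lexicon(lexicon_path, out_path):
    """Convert a plain lexicon ('WORD phone1 phone2 ...') into the
    align_lexicon.txt layout, where each word appears twice at the
    start of its line before the phone sequence."""
    with open(lexicon_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            parts = line.split()
            if not parts:
                continue
            word, phones = parts[0], parts[1:]
            fout.write(f"{word} {word} {' '.join(phones)}\n")

# Toy lexicon for illustration.
with open("lexicon.txt", "w", encoding="utf-8") as f:
    f.write("HELLO HH AH L OW\nWORLD W ER L D\n")
lexicon_to_align_lexicon("lexicon.txt", "align_lexicon.txt")
```

In practice, though, running prepare_lang.sh as part of the kaldi recipe is the intended route; the sketch only shows what the file contains.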
`mean5_and_std25_sil14_spn32.dict`: it stores the mean/variance of the duration (how many continuous frames) of each phoneme. You can see that all the values are set to 5/2.5 for each phoneme, except for `sil` and `spn` (as the filename suggests).

`SpeechLM/dataset/LibriSpeech/phone_unit/dict.phn.txt`: the same meaning as `SpeechLM/dataset/LibriLM/phone_unit/bin-idx/dict.phn.txt`; the difference is whether the data are stored as symbols or ids.

Sorry for the late response; I hope this information helps you!
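To make the duration statistics concrete: if you had frame-level phoneme alignments (e.g. from a kaldi forced alignment), the per-phoneme duration mean and standard deviation could be computed roughly as below. The alignment format (one phone label per frame) and the function name are assumptions for illustration, not the repository's actual code:

```python
import statistics
from itertools import groupby

def duration_stats(frame_phones):
    """Given per-utterance frame-level phoneme label sequences,
    collapse consecutive repeats into segments and return, per
    phoneme, the mean and population std-dev of segment lengths."""
    durations = {}
    for utt in frame_phones:
        for phone, run in groupby(utt):
            durations.setdefault(phone, []).append(len(list(run)))
    return {
        p: (statistics.mean(d), statistics.pstdev(d))
        for p, d in durations.items()
    }

# Two toy utterances of frame-level phone labels.
utts = [["AH", "AH", "AH", "T", "T"], ["AH", "AH", "T", "T", "T"]]
stats = duration_stats(utts)
```

A dict file like `mean5_and_std25_sil14_spn32.dict` would then just serialize these per-phoneme statistics, with fixed values substituted for most phonemes.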
OK, thanks a lot! I think these will help a lot. I will follow these suggestions, so please do not close this issue yet, in case I have some other questions.
Thanks again for your reply!
OK. You can also contact me by email if you want; see my GitHub profile.
Ok! Thanks very much!
Thank you for your patience in answering so many questions; I am going to close this issue now. Best wishes!
Hello, congratulations on the success of this paper!
I want to ask whether there are any training scripts for the 'Phone-unit tokenizer for speech' part, which uses a kaldi recipe to "train a hybrid GMM-HMM ASR model on 100 hours labeled LibriSpeech data".
I'm new to speech processing, especially to the traditional HMM models used by kaldi, so I would be very thankful for your answer.
Thanks a lot!