Closed: YWMditto closed this issue 1 year ago.
By the way, I have seen one typo in your README.md.
Hello again! Your work is great and shows the strong understanding power of pretrained models in both audio and text. To further increase the impact of this work, we would like to transfer it to the Chinese language. So we would like to ask about more training details, mostly the data preprocessing part. We would greatly appreciate your answers to these questions!
# text
SpeechLM/dataset/LibriLM/phone_unit/bin-idx/dict.ltr.txt
SpeechLM/dataset/LibriLM/phone_unit/bin-idx/dict.phn.txt
SpeechLM/dataset/LibriLM/phone_unit/align_lexicon.txt (we notice it is directly downloaded, so we wonder how we could generate this file ourselves. Is it derived when training acoustic models with kaldi?)
SpeechLM/speechlm/data_process/phoneme_tokenizer/mean5_and_std25_sil14_spn32.dict
SpeechLM/dataset/LibriSpeech/phone_unit/dict.phn.txt
Are these dict files generated when training ASR models with kaldi, or by some other scripts? Could those scripts be shared?
Thank you very much for your answers! It means a lot to us!
Thanks, YWMditto, we have fixed the typo.

1) Regarding "train a hybrid GMM-HMM ASR model on 100 hours labeled LibriSpeech data": I recommend you follow https://github.com/kaldi-asr/kaldi/blob/master/egs/librispeech/s5/run.sh.
2) Regarding the following files:

`dict.ltr.txt` / `dict.phn.txt`: these are just all the symbols that appear in your data. For English, `ltr` means letters, and you can simply sort all the symbols of your data to obtain the file. Alternatively, you can use fairseq-preprocess to generate the dict files.
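As a concrete illustration of the "sort all the symbols" suggestion, a fairseq-style dict file (one `<symbol> <count>` per line, most frequent first) can be built from transcripts with a few lines of Python. This is only a sketch; the file names, the letter-level example, and the use of `|` as a word boundary (as in fairseq's wav2vec/HuBERT recipes) are assumptions, not the authors' actual script:

```python
from collections import Counter

def build_dict(transcripts, out_path):
    """Count every symbol (letter or phone) in the transcripts and
    write a fairseq-style dictionary: one '<symbol> <count>' per
    line, most frequent symbol first."""
    counts = Counter()
    for line in transcripts:
        counts.update(line.split())
    with open(out_path, "w", encoding="utf-8") as f:
        for sym, n in counts.most_common():
            f.write(f"{sym} {n}\n")

# Toy letter-level transcripts ("|" marks word boundaries).
lines = ["H E L L O | W O R L D |", "H I |"]
build_dict(lines, "dict.ltr.txt")
```

For phoneme dictionaries the same function applies unchanged; only the input transcripts (phone sequences instead of letter sequences) differ.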
`align_lexicon.txt`: yes, it is obtained with kaldi, but it is generated at the language-preparation stage, not the acoustic-model training stage (also see run.sh). For convenience we just provided it.
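For context, kaldi's language-preparation step (`utils/prepare_lang.sh`) writes this file as `data/lang/phones/align_lexicon.txt`, where, as far as I know, each line repeats the word twice and then lists its phone sequence. Assuming that line format, a hypothetical helper that derives such a file from a plain pronunciation lexicon could look like this (the file names are placeholders):

```python
def lexicon_to_align_lexicon(lexicon_path, out_path):
    """Convert a plain lexicon ('WORD phone1 phone2 ...') into the
    align_lexicon.txt layout, where each word appears twice at the
    start of its line before the phone sequence."""
    with open(lexicon_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            parts = line.split()
            if not parts:
                continue
            word, phones = parts[0], parts[1:]
            fout.write(f"{word} {word} {' '.join(phones)}\n")

# Toy lexicon for illustration.
with open("lexicon.txt", "w", encoding="utf-8") as f:
    f.write("HELLO HH AH L OW\nWORLD W ER L D\n")
lexicon_to_align_lexicon("lexicon.txt", "align_lexicon.txt")
```

In practice, though, running prepare_lang.sh as part of the kaldi recipe is the intended route; the sketch only shows what the file contains.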
`mean5_and_std25_sil14_spn32.dict`: it stores the mean/variance of the duration (how many continuous frames) of each phoneme. You can see that all the values are set to 5/2.5 for each phoneme, except for `sil` and `spn` (as the filename suggests).

`SpeechLM/dataset/LibriSpeech/phone_unit/dict.phn.txt`: the same meaning as `SpeechLM/dataset/LibriLM/phone_unit/bin-idx/dict.phn.txt`; the difference is whether the data are stored as symbols or ids.

Sorry for the late response; I hope this information helps you!
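To make the duration statistics concrete: if you had frame-level phoneme alignments (e.g. from a kaldi forced alignment), the per-phoneme duration mean and standard deviation could be computed roughly as below. The alignment format (one phone label per frame) and the function name are assumptions for illustration, not the repository's actual code:

```python
import statistics
from itertools import groupby

def duration_stats(frame_phones):
    """Given per-utterance frame-level phoneme label sequences,
    collapse consecutive repeats into segments and return, per
    phoneme, the mean and population std-dev of segment lengths."""
    durations = {}
    for utt in frame_phones:
        for phone, run in groupby(utt):
            durations.setdefault(phone, []).append(len(list(run)))
    return {
        p: (statistics.mean(d), statistics.pstdev(d))
        for p, d in durations.items()
    }

# Two toy utterances of frame-level phone labels.
utts = [["AH", "AH", "AH", "T", "T"], ["AH", "AH", "T", "T", "T"]]
stats = duration_stats(utts)
```

A dict file like `mean5_and_std25_sil14_spn32.dict` would then just serialize these per-phoneme statistics, with fixed values substituted for most phonemes.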
OK, thanks a lot! I think these will help a lot. I will follow these suggestions, so please do not close this issue yet, in case I have some other questions.
Thanks again for your reply!
OK. You can also contact me by email if you want; see my GitHub profile.
Ok! Thanks very much!
Thank you for your patience in answering so many questions; I am going to close this issue now. Best wishes!
Hello, congratulations on the success of this paper!
I want to ask whether there are any training scripts for the 'Phone-unit tokenizer for speech' part, which uses a kaldi recipe to "train a hybrid GMM-HMM ASR model on 100 hours labeled LibriSpeech data".
I'm new to speech processing, especially to the traditional HMM models used by kaldi, so I would be very thankful for your answer.
Thanks a lot!