microsoft / SpeechT5

Unified-Modal Speech-Text Pre-Training for Spoken Language Processing
MIT License

[SpeechLM] About phoneme tokenizer in detail? #40

Closed yuseungwoo closed 1 year ago

yuseungwoo commented 1 year ago

First of all, thanks for your great work and code.

I am studying SpeechLM and have a couple of questions about training and inference.

  1. Could you tell me which stage you used to train the phoneme tokenizer? Is it the one at #L155 below, as I expect? [https://github.com/kaldi-asr/kaldi/blob/master/egs/librispeech/s5/run.sh#L155]

  2. Could you tell me which decoder is used for pseudo-label generation, and share your command?
    Is it steps/decode_fmllr.sh, or online2-wav-gmm-latgen-faster directly?

Best Regards

zz12375 commented 1 year ago

Sorry for the late response.

  1. Yes, as you expected. We trained two phoneme tokenizers in our paper: a GMM-HMM model using 100 hours of data for the Base setting, and a DNN-HMM model using 960 hours of data for the Large setting. The GMM-HMM model is exactly 'tri4b' (obtained after stage 13). The DNN-HMM model is exactly the chain model obtained after running the whole script (after the last stage).

  2. We use steps/decode_fmllr.sh for the GMM-HMM model; a rough sketch of the decoding and label-extraction steps is below.
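
For reference, here is a minimal, hypothetical sketch of how frame-level phoneme pseudo-labels could be produced with the LibriSpeech tri4b model. The data directories, job counts, graph names, and options are assumptions based on the standard egs/librispeech/s5 recipe, not the exact command used in the paper.

```bash
# Sketch only: decode unlabeled audio with the tri4b GMM-HMM model and turn the
# best lattice paths into per-frame phone labels. Paths/options are illustrative.

# Build the decoding graph and run fMLLR decoding on the target data.
utils/mkgraph.sh data/lang_test_tgsmall exp/tri4b exp/tri4b/graph_tgsmall
steps/decode_fmllr.sh --nj 40 --cmd "$decode_cmd" \
  exp/tri4b/graph_tgsmall data/train_clean_100 exp/tri4b/decode_train_clean_100

# Convert the best path of each lattice into per-frame phone labels.
dir=exp/tri4b/decode_train_clean_100
for lat in $dir/lat.*.gz; do
  n=$(basename "$lat" .gz | sed 's/^lat\.//')
  lattice-best-path --acoustic-scale=0.1 \
    "ark:gunzip -c $lat |" ark:/dev/null "ark:$dir/ali.$n"
  ali-to-phones --per-frame=true exp/tri4b/final.mdl \
    "ark:$dir/ali.$n" "ark,t:$dir/phones.$n.txt"
done
```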