microsoft / SpeechT5

Unified-Modal Speech-Text Pre-Training for Spoken Language Processing
MIT License

SpeechLM #46

Closed blueblue-bubble closed 1 year ago

blueblue-bubble commented 1 year ago

Hello, thanks for your great work. I have a question, though. I noticed that there is a model named Fast Text2Unit Model in the SpeechLM project, but I couldn't find any usage instructions for it. Is this model used to transform the text transcribed from speech into units?

blueblue-bubble commented 1 year ago

What I mean is: is the Fast Text2Unit Model the HMM model from the Kaldi recipe, used to decode the unpaired speech and get the aligned phonemes from the lattice?

zz12375 commented 1 year ago

Hi @blueblue-bubble, the Fast Text2Unit Model is used for text-to-hidden-unit transformation. It is modified from FastSpeech (a non-autoregressive TTS model). The Fast Text2Unit Model is the so-called "Hidden-unit tokenizer for text" in the paper (see the appendix).

Note that the Kaldi HMM model is not provided in this repo; to obtain one, you can follow the Kaldi recipe.
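
For readers unfamiliar with FastSpeech-style models, the sketch below illustrates the length-regulation idea that makes them non-autoregressive: each input token is repeated according to a duration so that the sequence reaches the frame rate of the target. It is only an illustration, not the code from this repo. The function name `length_regulate`, the phoneme symbols, and the durations are all hypothetical; in a real model the durations come from a learned duration predictor and a decoder then maps the expanded sequence to hidden-unit IDs.

```python
from typing import List, Sequence


def length_regulate(phonemes: Sequence[str], durations: Sequence[int]) -> List[str]:
    """Repeat each phoneme by its duration (in frames), FastSpeech-style.

    Hypothetical helper for illustration; durations are given by hand here,
    whereas a real model would predict them.
    """
    if len(phonemes) != len(durations):
        raise ValueError("phonemes and durations must have the same length")
    expanded: List[str] = []
    for phoneme, duration in zip(phonemes, durations):
        if duration < 0:
            raise ValueError("durations must be non-negative")
        # Upsample: one copy of the phoneme per output frame.
        expanded.extend([phoneme] * duration)
    return expanded


# Made-up example input: four phonemes with hand-picked frame counts.
phonemes = ["HH", "AH", "L", "OW"]
durations = [2, 3, 1, 4]
frames = length_regulate(phonemes, durations)
print(len(frames), frames)
```

Because all durations are known up front, the whole frame-rate sequence can be produced in one pass rather than token by token, which is what "non-autoregressive" refers to above.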