microsoft / SpeechT5

Unified-Modal Speech-Text Pre-Training for Spoken Language Processing
MIT License

SpeechLM: How to resample phonemes' frame rate from 30ms to 20ms? #18

Closed: Arrivederci closed this issue 1 year ago

Arrivederci commented 1 year ago

Hi, thank you for your great work. According to the appendix of the paper, a Kaldi model is used to convert audio into phonemes. I have trained a Kaldi model with a frame rate of 30ms. To generate the SpeechLM Base labels (10ms), I simply repeat each phoneme 3 times, which works fine. But the SpeechLM Large labels (20ms) cannot be generated by simply repeating phonemes. Could you provide some details about this conversion?

zz12375 commented 1 year ago

Actually, the frame rate of the model is 20ms for both the base and the large models (it is determined by the CNN module). When you feed 10ms labels and set model.label_rate=100, the code automatically downsamples them to 20ms. So setting model.label_rate=100 with 10ms labels is equivalent to setting model.label_rate=50 with 20ms labels.
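To make the equivalence concrete, here is a minimal sketch of index-based label subsampling. It is an illustration only, not the repository's actual code: the function name `align_labels_to_frames` is hypothetical, the 50 frames per second default simply restates the 20ms frame rate mentioned above, and restricting `label_rate` to multiples of the frame rate is a simplifying assumption.

```python
def align_labels_to_frames(labels, label_rate, frame_rate=50):
    """Pick one label per model frame by index-based subsampling.

    Hypothetical helper for illustration; it assumes label_rate is an
    integer multiple of frame_rate (50 frames/s corresponds to 20ms frames).
    """
    if label_rate % frame_rate != 0:
        raise ValueError("this sketch only handles integer multiples of frame_rate")
    step = label_rate // frame_rate      # 2 for 10ms labels, 1 for 20ms labels
    num_frames = len(labels) // step     # number of complete 20ms frames covered
    return [labels[i * step] for i in range(num_frames)]


# The same 60ms of audio, labelled at 10ms and at 20ms resolution.
labels_10ms = ["a", "a", "b", "b", "b", "b"]
labels_20ms = ["a", "b", "b"]

from_10ms = align_labels_to_frames(labels_10ms, label_rate=100)
from_20ms = align_labels_to_frames(labels_20ms, label_rate=50)
print(from_10ms, from_20ms)
```

Under these assumptions both calls yield one label per 20ms frame, which is why the two settings are interchangeable.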

As for the 30ms labels, since label_rate must be an integer (30ms labels would correspond to a non-integer rate of about 33.3 labels per second), you can either 1) repeat each phoneme 3 times and set model.label_rate=100, or 2) repeat each phoneme 3 times, then manually downsample the result by a factor of 2 and set model.label_rate=50. Both options should work for the base and large models. We used the second option for our large model because it produces smaller label files.
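The two options can be sketched in a few lines of Python. This is an assumption-laden illustration rather than the script we actually used: the helper names `repeat_labels` and `downsample_labels` are hypothetical, the phoneme symbols are made up, and keeping every second label starting from index 0 is one possible choice of downsampling.

```python
def repeat_labels(phonemes, times=3):
    """Expand 30ms labels to 10ms labels by repeating each one `times` times."""
    return [p for p in phonemes for _ in range(times)]


def downsample_labels(labels, factor=2):
    """Keep every `factor`-th label, starting from the first (assumed choice)."""
    return labels[::factor]


# Hypothetical 30ms phoneme sequence covering 90ms of audio.
phonemes_30ms = ["sil", "AH", "B"]

# Option 1: 10ms labels, to be used with model.label_rate=100.
labels_10ms = repeat_labels(phonemes_30ms)

# Option 2: 20ms labels, to be used with model.label_rate=50.
labels_20ms = downsample_labels(labels_10ms)

print(labels_10ms)
print(labels_20ms)
```

Option 2 stores roughly half as many labels per utterance, which is where the smaller files come from.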

Hope this information helps.

Arrivederci commented 1 year ago

That's very helpful, thank you 😊

zz12375 commented 1 year ago

You are welcome.