Actually, the frame rate of the model is 20ms for both the base and the large models (it is determined by the CNN module). When you feed 10ms labels and set model.label_rate=100, the code automatically downsamples them to 20ms. So setting model.label_rate=100 with 10ms labels is equivalent to setting model.label_rate=50 with 20ms labels.
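To illustrate the equivalence, here is a minimal sketch of the alignment arithmetic. It is not the repository's actual code: the helper name align_to_frames is hypothetical, and it assumes the downsampling simply keeps every second label, which may differ from what the real implementation does.

```python
from typing import List

# One model frame every 20 ms, i.e. 50 frames per second (from the reply above).
MODEL_FRAME_RATE = 50


def align_to_frames(labels: List[str], label_rate: int,
                    frame_rate: int = MODEL_FRAME_RATE) -> List[str]:
    """Hypothetical helper: keep one label per model frame by striding.

    Assumes label_rate is an integer multiple of frame_rate.
    """
    if label_rate % frame_rate != 0:
        raise ValueError("label_rate must be a multiple of frame_rate")
    step = label_rate // frame_rate  # 2 for 10ms labels, 1 for 20ms labels
    return labels[::step]


# 80 ms of audio labelled at 10 ms (8 labels) and at 20 ms (4 labels).
labels_10ms = ["a", "a", "b", "b", "b", "b", "c", "c"]
labels_20ms = ["a", "b", "b", "c"]

frames_from_10ms = align_to_frames(labels_10ms, label_rate=100)
frames_from_20ms = align_to_frames(labels_20ms, label_rate=50)

# Both configurations end up with the same frame-level targets.
print(frames_from_10ms == frames_from_20ms)
```

Under these assumptions, both settings give the model one label per 20ms frame.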
As for the 30ms labels, since label_rate must be an integer, you can either 1) repeat each phoneme 3 times and set model.label_rate=100, or 2) repeat each phoneme 3 times, then manually downsample the result by a factor of 2 and set model.label_rate=50. Both options should work for the base and the large models. We use the second option for our large model since it produces smaller label files.
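The two options can be sketched as follows. This is only an illustration: the function names and the example phoneme sequence are made up, and the manual downsampling is assumed to mean keeping every second label.

```python
from typing import List


def repeat_labels(labels: List[str], times: int) -> List[str]:
    """Repeat each label `times` times, e.g. 30ms labels -> 10ms labels for times=3."""
    return [label for label in labels for _ in range(times)]


def downsample_labels(labels: List[str], factor: int) -> List[str]:
    """Keep every `factor`-th label, e.g. 10ms labels -> 20ms labels for factor=2."""
    return labels[::factor]


# Hypothetical 30ms phoneme labels covering 120 ms of audio.
labels_30ms = ["HH", "AH", "L", "OW"]

# Option 1: 10ms labels, to be used with model.label_rate=100.
labels_10ms = repeat_labels(labels_30ms, 3)

# Option 2: 20ms labels, to be used with model.label_rate=50.
labels_20ms = downsample_labels(labels_10ms, 2)

print(" ".join(labels_10ms))
print(" ".join(labels_20ms))
```

Option 2 stores half as many labels per utterance, which is why the resulting files are smaller.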
Hope this information helps.
That's very helpful, thank you 😊
You are welcome.
Hi, thank you for your great work. According to the appendix of the paper, a Kaldi model is used to convert audio into phonemes. I have trained a Kaldi model with a frame rate of 30ms. To generate the SpeechLM Base labels (10ms), I just repeat each phoneme 3 times, and that works fine. But the SpeechLM Large labels (20ms) cannot be generated simply by repeating phonemes. Could you provide some details about this conversion?