shivammehta25 / Matcha-TTS

[ICASSP 2024] 🍵 Matcha-TTS: A fast TTS architecture with conditional flow matching
https://shivammehta25.github.io/Matcha-TTS/
MIT License

debug help: training with pre-calculated phoneme durations #106

Closed dbkest closed 1 month ago

dbkest commented 1 month ago

Hello, I would like to ask a few questions about my experiment. I followed the LJSpeech training steps exactly, with the only modification being that I replaced LJSpeech with my own English data, and this produced high-quality speech. Building on this, I ran a second experiment in which I used pre-calculated phoneme durations instead of obtaining them from MAS. The generated audio spectrograms look good, with clear harmonic structures, but it is difficult to understand what is being said. Could you give me some debugging directions? Have you run this experiment before? I have already done some troubleshooting:

I trained FastSpeech2 on the same data, which rules out alignment issues. I also printed the data loaded by the Matcha-TTS dataloader and confirmed that the phonemes and durations were aligned. Compared to FastSpeech2, the main difference in Matcha-TTS, aside from the decoder, is that it uses a prior loss. I am currently retraining with this loss turned off, but that run has only just begun.
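A quick sanity check along these lines is to verify, for every utterance, that the number of durations matches the number of phonemes and that the durations sum exactly to the mel length. This is a minimal sketch; `check_alignment` and the argument layout are illustrative, not the actual Matcha-TTS dataloader API:

```python
def check_alignment(phoneme_ids, durations, mel_frames):
    """Verify that per-phoneme durations (in mel frames) cover the mel exactly."""
    # One duration per phoneme token.
    assert len(phoneme_ids) == len(durations), (
        f"{len(phoneme_ids)} phonemes vs {len(durations)} durations"
    )
    # Durations must tile the mel spectrogram with no gap or overlap.
    total = sum(durations)
    assert total == mel_frames, (
        f"durations sum to {total} frames but mel has {mel_frames}"
    )

# Example: 3 phonemes whose durations (3 + 7 + 4) cover a 14-frame mel.
check_alignment([12, 5, 33], [3, 7, 4], 14)
```

If the front end inserts extra tokens (such as blanks) after the durations were computed, the first assertion fires immediately, which makes this an inexpensive check to run over the whole dataset.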

Thank you very much!

Generated audio from the model trained with pre-calculated phoneme durations. Text: "You're a chef. You're a young chef, yes?"

[Screenshot 2024-10-18: spectrogram of the generated audio]

The loss curves from the experiment are as follows, with the blue line representing the experiment using MAS and the gray line representing the experiment using pre-calculated durations.

[Screenshots 2024-10-18: loss curves from both experiments]
shivammehta25 commented 1 month ago

Sorry for the delayed response, I was away from the keyboard. I am glad it is resolved; I just want to confirm that you are aware of the extra blank-token addition.

dbkest commented 1 month ago

> Sorry for the delayed response, I was away from the keyboard. I am glad it is resolved; I just want to confirm that you are aware of the extra blank-token addition.

So sorry for forgetting to reply. After reading the literature you recommended, I now have some understanding of the blank state. Its function seems to be that, for the same phoneme, it can absorb variations in pronunciation, such as differences in duration. In theory, prosody should be better than with external forced-alignment methods like MFA, because forced alignment has no additional states to absorb different pronunciations of the same phoneme.
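For reference, the blank-token addition discussed above typically works by interspersing a blank ID between every phoneme (and at both ends), as in VITS-style front ends. A minimal sketch, assuming blank ID 0 (the function name and blank ID are illustrative, not necessarily Matcha-TTS's exact code):

```python
def intersperse(seq, item=0):
    # Insert `item` between every pair of tokens and at both ends,
    # e.g. [a, b, c] -> [0, a, 0, b, 0, c, 0].
    result = [item] * (len(seq) * 2 + 1)
    result[1::2] = seq  # place original tokens at the odd indices
    return result

print(intersperse([5, 7, 9]))  # → [0, 5, 0, 7, 0, 9, 0]
```

One practical implication: if blanks are interspersed into the phoneme sequence at training time but the pre-calculated durations were computed on the sequence without blanks, the token and duration sequences no longer line up, which could plausibly produce spectrograms that look clean but sound unintelligible.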