shivammehta25 / Matcha-TTS

[ICASSP 2024] 🍵 Matcha-TTS: A fast TTS architecture with conditional flow matching
https://shivammehta25.github.io/Matcha-TTS/
MIT License

some question about prior_loss #96

Open UestcJay opened 2 months ago

UestcJay commented 2 months ago

Thanks for your great work. Recently I have been using the hidden_state output of a large language model as the input to the Matcha-TTS encoder for training. I have overfit on a single sample for tens of thousands of steps, but the loss is still very large; in particular, the prior_loss stays between 1 and 2. Is there a solution to this problem?

UestcJay commented 2 months ago

What do you mean?

shivammehta25 commented 2 months ago

Is your LLM frozen or are you training any aspect of it?

UestcJay commented 2 months ago

Yes, I froze my LLM. I noticed that in your setup the input text is first converted into a phoneme sequence with the phonemizer library before being fed to the synthesis model, whereas I feed the hidden_state output of the LLM directly. Between the two, is the former easier to train? Have you ever tried using discretized text directly as input to the model?
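For context, the usual way to bridge the two representations is a small trainable projection from the LLM's hidden size down to the encoder's channel size. The sketch below is a hypothetical illustration, not code from this repo; the dimensions (4096 for the LLM, 192 for the Matcha-TTS encoder default) and the function name `project` are assumptions, and in practice this would be a trained `nn.Linear` rather than a random NumPy matrix:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: llm_dim is the frozen LLM's hidden size,
# enc_dim is the Matcha-TTS encoder channel count (192 by default).
llm_dim, enc_dim = 4096, 192

# A single trainable linear projection bridging the two spaces.
W = rng.standard_normal((llm_dim, enc_dim)) / np.sqrt(llm_dim)
b = np.zeros(enc_dim)

def project(hidden_states):
    """Map (seq_len, llm_dim) LLM hidden states to (seq_len, enc_dim)."""
    return hidden_states @ W + b

h = rng.standard_normal((10, llm_dim))  # 10 token states from the LLM
x = project(h)
print(x.shape)  # (10, 192)
```

Whether such a projection is learned jointly or the encoder is resized to accept the LLM width directly is a design choice; either way the encoder then sees a continuous representation instead of phoneme IDs.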

shivammehta25 commented 2 months ago

I think that since your input is not raw text but a representation that should already capture the hidden nuances of phonemization, it should be fine. The mapping is definitely easier when the input is phonetised, but the model should still be able to learn. I am actually not sure why the prior loss is so high. Did you try listening to the outputs of the model? Is it utter garbage? (The prior loss, being MSE-based, can be a bit high sometimes.)
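One point worth making explicit: in Glow-TTS-style models the "prior loss" is the negative log-likelihood of the target mel frames under a unit-variance Gaussian whose mean is the encoder output, so it contains a constant 0.5·log(2π) ≈ 0.92 term per dimension and does not go to zero even at a perfect fit. The sketch below is a minimal NumPy approximation of that loss (masking padded frames), not the repo's exact implementation:

```python
import numpy as np

def prior_loss(mu, y, lengths):
    """Gaussian-prior NLL of mel frames y given encoder means mu.

    mu, y: (batch, n_feats, max_frames); lengths: valid frames per item.
    A perfect prediction (mu == y) still yields 0.5 * log(2*pi) ~ 0.92,
    so a prior loss of 1-2 is not necessarily a sign of failure.
    """
    batch, n_feats, max_t = y.shape
    # Boolean mask over the time axis so padded frames contribute nothing.
    mask = np.arange(max_t)[None, :] < np.asarray(lengths)[:, None]
    mask = mask[:, None, :]  # broadcast over the feature axis
    nll = 0.5 * ((y - mu) ** 2 + np.log(2 * np.pi)) * mask
    # Average over all valid (frame, feature) entries.
    return nll.sum() / (mask.sum() * n_feats)

mu = np.zeros((1, 80, 100))
y = np.zeros((1, 80, 100))
print(prior_loss(mu, y, [100]))  # ~0.9189, the irreducible constant
```

So values slightly above 1 can simply mean the squared error per dimension is modest but nonzero.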

UestcJay commented 2 months ago

Thanks for such a quick reply! I generated inference results from the model. The ground truth is: "The Poveys ate all the fish they could and sometimes more than they enjoyed, because on his sober days Hollins invariably started his round at the shop, and Constance had to buy for Maggie's sake." This example is from the training set, and the model's output does not seem fully fitted. With the same data, I trained Matcha-TTS on the phonetised text and it fits. I also tried increasing the number of training epochs, but the gains were very small. target.wav_and_model_output.wav.zip

shivammehta25 commented 2 months ago

Then I would have to believe that the hidden representations might not capture what is required to synthesise speech. I am not sure what an easy fix would be; perhaps train some part of the output embeddings using LoRA?
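To make the LoRA suggestion concrete: the idea is to keep the LLM's weight matrix frozen and learn only a low-rank additive update, so the hidden states can adapt toward what the TTS encoder needs at a tiny parameter cost. The sketch below is a generic, hedged illustration of the LoRA forward pass in NumPy (dimensions, `rank`, and `alpha` are assumed, not taken from this repo):

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out, rank = 4096, 4096, 8  # hypothetical sizes; rank << d

# Frozen base weight, e.g. the LLM's output projection.
W0 = rng.standard_normal((d_in, d_out)) / np.sqrt(d_in)

# Trainable low-rank factors. B starts at zero so the adapted layer
# initially behaves exactly like the frozen one.
A = rng.standard_normal((d_in, rank)) / np.sqrt(d_in)
B = np.zeros((rank, d_out))

def lora_forward(x, alpha=16.0):
    """y = x W0 + (alpha / rank) * x A B; only A and B would be trained."""
    return x @ W0 + (alpha / rank) * (x @ A) @ B

x = rng.standard_normal((2, d_in))
# With B = 0 the adapter is a no-op on top of the frozen layer.
assert np.allclose(lora_forward(x), x @ W0)
```

Training only A and B (2 * 4096 * 8 parameters here, versus 4096² for the full matrix) lets the frozen LLM's final-layer representations shift without full fine-tuning.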