Open UestcJay opened 2 months ago
what do you mean..
Is your LLM frozen or are you training any aspect of it?
Yes, I froze my LLM. I noticed that your input text is first converted into a phoneme sequence through the phonemizer library before being fed to the speech synthesis model, whereas I directly use the hidden_state output by the LLM as input. Between the two, is the former easier to train? Have you ever tried discretizing text directly as input to the model?
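The two conditioning pipelines being contrasted here could be sketched roughly as below. All sizes (`n_phonemes`, `emb_dim`, `llm_dim`) and the random projection are illustrative assumptions, not Matcha-TTS's or any particular LLM's actual values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pipeline A (the repo's setup): text -> phoneme IDs -> learned embedding table.
# Vocabulary size and embedding width are made up for illustration.
n_phonemes, emb_dim = 100, 192
phoneme_embedding = rng.standard_normal((n_phonemes, emb_dim))
phoneme_ids = np.array([12, 47, 3, 85])          # a 4-phoneme utterance
enc_in_a = phoneme_embedding[phoneme_ids]        # shape (4, 192)

# Pipeline B (the issue's setup): frozen-LLM hidden states, projected down to
# the encoder width. llm_dim = 4096 is a typical LLM hidden size, assumed here.
llm_dim = 4096
llm_hidden = rng.standard_normal((4, llm_dim))   # one hidden state per token
proj = rng.standard_normal((llm_dim, emb_dim)) / np.sqrt(llm_dim)
enc_in_b = llm_hidden @ proj                     # shape (4, 192)

print(enc_in_a.shape, enc_in_b.shape)
```

Either way the encoder sees a `(sequence, emb_dim)` tensor; the difference is that pipeline A's inputs are a small discrete alphabet the model can memorise, while pipeline B's are continuous vectors whose phonetic content the encoder has to discover.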
I think since your input is not text but already a representation that should capture the hidden nuances of phonemization, it should be fine. It is definitely an easier mapping if the input is phonetised, but the model should still be able to learn. I am actually not sure why the prior loss is so high. Did you try listening to the outputs of the model? Is it utter garbage? (The prior loss, being MSE, can be a bit high sometimes.)
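The point about an MSE loss looking "high" can be seen with a toy example. The numbers below are purely illustrative (not real mel values): a prediction that is off by about one unit everywhere already produces an MSE of 1, even though such an output can still sound reasonable.

```python
# Toy illustration: an MSE-style loss scales with the target's dynamic range,
# so a value between 1 and 2 is not by itself proof the output is garbage.
def mse(pred, target):
    n = len(pred)
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / n

# Hypothetical log-mel-like targets spanning several units.
target = [-4.0, -2.5, -1.0, 0.5]
pred = [t + 1.0 for t in target]   # uniformly off by 1.0
print(mse(pred, target))           # 1.0
```

This is why listening to the samples is a better sanity check than the raw loss value.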
Thanks for such a quick reply! I generated inference results from the model; the ground truth (GT) is:
The Poveys ate all the fish they could and sometimes more than they enjoyed because on his sober days Hollins invariably started his round at the shop, and Constance had to buy for Maggie's sake.
This example is from the training set, and the model does not seem to have fully fit it... I trained Matcha-TTS on the same data using phonetised text, and it fits. I also tried increasing the number of training epochs, but the gains were very small.
target.wav_and_model_output.wav.zip
Then I would have to believe that the hidden representations might not capture what is required to synthesise speech. I am not sure what an easy fix would be; perhaps train some part of the output embeddings using LoRA?
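The LoRA idea suggested here is to keep the frozen weights untouched and learn only a low-rank correction. A minimal numpy sketch of that mechanism, with all sizes (`d_in`, `d_out`, `rank`, `alpha`) chosen arbitrarily for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out, rank = 512, 512, 8          # illustrative sizes; rank << d
W = rng.standard_normal((d_in, d_out))   # frozen pretrained weight

# LoRA: W stays frozen; only A (d_in x r) and B (r x d_out) are trained.
# B starts at zero, so the adapted layer initially equals the frozen one.
A = rng.standard_normal((d_in, rank)) * 0.01
B = np.zeros((rank, d_out))
alpha = 16.0

def adapted_forward(x):
    # Frozen path plus scaled low-rank delta.
    return x @ W + (alpha / rank) * (x @ A @ B)

x = rng.standard_normal((2, d_in))
# With B = 0, the adapter is a no-op:
assert np.allclose(adapted_forward(x), x @ W)
```

The trainable parameter count is `rank * (d_in + d_out)` instead of `d_in * d_out`, which is why it is a cheap way to adapt part of a frozen LLM (the output embeddings, in this suggestion) toward a speech-friendly representation.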
Thanks for your great work. Recently, I have been using the hidden_state output from a large language model as the input to the Matcha-TTS encoder for training. I have trained on a single sample for tens of thousands of steps, but the loss is still very large; in particular, the prior_loss has stayed between 1 and 2. Is there a solution to this problem?