Hello, and thank you for your open-source contribution.
I have a question about the dataset used to train the model: how much mixed Chinese and English speech data was used? Specifically, how many hours of Chinese speech and how many hours of English speech were included? I ask because I'd like to use this information to put the model's performance in context.
The generated Chinese speech does not sound very natural or fluent to me, and I wonder whether this might be due to insufficient training data. Would adding more data to the training set help with this?
Additionally, do you plan to develop a TTS model based on diffusion techniques, or would you consider incorporating the approach used in Matcha-TTS (https://github.com/shivammehta25/Matcha-TTS), which trains with conditional flow matching?
Looking forward to your response.