First of all, I would like to express my sincere gratitude to the authors. This is an excellent piece of work! I have used ConvNext_TTS, and its synthesis quality is impressive, with very fast inference speed.
I trained the model on a dataset of roughly 300 hours containing both Chinese and English speech. However, the synthesized speech occasionally turns hoarse on individual words, and training for longer does not seem to resolve the issue. I've trained for approximately 4M steps, but the problem persists.
For example, the last word of the attached speech segment seems to lack properly generated harmonics.
baker_004.zip
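In case it helps narrow this down, here is a rough sketch of how frames with missing harmonics could be flagged automatically. It uses per-frame spectral flatness (geometric mean over arithmetic mean of the power spectrum), which is close to 0 for clearly harmonic frames and much higher for noisy ones. This is only an assumption about a useful diagnostic, not something taken from the ConvNext_TTS code: the function name `frame_spectral_flatness`, the frame sizes, and the synthetic test signals are all hypothetical stand-ins, and the signals below are not the attached sample.

```python
import numpy as np

def frame_spectral_flatness(signal, frame_len=1024, hop=256, eps=1e-10):
    """Return the spectral flatness (0..1) of each frame of a mono signal.

    Assumes len(signal) >= frame_len. Values near 0 indicate strong
    harmonic peaks; higher values indicate a noise-like spectrum.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    flatness = np.empty(n_frames)
    for i in range(n_frames):
        frame = signal[i * hop : i * hop + frame_len] * window
        # eps avoids log(0) on silent or perfectly periodic frames
        power = np.abs(np.fft.rfft(frame)) ** 2 + eps
        flatness[i] = np.exp(np.mean(np.log(power))) / np.mean(power)
    return flatness

# Synthetic stand-ins (hypothetical): a harmonic tone vs. white noise.
sr = 22050
t = np.arange(sr) / sr
voiced = sum(np.sin(2 * np.pi * 200 * k * t) / k for k in range(1, 11))
hoarse = np.random.default_rng(0).standard_normal(sr)

voiced_flatness = frame_spectral_flatness(voiced).mean()
hoarse_flatness = frame_spectral_flatness(hoarse).mean()
print(voiced_flatness, hoarse_flatness)
```

On real output, one would load the synthesized waveform as a float array and look for words where the flatness jumps compared with the same speaker's normal voiced frames. Unvoiced consonants and silence also score high, so the comparison only makes sense on frames that should be voiced.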