Inconsistent Voice Timbre in Synthesized Speech

My guess is that the speech in your training dataset is not consistent enough. The samples attributed to one speaker may cover many different speaking styles, or speech from different speakers may have been wrongly labeled as coming from a single speaker. So when you synthesize different sentences with the "same speaker", the output may not be consistent in speaking style, or even in timbre. I ran into the same problem because I did not have a good training dataset either. All in all, the quality of a so-called "large model" depends on the data distribution, not only on the amount of data.
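If you want to check whether this is what is happening, one rough way is to compare speaker embeddings of the clips labeled as the same speaker and look for outliers. The sketch below is only an illustration and not part of tortoise-tts: it assumes you have already computed one embedding vector per clip with some speaker encoder of your choice (not shown), and the function name `flag_inconsistent_clips` and the threshold of 0.75 are made up for the example and would need tuning on your data.

```python
import numpy as np

def flag_inconsistent_clips(embeddings, threshold=0.75):
    """Flag clips whose embedding is far from the speaker's average embedding.

    embeddings: one vector per clip, all labeled as the same speaker
                (assumed to come from an external speaker encoder).
    threshold:  hypothetical cosine-similarity cutoff; tune for your encoder.
    Returns (indices of suspicious clips, similarity of every clip to the centroid).
    """
    emb = np.asarray(embeddings, dtype=np.float64)
    # L2-normalize each embedding so a dot product equals cosine similarity.
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    emb = emb / np.maximum(norms, 1e-12)
    # The centroid stands in for the "typical" voice of this speaker label.
    centroid = emb.mean(axis=0)
    centroid = centroid / max(np.linalg.norm(centroid), 1e-12)
    sims = emb @ centroid
    flagged = [i for i, s in enumerate(sims) if s < threshold]
    return flagged, sims

# Toy example: four clips that agree with each other and one that does not.
clips = [[1, 0, 0], [1, 0, 0], [1, 0, 0], [1, 0, 0], [0, 1, 0]]
flagged, sims = flag_inconsistent_clips(clips)
print(flagged)
```

Clips that get flagged are worth listening to by hand. They may be a different speaker under the wrong label, or the same speaker in a very different style, and either case can pull the synthesized timbre around.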

neonbjb / tortoise-tts

Inconsistent Voice Timbre in Synthesized Speech #781