Hello. First of all, I'd like to thank everyone who has helped me in various ways. Thanks to you, I've been able to make progress in my studies, train a VITS TTS model, and synthesize speech.
I have a question about how a model's peak performance depends on its training data. By "peak performance," I mean the model's ability to read a wide variety of texts accurately.
Here's what I've observed during my training:
When I took a single-speaker model trained on 12 hours of Speaker A's data and fine-tuned it with 2 hours of Speaker B's data, the result seemed to have a lower peak performance than the original model trained solely on Speaker A's data. (For fine-tuning, Speaker B's recordings used different text than Speaker A's. Would fine-tuning turn out better if Speaker B read the same text as Speaker A?)
Because of this, I believe securing a base model with high peak performance is crucial. Training on more high-quality single-speaker data would obviously help, but in practice it has been difficult to acquire additional data.
Suppose we have models trained in the following ways:
Single-speaker model1 trained with 10 hours of Speaker A's data.
Multi-speaker model2 trained with model1 + 2 hours of Speaker B's data + 2 hours of Speaker C's data.
Model3 fine-tuned with model1 + 2 hours of Speaker B's data.
I suspect the order of peak performance would be model2 > model1 > model3. Is this correct?
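For context, here is roughly how I launch the fine-tuning run for the model3 case, as a minimal sketch following the Coqui TTS recipe pattern. All paths are placeholders, and details like the `formatter` field vary between TTS versions, so please read it as illustrative rather than exact:

```python
from trainer import Trainer, TrainerArgs

from TTS.tts.configs.shared_configs import BaseDatasetConfig
from TTS.tts.configs.vits_config import VitsConfig
from TTS.tts.datasets import load_tts_samples
from TTS.tts.models.vits import Vits
from TTS.tts.utils.text.tokenizer import TTSTokenizer
from TTS.utils.audio import AudioProcessor

# Hypothetical paths; "data/speaker_b" holds the 2-hour Speaker B set
# in LJSpeech folder layout (wavs/ + metadata.csv).
output_path = "runs/model3_finetune_speaker_b"
dataset_config = BaseDatasetConfig(
    formatter="ljspeech", meta_file_train="metadata.csv", path="data/speaker_b"
)

config = VitsConfig(
    output_path=output_path,
    datasets=[dataset_config],
    run_name="vits_finetune_speaker_b",
)

ap = AudioProcessor.init_from_config(config)
tokenizer, config = TTSTokenizer.init_from_config(config)
train_samples, eval_samples = load_tts_samples(dataset_config, eval_split=True)

model = Vits(config, ap, tokenizer, speaker_manager=None)

# restore_path loads the weights of the Speaker A base model (model1) and
# continues training on Speaker B's data only -- the "model3" case above.
trainer = Trainer(
    TrainerArgs(restore_path="runs/model1_speaker_a/best_model.pth"),
    config,
    output_path,
    model=model,
    train_samples=train_samples,
    eval_samples=eval_samples,
)
trainer.fit()
```

(For the model2 case I would additionally enable `use_speaker_embedding` in the model args and pass the datasets of all speakers. I'm not sure whether restoring single-speaker weights into a multi-speaker architecture is fully supported, which may itself affect the comparison.)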
I've tried to find clear explanations or evidence for this case, such as papers or articles, but haven't been successful.
My concern is that when training a multi-speaker model, even if there is a lot of data overall, each speaker's data is kept separate by its speaker ID.
Does that mean one speaker's data contributes nothing to the peak performance the model reaches for the other speakers?
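To explain what I mean, here is a toy sketch in plain PyTorch of how I understand speaker conditioning (my own simplification, not the actual VITS code; every name in it is made up): the text-encoding weights are shared across all speakers, and only a small per-speaker embedding differs, so in principle every speaker's data would still train the shared text-processing layers.

```python
import torch
import torch.nn as nn

# Toy sketch of multi-speaker conditioning -- my own simplification,
# not the actual VITS architecture. All names are hypothetical.
class ToyMultiSpeakerTTS(nn.Module):
    def __init__(self, n_phonemes=100, n_speakers=3, d_model=192, spk_dim=64):
        super().__init__()
        # Shared text encoder: updated by every speaker's batches,
        # regardless of which speaker ID they carry.
        self.text_encoder = nn.Sequential(
            nn.Embedding(n_phonemes, d_model),
            nn.Linear(d_model, d_model),
            nn.ReLU(),
        )
        # The only per-speaker parameters: one embedding vector per ID.
        self.speaker_embedding = nn.Embedding(n_speakers, spk_dim)
        # Decoder conditions on text features plus the speaker vector.
        self.decoder = nn.Linear(d_model + spk_dim, 80)  # 80 mel bins

    def forward(self, phoneme_ids, speaker_id):
        h = self.text_encoder(phoneme_ids)               # (T, d_model)
        s = self.speaker_embedding(speaker_id)           # (spk_dim,)
        s = s.expand(h.size(0), -1)                      # repeat over time
        return self.decoder(torch.cat([h, s], dim=-1))   # (T, 80)

model = ToyMultiSpeakerTTS()
mel = model(torch.tensor([5, 17, 42]), torch.tensor(1))
print(mel.shape)  # torch.Size([3, 80])
```

If this picture is correct, the 2-hour speakers' data would still improve the shared layers, and only their speaker embeddings would be data-starved, but I haven't found confirmation of this.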
Testing and comparing all these cases would require a lot of resources and time, so I'm seeking assistance here. Does anyone have experience or related materials to share on this topic?