jaywalnut310 / vits

VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech
https://jaywalnut310.github.io/vits-demo/index.html
MIT License

Multi-speaker identity degradation #123

Open NikitaKononov opened 1 year ago

NikitaKononov commented 1 year ago

Hi! I've encountered a problem. I have a multi-speaker dataset.

If I train a separate model per speaker (a single-speaker model), then prosody, speed, intonation, timbre, and identity are all good for that speaker. But if I train a multi-speaker model, only each speaker's timbre survives; identity degrades a lot compared to the single-speaker case. Speed, prosody, and intonation become a kind of average, so the speakers end up sounding similar to one another.

Is this expected or not? It seems quite logical: we train the same generator for all speakers and only the speaker embeddings are separate, so we end up with a kind of average over all speakers. But maybe there is a way to preserve more identity; can someone suggest one? Or maybe other models (Grad-TTS, FastSpeech 2, etc.) do better at preserving identity in multi-speaker mode.
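For reference, as far as I can tell the only per-speaker parameters in this repo are a small embedding table that gets broadcast as a global conditioning vector; everything else is shared. A simplified sketch (not the exact code from models.py):

```python
import torch
import torch.nn as nn

class SpeakerConditioning(nn.Module):
    """Simplified sketch of multi-speaker conditioning in VITS-style models:
    one learned embedding per speaker id, broadcast as a global condition g.
    All other generator weights are shared across speakers."""

    def __init__(self, n_speakers: int, gin_channels: int = 256):
        super().__init__()
        # stand-in for the per-speaker lookup table (emb_g in models.py)
        self.emb_g = nn.Embedding(n_speakers, gin_channels)

    def forward(self, sid: torch.Tensor) -> torch.Tensor:
        # [batch] speaker ids -> [batch, gin_channels, 1] conditioning vector
        return self.emb_g(sid).unsqueeze(-1)

# The same generator sees every speaker; only g differs per speaker, which is
# probably why prosody and speed drift toward a cross-speaker average.
cond = SpeakerConditioning(n_speakers=22)
g = cond(torch.tensor([0, 3, 7]))
print(g.shape)  # torch.Size([3, 256, 1])
```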

Thanks

nikich340 commented 1 year ago

In my experience, FastSpeech, FastPitch, and similar models have worse quality (less variety) than VITS. What you can try: half-train a single-speaker model on some "average" voice and then use it as a starting point for fine-tuning, so you get the individual per-speaker models faster.
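Concretely, the workflow I mean looks something like this (a rough sketch; the paths are placeholders, and I'm assuming the stock train.py resume behaviour, where it picks up the latest G_*.pth / D_*.pth in the model directory):

```python
import os
import shutil

# Hypothetical paths; point these at your own runs and configs.
PRETRAIN_G = "logs/avg_voice/G_100000.pth"  # half-trained "average voice" generator
PRETRAIN_D = "logs/avg_voice/D_100000.pth"  # matching discriminator
TARGET_DIR = "logs/speaker_a"               # model dir for the fine-tuning run

# train.py resumes from the latest G_*.pth / D_*.pth it finds in the model
# dir, so seeding the new dir with the pretrained pair makes the fine-tuning
# run start from those weights instead of from a random initialization.
os.makedirs(TARGET_DIR, exist_ok=True)
shutil.copy(PRETRAIN_G, os.path.join(TARGET_DIR, "G_100000.pth"))
shutil.copy(PRETRAIN_D, os.path.join(TARGET_DIR, "D_100000.pth"))
```

Then launch train.py with the target speaker's config and filelists and let it run until the voice sounds good enough.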

heesuju commented 1 year ago

I have the same problem with identity degradation. I'm training with 22 different speakers, but one of them is drowning out the rest. I think this is because that one speaker has more than 7 hours of data while the rest have around 1-3 hours each.

Given that, I would like to ask whether identity degradation is an actual problem with VITS and other TTS models. If so, should I try balancing out the amount of data per speaker?

I also tried to follow the above solution, training a multi-speaker model from a single-speaker model, but I get an error indicating that the channels do not match. My guess is that the channel count depends on the number of speakers embedded in the model. How would I add more speakers to a VITS model if the number of speakers is always different?
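For the balancing part, the plain-PyTorch idea I'm considering is a per-speaker weighted sampler, so every speaker is drawn with roughly equal probability regardless of how many hours they have. This is only an illustration of the idea; train_ms.py uses its own DistributedBucketSampler, so it is not a drop-in change:

```python
from collections import Counter

from torch.utils.data import WeightedRandomSampler

def make_balanced_sampler(speaker_ids):
    """Sample each utterance with probability inversely proportional to how
    many utterances its speaker has, so a 7-hour speaker no longer drowns
    out the 1-3 hour speakers."""
    counts = Counter(speaker_ids)                        # utterances per speaker
    weights = [1.0 / counts[sid] for sid in speaker_ids]
    return WeightedRandomSampler(weights, num_samples=len(speaker_ids),
                                 replacement=True)

# usage with hypothetical speaker ids parsed from a "path|sid|text" filelist
sampler = make_balanced_sampler([0, 0, 0, 0, 1, 1, 2])
print(list(sampler))  # dataset indices, with speakers 1 and 2 upsampled
```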

nikich340 commented 1 year ago

> I also tried to follow the above solution, training a multi-speaker model from a single-speaker model, but I get an error indicating that the channels do not match. [...] How would I add more speakers to a VITS model if the number of speakers is always different?

You can't do this. The model is initialized with N speakers, and once training starts the layer sizes can't be changed. But there is no real need for a multi-speaker model: you can build a "pretrain" model on some good dataset with an average voice and later fine-tune it on other datasets until the results are satisfying.
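That said, a common workaround (not something from this repo) is to warm-start only the tensors whose shapes match the new model and leave the mismatched ones, such as the [n_speakers, gin_channels] speaker embedding, at their fresh initialization. A rough sketch:

```python
import torch

def load_compatible_weights(model, ckpt_path):
    """Copy only the tensors whose shapes match the freshly built model, so a
    checkpoint trained with a different n_speakers / gin_channels can still
    warm-start the layers that line up (e.g. the text encoder) while the
    mismatched ones stay randomly initialized."""
    ckpt = torch.load(ckpt_path, map_location="cpu")
    pretrained = ckpt.get("model", ckpt)   # VITS checkpoints keep weights under "model"
    own = model.state_dict()
    kept = {k: v for k, v in pretrained.items()
            if k in own and v.shape == own[k].shape}
    own.update(kept)
    model.load_state_dict(own)
    print(f"loaded {len(kept)}/{len(own)} tensors from {ckpt_path}")
    return model
```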

heesuju commented 1 year ago

> You can't do this. The model is initialized with N speakers, and once training starts the layer sizes can't be changed. But there is no real need for a multi-speaker model: you can build a "pretrain" model on some good dataset with an average voice and later fine-tune it on other datasets until the results are satisfying.

Oh, I thought you were talking about fine-tuning a multi-speaker model. My mistake. Training separate models for each speaker does sound like a good idea. Everything is crystal clear now. Thank you for your help!

If I may ask another question: is identity degradation an actual problem in VITS or other TTS models? I don't think this problem was mentioned in any of the papers I have checked, so I'd like to confirm whether this is just my imagination or a recognized problem.

EDIT: VITS2 reports improved multi-speaker similarity, which suggests that multi-speaker identity degradation is a real, acknowledged problem.