[Open] n5-suzuki opened this issue 3 years ago
For Tacotron, using a speaker embedding to train a multi-speaker model works, so I think MB-MelGAN could also be adjusted to take a speaker embedding as input, but you might need to modify the model structure. Briefly, the model needs some way of knowing which mel belongs to which speaker.
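To make the speaker-embedding idea concrete, here is a minimal framework-agnostic sketch in NumPy (all shapes, sizes, and names are illustrative assumptions, not the actual TensorFlowTTS implementation): a learned per-speaker vector is looked up by speaker ID, tiled across the time axis, and concatenated onto the encoder outputs so the txt2mel decoder can condition on speaker identity.

```python
import numpy as np

# Hypothetical shapes for illustration only:
# B utterances, T encoder frames, D_enc encoder features, D_spk embedding size.
B, T, D_enc, D_spk, num_speakers = 2, 50, 256, 64, 3

rng = np.random.default_rng(0)
encoder_out = rng.standard_normal((B, T, D_enc))

# In a real model this table would be an nn.Embedding-style layer
# trained jointly with the rest of the network.
speaker_table = rng.standard_normal((num_speakers, D_spk))

# Which speaker each utterance in the batch belongs to.
speaker_ids = np.array([0, 2])

# Look up each utterance's speaker vector, tile it across time,
# and concatenate along the feature axis.
spk = speaker_table[speaker_ids]                           # (B, D_spk)
spk = np.broadcast_to(spk[:, None, :], (B, T, D_spk))      # (B, T, D_spk)
conditioned = np.concatenate([encoder_out, spk], axis=-1)  # (B, T, D_enc + D_spk)

print(conditioned.shape)  # (2, 50, 320)
```

The same concatenation pattern applies whether the speaker vector is a trained lookup table, a pretrained x-vector, or a global style token: the decoder simply sees extra features that identify the speaker.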
Sorry for the late reply. For the vocoder, multi-speaker training is possible without any modification. However, in my experiments, speaker-dependent or gender-dependent models perform better. For the txt2mel model, you need to introduce some speaker information to train a multi-speaker model. In the case of ESPnet, we used a pretrained x-vector or global style tokens to handle the multi-speaker condition.
Hi. Thanks for the great work! I trained Tacotron2 + MB-MelGAN on speaker_A of the original dataset and got a good voice. I also trained Tacotron2 + MB-MelGAN on speaker_B of the original dataset and again got a good voice. Now I have a question: if I want to synthesize the voice of speaker_C, should I train on speaker_C only? If possible, I would like to train on speaker_A, speaker_B, and speaker_C together. If all the mel settings are the same, can I train them together? And if that is possible, can I train male and female voices together? Thanks!