[Open] n5-suzuki opened this issue 3 years ago
For Tacotron, using a speaker embedding to train a multi-speaker model works, so I think MB-MelGAN could also be adjusted to take a speaker embedding as input, but you might need to modify the model structure. Briefly, the model needs some way of knowing which mel belongs to which speaker.
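To make the speaker-embedding idea concrete, here is a minimal framework-agnostic sketch in NumPy (all shapes, sizes, and names are illustrative assumptions, not the actual TensorFlowTTS implementation): a learned per-speaker vector is looked up by speaker ID, tiled across the time axis, and concatenated onto the encoder outputs so the txt2mel decoder can condition on speaker identity.

```python
import numpy as np

# Hypothetical shapes for illustration only:
# B utterances, T encoder frames, D_enc encoder features, D_spk embedding size.
B, T, D_enc, D_spk, num_speakers = 2, 50, 256, 64, 3

rng = np.random.default_rng(0)
encoder_out = rng.standard_normal((B, T, D_enc))

# In a real model this table would be an nn.Embedding-style layer
# trained jointly with the rest of the network.
speaker_table = rng.standard_normal((num_speakers, D_spk))

# Which speaker each utterance in the batch belongs to.
speaker_ids = np.array([0, 2])

# Look up each utterance's speaker vector, tile it across time,
# and concatenate along the feature axis.
spk = speaker_table[speaker_ids]                           # (B, D_spk)
spk = np.broadcast_to(spk[:, None, :], (B, T, D_spk))      # (B, T, D_spk)
conditioned = np.concatenate([encoder_out, spk], axis=-1)  # (B, T, D_enc + D_spk)

print(conditioned.shape)  # (2, 50, 320)
```

The same concatenation pattern applies whether the speaker vector is a trained lookup table, a pretrained x-vector, or a global style token: the decoder simply sees extra features that identify the speaker.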
Sorry for the late reply. For the vocoder, multi-speaker training is possible without any modification. However, in my experiments, speaker-dependent or gender-dependent models perform better. For the txt2mel model, you need to introduce some speaker information to train a multi-speaker model. In the case of ESPnet, we used a pretrained x-vector or global style tokens to handle the multi-speaker condition.
Hi. Thanks for the great work! I trained Tacotron2 + MB-MelGAN on speaker_A of the original dataset and got a good voice. I also trained Tacotron2 + MB-MelGAN on speaker_B of the original dataset and again got a good voice. Now I have a question: if I want to synthesize the voice of speaker_C, should I train on speaker_C only? If possible, I would like to train on speaker_A, speaker_B, and speaker_C together. If all the mel settings are the same, can I train them together? And if that is possible, can I train male and female voices together? Thanks!