seungwonpark / melgan

MelGAN vocoder (compatible with NVIDIA/tacotron2)
http://swpark.me/melgan/
BSD 3-Clause "New" or "Revised" License

Use MelGAN as a universal vocoder #15

Open · wade3han opened 4 years ago

wade3han commented 4 years ago

Hello, thanks for your nice implementation of MelGAN.

I guess MelGAN can be used as a universal vocoder, and I recall there was a mention of a multi-speaker training scheme in the original paper. Have you ever tried a multi-speaker setting? It would be really useful if it could serve as a universal vocoder, similar to this.

seungwonpark commented 4 years ago

Hi, @wade3han

Yes, we're trying to train MelGAN with a multi-speaker dataset. Though the authors have mentioned that it can generalize to unseen speakers, we'll need to test how many speakers are required to train such a universal vocoder.

seungwonpark commented 4 years ago

By the way, I won't be able to release a pretrained model for the multi-speaker setting due to a conflict of interest.

It would be really appreciated if someone could upload their own pretrained model from a multi-speaker setting. (The official implementation contains a multi-speaker pretrained model; however, it's not compatible with NVIDIA/tacotron2: both the STFT function and the mel filterbank are different.)
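For anyone comparing the two pipelines, here is a minimal sketch of the NVIDIA/tacotron2-style mel extraction that this repo expects, written with librosa rather than tacotron2's actual torch STFT. The parameter values are tacotron2's published defaults, but treat them as assumptions and verify them against the hparams before relying on this:

```python
# A minimal sketch of tacotron2-style mel extraction (librosa-based).
# Parameter values are assumed from NVIDIA/tacotron2's defaults; tacotron2
# itself uses a reflection-padded torch STFT, so this is close but not
# bit-identical.
import numpy as np
import librosa

def tacotron2_mel(wav, sr=22050, n_fft=1024, hop_length=256, win_length=1024,
                  n_mels=80, fmin=0.0, fmax=8000.0):
    # Magnitude STFT.
    spec = np.abs(librosa.stft(wav, n_fft=n_fft, hop_length=hop_length,
                               win_length=win_length))
    # Slaney-style mel filterbank from librosa.filters.mel (the same
    # function tacotron2 uses via librosa_mel_fn).
    mel_basis = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels,
                                    fmin=fmin, fmax=fmax)
    mel = mel_basis @ spec
    # Dynamic-range compression: natural log with a small floor.
    return np.log(np.clip(mel, a_min=1e-5, a_max=None))
```

A model trained on mels from a different STFT or filterbank (like the official MelGAN's) won't transfer, which is why those checkpoints can't be used here.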

ghost commented 4 years ago

@wade3han you can have a look here: Real-Time-Voice-Cloning video

By the way @seungwonpark, nice use of we're & we'll. You've got the grammar in you ;)

seungwonpark commented 4 years ago

I found it really hard to get a universal vocoder, even when using a multi-speaker dataset. It doesn't generalize to unseen speakers. (I can't share the audio samples, sorry about that.) Any more ideas/insights regarding this issue?

wade3han commented 4 years ago

I've also tried with our dataset, and it was able to generalize to unseen speakers. An interesting part was that even though I trained the vocoder on a Korean dataset, the model could deal with English. I can't release our model or share the trained results because of confidentiality, sorry about that.

seungwonpark commented 4 years ago

@wade3han Sounds great. Thanks for letting me know! Since we can't share audio samples, could you kindly describe how the results were instead? Did they have no audible artifacts, even when heard through headphones? Compared with http://swpark.me/melgan/, which epoch best represents your results?

wade3han commented 4 years ago

Most of the samples have quality similar to the 6400-epoch samples; however, I found that the vocoder was vulnerable to background noise (such as clapping sounds).

xxoospring commented 4 years ago

In my experiment, if you just switch to your own data and continue training once the base model's training is done (for example, a base model trained on LJSpeech), you will get results as good as the base model's.
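For illustration, a minimal fine-tuning sketch in PyTorch is below. This is not this repo's actual trainer: the checkpoint key names ('model_g'), the file path, and the constructor argument are assumptions taken from how checkpoints of this kind are commonly saved, so check what trainer.py really stores before reusing them.

```python
# A fine-tuning sketch: load a generator trained on a base dataset (e.g.
# LJSpeech), then continue training on your own data. Key names and paths
# below are hypothetical placeholders.
import torch
from model.generator import Generator  # this repo's generator module

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# 80 mel channels, matching the NVIDIA/tacotron2-style input.
generator = Generator(80).to(device)

# Load the base checkpoint (hypothetical path and key name).
ckpt = torch.load('chkpt/ljspeech_base.pt', map_location=device)
generator.load_state_dict(ckpt['model_g'])

# Adam with lr=1e-4 and betas=(0.5, 0.9), as in the MelGAN paper.
optim_g = torch.optim.Adam(generator.parameters(), lr=1e-4, betas=(0.5, 0.9))

# From here, iterate over (mel, audio) pairs from the new dataset and
# resume the usual adversarial training loop (discriminator included).
```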

Immortalin commented 4 years ago

@xxoospring So continue training on multi-speaker data starting from the existing LJSpeech model?

ioannist commented 3 years ago

> Most of the samples have quality similar to the 6400-epoch samples; however, I found that the vocoder was vulnerable to background noise (such as clapping sounds).

Same here. I paused training around epoch 8000 and added more wav files to give the model more data. However, the new files had some clapping here and there, and the model only started generating clapping about 7000 epochs later.

By the way, it's such a shame that there is no pre-trained model and no multi-speaker model in this repo. It would spare the environment a lot of unnecessary GPU emissions.