wade3han opened this issue 5 years ago

Hello, thanks for your nice implementation of MelGAN.
I guess MelGAN can be used as a universal vocoder, and I thought there was a mention of a multi-speaker training scheme in the original paper. Have you ever tried a multi-speaker setting? It would be really useful if it could be a universal vocoder similar to this.
Hi, @wade3han
Yes, we're trying to train MelGAN with a multi-speaker dataset. Though the authors have mentioned that it can generalize to unseen speakers, we'll need to test how many speakers are required to train such a universal vocoder.
btw, I won't be able to release a pretrained model for the multi-speaker setting due to a conflict of interest.
It'd be really appreciated if someone could upload their own pretrained multi-speaker model. (The official implementation contains a multi-speaker pretrained model; however, it's not compatible with NVIDIA/tacotron2: both the STFT function and the mel filterbank are different.)
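To make that incompatibility concrete, here's a minimal sketch of how two pipelines can end up with different mel filterbanks even at the same sample rate and FFT size. The parameter values below are illustrative assumptions, not the exact configs of the official MelGAN or NVIDIA/tacotron2 repos.

```python
# Two mel filterbanks built with different conventions do not match,
# so mels from one pipeline can't feed a vocoder trained on the other.
# All parameter values here are illustrative assumptions.
import numpy as np
import librosa

sr, n_fft, n_mels = 22050, 1024, 80

# One pipeline might use an HTK-style mel scale with no fmax cap...
fb_a = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels, htk=True)

# ...while another uses the Slaney scale with fmax capped at 8 kHz.
fb_b = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels, fmax=8000.0)

print(fb_a.shape, fb_b.shape)   # (80, 513) (80, 513)
print(np.allclose(fb_a, fb_b))  # False: the basis functions differ
```

The same goes for the STFT settings (window, hop, centering, normalization): unless every step matches, checkpoints trained on one feature pipeline won't work with the other.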
@wade3han you can have a look here: Real-Time-Voice-Cloning video
By the way @seungwonpark, nice use of "we're" & "we'll". You got the grammar in you ;)
I found it really hard to get a universal vocoder, even when using a multi-speaker dataset; it doesn't generalize to unseen speakers. (I can't share the audio samples, sorry about that.) Any more ideas/insights regarding this issue?
I've also tried with our dataset, and it was able to generalize to unseen speakers. An interesting part was that even though I trained the vocoder on a Korean dataset, the model could handle English. I can't release our model or share the trained results due to confidentiality, sorry about that.
@wade3han Sounds great, thanks for letting me know! Since we can't share audio samples, can you kindly describe how the results were? Did they have no audible artifacts, even when heard through headphones? Compared with http://swpark.me/melgan/, which epoch best represents your results?
Most of the samples have quality similar to the samples from 6400 epochs; however, I found that the vocoder was vulnerable to background noise (such as clapping sounds).
In my experiments, once training of the base model (for example, one trained on LJSpeech) is done, just switch to your own data and continue training; you will get results as good as the base model's.
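As a rough illustration of that recipe, here's a generic PyTorch sketch of saving a base checkpoint and then resuming training on new data. The tiny model, loss, and random tensors are stand-ins, not this repo's actual trainer or losses.

```python
# Generic fine-tuning sketch: restore a base checkpoint, swap the dataset,
# and keep training. The model/loss/data below are stand-ins only.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Conv1d(80, 1, kernel_size=7, padding=3)  # stand-in for the generator
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# 1) After base training (e.g. on LJSpeech), a checkpoint is saved.
torch.save({"model": model.state_dict(),
            "optim": optimizer.state_dict()}, "base_ljspeech.pt")

# 2) Later: restore the checkpoint and continue training on the new data.
ckpt = torch.load("base_ljspeech.pt")
model.load_state_dict(ckpt["model"])
optimizer.load_state_dict(ckpt["optim"])

new_data = TensorDataset(torch.randn(64, 80, 32),  # stand-in mel inputs
                         torch.randn(64, 1, 32))   # stand-in waveform targets
loader = DataLoader(new_data, batch_size=16, shuffle=True)

loss_fn = nn.MSELoss()  # stand-in; MelGAN actually uses adversarial losses
for mel, wav in loader:
    optimizer.zero_grad()
    loss = loss_fn(model(mel), wav)
    loss.backward()
    optimizer.step()
```

The point is that nothing else changes: same model, same optimizer state, same schedule; only the data loader points at the new corpus.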
@xxoospring So, continue training with multi-speaker data starting from the existing LJSpeech model?
> Most of the samples have quality similar to the samples from 6400 epochs; however, I found that the vocoder was vulnerable to background noise (such as clapping sounds).
Same here. I paused training around epoch 8000 and added more wav files to give the model more data. However, the new files had some clapping here and there, and the model only started generating clapping about 7000 epochs later.
Btw, it's such a shame that there is no pretrained model and no multi-speaker model in this repo. It would spare the environment a lot of unnecessary GPU emissions.
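If it helps anyone curating data around the clapping issue above, here's a crude, hypothetical screen for clips with impulsive noise before they enter the training set. The energy-based heuristic and the 15 dB margin are arbitrary assumptions, not a validated method.

```python
# Flag clips whose loudest frames tower far above their typical level,
# a rough proxy for impulsive noise such as clapping. The 15 dB margin
# is an arbitrary, illustrative threshold.
import numpy as np
import librosa
from pathlib import Path

def has_impulsive_noise(path, margin_db=15.0):
    y, sr = librosa.load(path, sr=22050)
    rms = librosa.feature.rms(y=y, frame_length=2048, hop_length=512)[0]
    rms_db = 20.0 * np.log10(rms + 1e-10)
    return rms_db.max() - np.median(rms_db) > margin_db

wav_paths = sorted(Path("my_corpus").glob("*.wav"))  # hypothetical corpus dir
clean = [p for p in wav_paths if not has_impulsive_noise(p)]
```

Listening to the flagged files is still the safest check; this only narrows down what to review.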