unilight opened this issue 4 years ago
Could you paste your tensorboard log?
I am not sure what the correct curve of MelGAN should look like, but the curve of PWG looks correct.
It seems to be fine, but the discriminator loss is a little bit small. I'm not sure whether this is a problem with the amount of training data. (Additionally, I'm not sure which is better in terms of quality in general, PWG or MB-MelGAN.) Did you try a different `lambda_adv`?
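For readers following along: `lambda_adv` weights the adversarial term against the multi-resolution STFT auxiliary loss in the generator objective. A minimal sketch of that combination, with illustrative names and a default value that are assumptions rather than the repository's exact API:

```python
import torch

# Minimal sketch of how lambda_adv enters a Parallel WaveGAN-style
# generator objective: it scales the adversarial term relative to the
# multi-resolution STFT auxiliary loss (names/defaults are illustrative).
def generator_loss(stft_loss, disc_fake_out, lambda_adv=4.0):
    # LSGAN-style adversarial term: push D(G(mel)) toward 1.
    adv_loss = torch.mean((disc_fake_out - 1.0) ** 2)
    return stft_loss + lambda_adv * adv_loss
```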
Hi @kan-bayashi, I have a similar task to @unilight's (multi-speaker & few samples). The clarity is good but the similarity is poor, so now I'm trying to train Parallel WaveGAN on only one speaker, but the loss is almost not decreasing. Which parameters could I adjust to make the training continue to converge?
the loss is almost not decreasing
Which loss value do you mean? The discriminator loss of PWG stays at around 0.5.
The figure is my tensorboard log. The generator loss drop at 1M is because I changed `lambda_adv` to 2, but it doesn't seem to work.
@kan-bayashi I've tried several times, modifying the learning rate and `lambda_adv`, but it always overfits after 1M iterations. Any idea how to avoid this? Thanks.
I think in the case of PWG, 1M iterations are enough. Why don't you try adaptation using a good single-speaker model? In my experiments, it works well with only 50k iters.
Thanks for replying, @kan-bayashi. I'm now training single-speaker models; some speakers are clear, some have louder noise. As the results in single_speaker_inference_1000k.zip show, the quality is not good enough. I'm not sure what the main reason is: insufficient training, limited training data (only 70 sentences per speaker), or the quality of the training data.
I'm now training single-speaker models
Did you use a pretrained model? 70 utterances are not enough to train from scratch. I think it is better to consider using a pretrained scheme.
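A minimal sketch of the pretrained scheme being suggested, i.e. warm-starting the generator and discriminator weights from an existing checkpoint before adapting on the small dataset. The checkpoint layout and key names below are assumptions; adjust them to the actual checkpoint format:

```python
import torch

def load_pretrained(generator, discriminator, ckpt_path):
    """Warm-start from a pretrained checkpoint before adaptation.

    The layout {"model": {"generator": ..., "discriminator": ...}} is an
    assumption; adapt the keys to the actual file.
    """
    state = torch.load(ckpt_path, map_location="cpu")
    generator.load_state_dict(state["model"]["generator"])
    discriminator.load_state_dict(state["model"]["discriminator"])
    # Optimizer/scheduler states are usually re-initialized for adaptation,
    # typically together with a reduced learning rate and fewer iterations.
    return generator, discriminator
```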
The mel-spectrogram parameters (ZH, 16 kHz) do not fit the pretrained model, so I have to train from scratch. I previously trained a multi-speaker, multi-language model for 840k iterations and then fine-tuned it on a single speaker; the results are uploaded above (the mel spectrograms are generated by a voice conversion model). Since yesterday, I have been following this advice:
Why don't you try adaptation using a good single-speaker model? In my experiments, it works well with only 50k iters.
It is at 80k iters now, and the quality hasn't reached the level of the 1000k models. I'll continue training to see when it reaches its best performance.
@kan-bayashi When you trained PWGAN up to 50k from a pretrained model, when did you turn on the discriminator? After how many steps, or from the start? When I fine-tune a female voice on LJSpeech it almost always sounds good after 20k steps, but a male voice sounds bad.
In my case, I use the discriminator from the first iteration. Both male and female adaptation with the pretrained female model works well, but it depends on the speakers' similarity. If you have a good male speaker dataset, it is better to consider creating a single male speaker model.
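To illustrate "use the discriminator from the first iteration": a rough sketch of a training step where the discriminator and the adversarial term are gated on a start-step threshold, so setting that threshold to 0 enables them from the beginning. The names, the LSGAN-style losses, and the defaults are illustrative assumptions, not the repository's exact code:

```python
import torch

def train_step(step, mel, wav, generator, discriminator, opt_g, opt_d,
               stft_loss_fn, lambda_adv=4.0, disc_start_steps=0):
    """One gated training step; disc_start_steps=0 means the
    discriminator is active from the first iteration."""
    use_adv = step >= disc_start_steps

    # Generator update: STFT auxiliary loss always, adversarial term
    # only once the discriminator is active.
    wav_hat = generator(mel)
    loss_g = stft_loss_fn(wav_hat, wav)
    if use_adv:
        loss_g = loss_g + lambda_adv * torch.mean(
            (discriminator(wav_hat) - 1.0) ** 2)
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()

    # Discriminator update (LSGAN-style): real -> 1, generated -> 0.
    if use_adv:
        loss_d = torch.mean((discriminator(wav) - 1.0) ** 2) \
            + torch.mean(discriminator(wav_hat.detach()) ** 2)
        opt_d.zero_grad()
        loss_d.backward()
        opt_d.step()
```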
@kan-bayashi VCTK has some male speakers; can we fine-tune a single male speaker from a multi-speaker model?
In my adaptation experiment using a female dataset, multi-speaker-based adaptation was worse than single female-speaker-based adaptation. But if you do not have a male model, it is worth trying.
I also find that MB-MelGAN generates noisy speech, especially in the high frequencies. I wonder whether the discriminator cannot downsample properly because of the sub-band processing.
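For context on the sub-band question: in a multi-band MelGAN-style setup the generator predicts sub-band signals and a PQMF synthesis filter reconstructs the full-band waveform, which is typically what the discriminator sees. A rough sketch under that assumption; the `pqmf.synthesis` call stands in for whatever PQMF module is actually used:

```python
def discriminate_fullband(subband_wavs, pqmf, discriminator):
    """Sketch: combine generator sub-band outputs into a full-band
    waveform before passing it to the discriminator.

    subband_wavs: (batch, num_subbands, time // num_subbands)
    pqmf.synthesis is assumed to return (batch, 1, time).
    """
    full_band = pqmf.synthesis(subband_wavs)
    return discriminator(full_band)
```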
Hi, first of all, big thanks to all the people who helped develop the MelGAN model! The inference speed is super fast!
I modified the configs from the VCTK recipe to train a parallel_wavegan.v1 and a multi_band_melgan.v2 on the VCC2018 dataset, which contains 12 speakers * 81 utterances = 972 training utterances. The analysis-synthesis samples are as follows:

PWG (400k steps): https://drive.google.com/drive/folders/1buveb7V_nz7reWNCQsy2loxXjynVItkV?usp=sharing
MelGAN (1000k steps): https://drive.google.com/drive/folders/1X5hrryxRL_txNtyE48Xw1gpzN7T24cB3?usp=sharing

I found that MelGAN generates much noisier speech samples, while PWG is pretty stable. I listened to the official VCTK samples and found a similar trend: MelGAN is a little bit worse than PWG (less noisy due to the larger dataset?). Is this a known issue? Any suggestions on how to improve this?