kan-bayashi / ParallelWaveGAN

Unofficial Parallel WaveGAN (+ MelGAN & Multi-band MelGAN & HiFi-GAN & StyleMelGAN) with Pytorch
https://kan-bayashi.github.io/ParallelWaveGAN/
MIT License

Noisy speech samples of (Multi-band) MelGAN on (small) multispeaker dataset #167

unilight opened this issue 4 years ago

unilight commented 4 years ago

Hi, first of all, big thanks to all the people who helped develop the MelGAN model! The inference speed is super fast!

I modified the configs from the VCTK recipe to train a parallel_wavegan.v1 and a multi_band_melgan.v2 model on the VCC2018 dataset, which contains 12 speakers × 81 utterances = 972 training utterances. The analysis-synthesis samples are here:

PWG (400k steps): https://drive.google.com/drive/folders/1buveb7V_nz7reWNCQsy2loxXjynVItkV?usp=sharing
MelGAN (1000k steps): https://drive.google.com/drive/folders/1X5hrryxRL_txNtyE48Xw1gpzN7T24cB3?usp=sharing

I found that MelGAN generates much noisier speech samples, while PWG is pretty stable. I listened to the official VCTK samples and noticed a similar trend: MelGAN is a little worse than PWG, though less noisy than mine (perhaps because of the larger dataset?). Is this a known issue? Any suggestions on how to improve it?
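For anyone reproducing this, analysis-synthesis with a trained checkpoint can be run roughly as follows. This is a minimal sketch based on this repo's load_model utility; the checkpoint and feature paths are placeholders.

```python
import numpy as np
import torch
from parallel_wavegan.utils import load_model  # helper shipped with this repo

# Load a trained checkpoint (paths below are placeholders).
model = load_model("exp/checkpoint-1000000steps.pkl")
model.remove_weight_norm()
model = model.eval()

# Mel spectrogram extracted with the SAME config used for training,
# shaped (num_frames, num_mels), e.g. 80 bins in the default configs.
c = torch.from_numpy(np.load("dump/eval/feats.npy")).float()

with torch.no_grad():
    y = model.inference(c).view(-1)  # generated waveform as a 1-D tensor
```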

kan-bayashi commented 4 years ago

Could you paste your TensorBoard log?

unilight commented 4 years ago

[screenshot: TensorBoard training curves]

I am not sure what the correct curve of MelGAN should look like, but the curve of PWG looks correct.

kan-bayashi commented 4 years ago

It seems fine, but the discriminator loss is a little small. I'm not sure whether this is a problem with the amount of training data. (Also, I'm not sure which is better in general, PWG or MB-MelGAN, in terms of quality.) Did you try a different lambda_adv?
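For context, lambda_adv weights the adversarial term against the auxiliary multi-resolution STFT loss in the generator objective, so raising it pushes the generator harder toward fooling the discriminator. A rough sketch with illustrative stand-in values, not the repo's actual training loop:

```python
import torch
import torch.nn.functional as F

lambda_adv = 4.0  # e.g. the parallel_wavegan.v1 value; the knob in question

aux_loss = torch.tensor(1.0)     # stand-in for the multi-resolution STFT loss
p_fake = torch.rand(4, 1, 8000)  # stand-in discriminator scores on fake audio
adv_loss = F.mse_loss(p_fake, torch.ones_like(p_fake))  # LSGAN-style generator term
gen_loss = aux_loss + lambda_adv * adv_loss
print(float(gen_loss))
```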

Approximetal commented 4 years ago

It seems fine, but the discriminator loss is a little small. I'm not sure whether this is a problem with the amount of training data. (Also, I'm not sure which is better in general, PWG or MB-MelGAN, in terms of quality.) Did you try a different lambda_adv?

Hi @kan-bayashi, I have a task similar to @unilight's (multi-speaker & few samples). The clarity seems good but the speaker similarity is poor, and now I'm trying to train Parallel WaveGAN on only one speaker, but the loss is almost not decreasing. Which parameters could I adjust so that training keeps converging?

kan-bayashi commented 4 years ago

the loss is almost not decreasing

Which loss value do you mean? The discriminator loss of PWG stays at roughly the same value, around 0.5.
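For intuition on that 0.5, assuming the LSGAN-style MSE objective used here (targets 1 for real, 0 for fake): a discriminator that can no longer separate real from fake outputs about 0.5 everywhere, and the real and fake terms then sum to 0.5:

```python
# An undecided discriminator outputs ~0.5 for both real and fake audio.
d_out = 0.5
real_loss = (d_out - 1.0) ** 2  # MSE toward the "real" target, 1
fake_loss = (d_out - 0.0) ** 2  # MSE toward the "fake" target, 0
print(real_loss + fake_loss)    # 0.5 -- the plateau described above
```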

Approximetal commented 4 years ago

the loss is almost not decreasing

Which loss value do you mean? The discriminator loss of PWG stays at roughly the same value, around 0.5.

The figure is my TensorBoard log. The generator loss drops at 1M because I changed lambda_adv to 2, but it doesn't seem to have worked.

Approximetal commented 4 years ago

@kan-bayashi I've tried several times, modifying the learning rate and lambda_adv, but it always starts overfitting after 1M iterations. Any ideas on how to avoid this? Thanks.

kan-bayashi commented 4 years ago

I think that in the case of PWG, 1M iterations are enough. Why don't you try adaptation from a good single-speaker model? In my experiments, it works well with only 50k iters.

Approximetal commented 4 years ago

I think that in the case of PWG, 1M iterations are enough. Why don't you try adaptation from a good single-speaker model? In my experiments, it works well with only 50k iters.

Thanks for replying, @kan-bayashi. I'm now training single-speaker models; some speakers are clear, while others have louder noise. As the results in single_speaker_inference_1000k.zip show, the quality is not good enough. I'm not sure of the main reason: insufficient training, too little training data (only 70 sentences per speaker), or the quality of the training data.

kan-bayashi commented 4 years ago

I'm now training single-speaker models

Did you use a pretrained model? 70 utterances are not enough to train from scratch. I think it is better to consider using a pretraining scheme.
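A rough sketch of that pretrained scheme: initialize the generator from an existing checkpoint, then fine-tune on the target data. The checkpoint path and the "model" → "generator" state-dict keys follow my reading of how this repo saves checkpoints, so treat them as assumptions.

```python
import torch
from parallel_wavegan.models import ParallelWaveGANGenerator

# Initialize the generator from an existing checkpoint, then fine-tune on
# the ~70 target utterances. The path and state-dict keys are assumptions.
generator = ParallelWaveGANGenerator()  # default v1-style architecture
state = torch.load("pretrained/checkpoint-400000steps.pkl", map_location="cpu")
generator.load_state_dict(state["model"]["generator"])
# ...then resume the usual training loop; adaptation may need only ~50k iters.
```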

Approximetal commented 4 years ago

I'm now training single-speaker models

Did you use a pretrained model? 70 utterances are not enough to train from scratch. I think it is better to consider using a pretraining scheme.

The mel-spectrogram parameters (ZH, 16 kHz) do not fit the pretrained model, so I have to train from scratch. I previously trained a multi-speaker, multi-language model for 840k iterations and then fine-tuned it on a single speaker; the results are uploaded above (the mel spectrograms were generated by a voice conversion model). Since yesterday, I have been following this advice:

Why don't you try adaptation from a good single-speaker model? In my experiments, it works well with only 50k iters.

It is at 80k iters now, and the quality hasn't reached the level of the 1000k models yet. I'll keep training to see when it reaches its best performance.

ZDisket commented 4 years ago

@kan-bayashi When you trained PWGAN up to 50k with a pretrained model, when did you turn on the discriminator: after a certain number of steps, or from the start? When I fine-tune a female voice on LJSpeech, it almost always sounds good after 20k steps, but male voices sound bad.

kan-bayashi commented 4 years ago

In my case, I use the discriminator from the first iteration. Both male and female adaptation with the pretrained female model works well, but it depends on speaker similarity. If you have a good male-speaker dataset, it is better to consider creating a single male-speaker model.
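In config terms, that switch-on point is just a step threshold in the training loop; a schematic sketch (the yaml key name discriminator_train_start_steps is my recollection of this repo's configs, so verify it against your config file):

```python
# Schematic of the switch-on logic; 0 = adversarial loss from the first iter.
discriminator_train_start_steps = 0

for steps in range(50_000):
    # ...generator update with the auxiliary STFT loss happens every step...
    if steps >= discriminator_train_start_steps:
        # ...add the adversarial term and update the discriminator...
        pass
```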

ZDisket commented 4 years ago

@kan-bayashi VCTK has some male speakers; can we fine-tune a single male speaker on a multi-speaker model?

kan-bayashi commented 4 years ago

In my adaptation experiments using a female dataset, multi-speaker-based adaptation was worse than single-female-speaker-based adaptation. But if you do not have a male model, it is worth trying.

hyysam commented 4 years ago

I also find that MB-MelGAN generates noisy speech, especially at high frequencies. I wonder whether the discriminator cannot be downsampled because of the subband processing.
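For reference on the subband path: the MB-MelGAN generator predicts subband streams at a reduced rate, and PQMF synthesis recombines them to full band before the discriminator sees the signal, so high-frequency artifacts can also come from imperfect subband reconstruction. A minimal round-trip sketch (class location and signature follow my reading of the repo):

```python
import torch
from parallel_wavegan.layers import PQMF  # filterbank used by MB-MelGAN

# Round trip through the 4-band PQMF: analysis splits a full-band waveform
# into 4 subband streams at 1/4 rate; synthesis recombines them. In
# MB-MelGAN the generator predicts the subband streams directly and the
# discriminator sees the recombined full-band signal.
pqmf = PQMF(subbands=4)
x = torch.randn(1, 1, 16000)      # (batch, 1, samples) full-band waveform
subbands = pqmf.analysis(x)       # -> (batch, 4, samples / 4)
x_hat = pqmf.synthesis(subbands)  # -> back to (batch, 1, samples)
print((x - x_hat).abs().max())    # small residual: reconstruction is near-perfect
```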