gafsd opened this issue 4 years ago
I could not train MB-MelGAN with a small dataset (I tried with Arctic, around 1000 utterances). If you cannot increase the dataset size, PWG adaptation will work better.
Or you can try this tip: https://github.com/kan-bayashi/ParallelWaveGAN/issues/171#issuecomment-676765007
OK, I will try PWG adaptation first. For the PWG discriminator with MB-MelGAN, do you have a sample config/results? MB-MelGAN is much faster, so it would be good to have this.
I tried PWG adaptation, but how many iterations are needed before the output is good? The loss goes down after only a few thousand iterations, but the output sounds much worse (maybe because the input FastSpeech2 model predicts the wrong pitch?).
For the PWG discriminator with MB-Melgan, do you have sample config/results?
No. If you want to do that, you will need to slightly modify the code to load only the generator parameters.
I tried PWG adaptation, but how many iterations are needed before the output is good?
In my case, I train for 50k-100k iterations, with the discriminator enabled from the first iteration.
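Enabling the discriminator "from the first iteration" is something you can express in the training config. A hedged sketch of the relevant keys, assuming the `discriminator_train_start_steps` and `train_max_steps` options used in the repo's yaml configs (verify against your own config file):

```yaml
# Fragment of an adaptation config (e.g. based on parallel_wavegan.v1.yaml).
# discriminator_train_start_steps controls when the adversarial loss kicks in;
# 0 enables the discriminator from the very first iteration, as described above.
discriminator_train_start_steps: 0
# 50k-100k adaptation iterations, per the suggestion above.
train_max_steps: 100000
```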
Hmm, I tried with the parallel_wavegan.v3.yaml config and just --resume from the checkpoint, but the results are much worse after 130k iterations, even though the loss curves arguably look better. Should I be freezing some layers or using a different discriminator? How are you adapting?
No. Please try v1. v3 uses the MelGAN discriminator, which makes adaptation difficult, as you can see in the discussion in #171.
You are right, v1 is training better. I will evaluate after 100k iterations, but it is looking better so far. Is it possible to resume only certain layers, as in ESPnet? I tried to create my own MB-MelGAN with the PWG discriminator, but the state dicts differ when I try to load the MB-MelGAN checkpoint with the mixed config.
You can change this part to load only generator params: https://github.com/kan-bayashi/ParallelWaveGAN/blob/53d14969089b3d3229fe6bfce221234f25a9d836/parallel_wavegan/bin/train.py#L142-L148
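A minimal sketch of the idea (not the repo's actual code): treat the checkpoint as a nested dict and copy over only the generator weights, leaving the discriminator, optimizer state, and step counter at their fresh initialization. The checkpoint layout assumed below (`{"model": {"generator": ..., "discriminator": ...}, "optimizer": ..., "steps": ...}`) is an assumption based on how the linked train.py section loads state; in the real code you would call `load_state_dict` on the generator module instead of `dict.update`.

```python
def load_generator_only(checkpoint, model):
    """Restore only the generator weights from a checkpoint dict.

    `checkpoint` and `model` are plain dicts standing in for the torch
    state dicts used in train.py; the real modification would call
    model["generator"].load_state_dict(...) instead of dict.update.
    """
    # Copy generator weights from the pretrained checkpoint ...
    model["generator"].update(checkpoint["model"]["generator"])
    # ... and deliberately skip checkpoint["model"]["discriminator"],
    # checkpoint["optimizer"], and checkpoint["steps"], so the new
    # (PWG) discriminator trains from scratch.
    return model


# Toy usage: the pretrained generator weights come over, while the
# fresh discriminator keeps its initialization.
ckpt = {
    "model": {"generator": {"w": 1.0}, "discriminator": {"w": 2.0}},
    "optimizer": {},
    "steps": 400000,
}
model = {"generator": {"w": 0.0}, "discriminator": {"w": 0.0}}
model = load_generator_only(ckpt, model)
print(model["generator"]["w"], model["discriminator"]["w"])  # 1.0 0.0
```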
So I have adapted PWG for 85k iterations and MB-PWG for 60k iterations. I found that PWG sounds OK pitch-wise (I have pitch problems with the universal PWG), but there is a lot of background noise and static, which makes it sound bad even though the pitch is better than the pretrained PWG.
With MB-PWG the pitch is bad and there is a lot of robotic noise, which makes it hard to use, I think. I used --pretrain rather than --resume, if that makes a difference.
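For context, the two flags are meant to behave differently: --resume restores the full training state (model, optimizer, step counter) to continue an interrupted run, while --pretrain initializes only the model weights and starts the step counter from 0, which is what you usually want for fine-tuning. A usage sketch with placeholder paths (check `parallel-wavegan-train --help` for the exact flag names in your version):

```shell
# Fine-tune from a pretrained checkpoint: model weights only,
# optimizer state and step counter start fresh.
parallel-wavegan-train \
    --config conf/parallel_wavegan.v1.yaml \
    --train-dumpdir dump/train \
    --dev-dumpdir dump/dev \
    --outdir exp/adapt_v1 \
    --pretrain /path/to/pretrained/checkpoint-400000steps.pkl

# Resume an interrupted run: restores optimizer state and step counter too.
parallel-wavegan-train \
    --config conf/parallel_wavegan.v1.yaml \
    --train-dumpdir dump/train \
    --dev-dumpdir dump/dev \
    --outdir exp/adapt_v1 \
    --resume exp/adapt_v1/checkpoint-85000steps.pkl
```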
Any idea on how I can improve this result?
Could it be a problem with the FastSpeech2 training? I just noticed that ESPnet's FS2 recipe uses the vocoder as part of teacher forcing (also, the default number of Griffin-Lim iterations is 4, which seems low). Should I train PWG first and then use that vocoder for FS2 training in ESPnet (stage 7)? I have been training FS2 first and then fine-tuning the vocoder.
Hello, I trained MB-MelGAN v3 for 700k steps on a small 570-utterance single-speaker dataset, and the output is very robotic; the loss curves do not look good either. What am I doing wrong? I also get bad results when resuming from a checkpoint.