gafsd opened this issue 4 years ago
I could not train MB-MelGAN with a small dataset (I tried with Arctic, around 1000 utterances). If you cannot increase the dataset size, PWG adaptation will work better.
Or you can try this tip: https://github.com/kan-bayashi/ParallelWaveGAN/issues/171#issuecomment-676765007
OK, I will try PWG adaptation first. For the PWG discriminator with MB-MelGAN, do you have a sample config/results? MB-MelGAN is much faster, so it would be good to have this.
I tried PWG adaptation, but how many iterations are needed before the output is good? The loss goes down after only a few thousand iterations, but the output sounds much worse (maybe because the input FastSpeech2 model predicts the wrong pitch?).
For the PWG discriminator with MB-Melgan, do you have sample config/results?
No. If you want to do that, you will need to slightly modify the code to load only the generator parameters.
I tried PWG adaptation, but how many iterations are needed before the output is good?
In my case, I train for 50k-100k iterations, with the discriminator enabled from the first iteration.
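Enabling the discriminator "from the first iteration" is something you can express in the training config. A hedged sketch of the relevant keys, assuming the `discriminator_train_start_steps` and `train_max_steps` options used in the repo's yaml configs (verify against your own config file):

```yaml
# Fragment of an adaptation config (e.g. based on parallel_wavegan.v1.yaml).
# discriminator_train_start_steps controls when the adversarial loss kicks in;
# 0 enables the discriminator from the very first iteration, as described above.
discriminator_train_start_steps: 0
# 50k-100k adaptation iterations, per the suggestion above.
train_max_steps: 100000
```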
Hmm, I tried with the parallel_wavegan.v3.yaml config and just --resume from the checkpoint, but the results are much worse after 130k iterations, even though the loss curves arguably look better. Should I be freezing some layers or using a different discriminator? How are you adapting?
No. Please try v1. v3 uses the MelGAN discriminator, which makes adaptation difficult, as you can see in the discussion in #171.
You are right, v1 is training better. I will evaluate after 100k iterations, but it is looking better so far. Is it possible to resume only certain layers, as in ESPnet? I tried to create my own MB-MelGAN with the PWG discriminator, but the state dicts differ when I try to load the MB-MelGAN checkpoint with the mixed config.
You can change this part to load only generator params: https://github.com/kan-bayashi/ParallelWaveGAN/blob/53d14969089b3d3229fe6bfce221234f25a9d836/parallel_wavegan/bin/train.py#L142-L148
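A minimal sketch of the idea (not the repo's actual code): treat the checkpoint as a nested dict and copy over only the generator weights, leaving the discriminator, optimizer state, and step counter at their fresh initialization. The checkpoint layout assumed below (`{"model": {"generator": ..., "discriminator": ...}, "optimizer": ..., "steps": ...}`) is an assumption based on how the linked train.py section loads state; in the real code you would call `load_state_dict` on the generator module instead of `dict.update`.

```python
def load_generator_only(checkpoint, model):
    """Restore only the generator weights from a checkpoint dict.

    `checkpoint` and `model` are plain dicts standing in for the torch
    state dicts used in train.py; the real modification would call
    model["generator"].load_state_dict(...) instead of dict.update.
    """
    # Copy generator weights from the pretrained checkpoint ...
    model["generator"].update(checkpoint["model"]["generator"])
    # ... and deliberately skip checkpoint["model"]["discriminator"],
    # checkpoint["optimizer"], and checkpoint["steps"], so the new
    # (PWG) discriminator trains from scratch.
    return model


# Toy usage: the pretrained generator weights come over, while the
# fresh discriminator keeps its initialization.
ckpt = {
    "model": {"generator": {"w": 1.0}, "discriminator": {"w": 2.0}},
    "optimizer": {},
    "steps": 400000,
}
model = {"generator": {"w": 0.0}, "discriminator": {"w": 0.0}}
model = load_generator_only(ckpt, model)
print(model["generator"]["w"], model["discriminator"]["w"])  # 1.0 0.0
```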
So I have adapted PWG for 85k iterations and MB-PWG for 60k iterations. I found that PWG sounds OK pitch-wise (I have pitch problems with the universal PWG), but there is a lot of background noise and static, which makes it sound bad even though the pitch is better than the pretrained PWG.
With MB-PWG the pitch is bad and there is a lot of robotic noise, which makes it hard to use, I think. I used --pretrain rather than --resume, if that makes a difference.
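For context, the two flags are meant to behave differently: --resume restores the full training state (model, optimizer, step counter) to continue an interrupted run, while --pretrain initializes only the model weights and starts the step counter from 0, which is what you usually want for fine-tuning. A usage sketch with placeholder paths (check `parallel-wavegan-train --help` for the exact flag names in your version):

```shell
# Fine-tune from a pretrained checkpoint: model weights only,
# optimizer state and step counter start fresh.
parallel-wavegan-train \
    --config conf/parallel_wavegan.v1.yaml \
    --train-dumpdir dump/train \
    --dev-dumpdir dump/dev \
    --outdir exp/adapt_v1 \
    --pretrain /path/to/pretrained/checkpoint-400000steps.pkl

# Resume an interrupted run: restores optimizer state and step counter too.
parallel-wavegan-train \
    --config conf/parallel_wavegan.v1.yaml \
    --train-dumpdir dump/train \
    --dev-dumpdir dump/dev \
    --outdir exp/adapt_v1 \
    --resume exp/adapt_v1/checkpoint-85000steps.pkl
```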
Any idea on how I can improve this result?
Could it be a problem with the FastSpeech2 training? I just noticed that ESPnet's FS2 recipe uses the vocoder as part of teacher forcing (also, the default number of Griffin-Lim iterations is 4, which seems low). Should I train PWG first and then use that vocoder for FS2 training in ESPnet (stage 7)? I have been training FS2 first and then fine-tuning the vocoder.
Hello, I trained MB-MelGAN v3 for 700k steps on a small 570-utterance single-speaker dataset, and the output is very robotic; the loss curves do not look good either. What am I doing wrong? I also get bad results when resuming from a checkpoint.