I'm trying to train the VocGAN model in two stages: STFT pretraining, then adversarial training (plus STFT loss), but I'm getting metallic/robotic-sounding speech. If I train MelGAN, I get nearly normal speech in about ~100 epochs, but VocGAN gives significantly worse results even with more epochs (100, 200, 300, ...).
Is this normal? Or do I just need to train longer?
If it's not normal, what should I check in my model and pipeline? (I forked your repo, but I had to adapt the model to work with a hop_length of 200 and a sampling rate of 16 000.)
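For context, one thing I verified when adapting the hop_length is that the generator's upsampling factors still multiply to the new hop_length, since otherwise the generated waveform length won't match the ground truth. A minimal sanity check (the 5·5·4·2 split is just one possible factorization I chose, not necessarily what the original config uses):

```python
# Sanity check: the product of the generator's upsample factors must
# equal hop_length, or generated and target waveform lengths diverge.
from math import prod

hop_length = 200
upsample_factors = [5, 5, 4, 2]  # hypothetical split for 200; default 256 is factored differently

assert prod(upsample_factors) == hop_length, (
    "Upsample factors must multiply to hop_length"
)
print(prod(upsample_factors))
```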