I'm trying to train the VocGAN model in two stages: STFT pretraining, then adversarial training (plus STFT loss), but I'm getting metallic/robotic-sounding speech. If I train MelGAN, I get nearly normal speech in about ~100 epochs, but VocGAN gives significantly worse results even with more epochs (100, 200, 300, ...).
Is this normal? Or do I just need to train longer?
If it's not normal, what should I check in my model and pipeline? (I forked your repo, but I had to adapt the model to work with a hop_length of 200 and a sampling rate of 16 000.)
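For context, one thing I verified when adapting the hop_length is that the generator's upsampling factors still multiply to the new hop_length, since otherwise the generated waveform length won't match the ground truth. A minimal sanity check (the 5·5·4·2 split is just one possible factorization I chose, not necessarily what the original config uses):

```python
# Sanity check: the product of the generator's upsample factors must
# equal hop_length, or generated and target waveform lengths diverge.
from math import prod

hop_length = 200
upsample_factors = [5, 5, 4, 2]  # hypothetical split for 200; default 256 is factored differently

assert prod(upsample_factors) == hop_length, (
    "Upsample factors must multiply to hop_length"
)
print(prod(upsample_factors))
```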