mozilla / TTS

:robot: :speech_balloon: Deep learning for Text to Speech (Discussion forum: https://discourse.mozilla.org/c/tts)

Overtraining MelGAN causes high freq noise in results #532

Closed · erogol closed this issue 3 years ago

erogol commented 4 years ago

I realized that training the MelGAN vocoder for too long (>1M steps for the universal vocoder, >850K steps for German) reduces the quality and introduces high freq noise into the results.

I'm just calling this out for anyone who is interested in digging into it more. My guess is that it is either the combination of loss weights, or the discriminator (which is enabled later in training) emphasizing the wrong qualities of the voice.

Any thoughts?

george-roussos commented 4 years ago

I have noticed it too, but in ParallelWaveGAN. My guess is that the generator either gets too weak and is penalised too often, or that the network just learns an identity mapping after 500K steps, once both the discriminator and the generator have stabilized. For me, the peak has always been at around 500K; after that it either starts to sound worse or doesn't improve at all. I remember initially some people tried to train both the generator and the discriminator from the start, but there was a lot of noise.

thorstenMueller commented 4 years ago

If a kind of metallic voice is what's meant by high freq noise, we encountered this issue too with a German dataset and a PWGAN vocoder. We trained Tacotron2 for 460k steps and the PWGAN vocoder model for 925k steps.

https://soundcloud.com/thorsten-mueller-395984278/sets

george-roussos commented 4 years ago

@erogol is there any intuition as to why the vocoder configs here in the Mozilla TTS repo use the same LRs for both the generator and the discriminator? I saw that other implementations and papers mention smaller learning rates for the discriminator when it starts training. Now I am trying ParallelWaveGAN with the discriminator enabled from step 100K and a discriminator learning rate of 0.00005. Maybe the quality degrades because the discriminator gets too strong.
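
In plain PyTorch, the setup I mean looks roughly like this (a minimal sketch with stand-in one-layer modules and a hinge loss for illustration; not the actual repo code):

```python
import torch
import torch.nn.functional as F

gen = torch.nn.Linear(80, 256)     # stand-in for the vocoder generator
disc = torch.nn.Linear(256, 1)     # stand-in for the discriminator

opt_gen = torch.optim.Adam(gen.parameters(), lr=1e-4)
opt_disc = torch.optim.Adam(disc.parameters(), lr=5e-5)   # smaller disc LR

DISC_START = 100_000   # step at which adversarial training kicks in

def train_step(step, mel, target):
    fake = gen(mel)
    g_loss = F.l1_loss(fake, target)          # reconstruction term only, at first

    if step >= DISC_START:
        # update the discriminator on real vs. generated (hinge loss)
        d_loss = (F.relu(1.0 - disc(target)).mean()
                  + F.relu(1.0 + disc(fake.detach())).mean())
        opt_disc.zero_grad()
        d_loss.backward()
        opt_disc.step()
        g_loss = g_loss - disc(fake).mean()   # add the adversarial term

    opt_gen.zero_grad()
    g_loss.backward()
    opt_gen.step()
```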

erogol commented 4 years ago

I've not tried different values. They were the first values I tried and they worked reasonably well.

thorstenMueller commented 3 years ago

> Now I am trying ParallelWaveGAN with the discriminator enabled from step 100K and a discriminator learning rate of 0.00005. Maybe the quality degrades because the discriminator gets too strong.

Hey @george-roussos, do you have any results with the new training values yet? Is the quality improving?

george-roussos commented 3 years ago

Hey, so I tried the slashed learning rate for the discriminator, updating the LRs every 200K steps, along with a batch size of 8. At 500K steps my speaker sounded much more stable, and the spectrograms were still improving (though not by a lot). I would suggest a run; it helped with my speaker.

thorstenMueller commented 3 years ago

Thanks for your reply.

george-roussos commented 3 years ago

No problem! It is single speaker, so it isn't transferable. And I cannot share samples unfortunately, because the speaker only gave me permission for testing, sorry. The problem I had with PWGAN was that it sounded shaky, not metallic. Metallic is my issue in all MelGAN models, though (especially in breathing). With the reduced LR the shakiness is reduced, I can see the generated spectrograms resemble the targets better, and at 500K steps the discriminator is still fighting to improve (it has not dropped below 0.490, which was my problem before). I got the idea because I thought the discriminator kept converging too quickly during training and hindered the generator's learning by penalizing it, and then I saw that the original paper used the same LRs.

Another thing I noticed is that the GAN generally tends to be much less forgiving of spectrogram mismatches from TTS. My TTS has a lot of vocal fry, and in the GAN this translates to noise in the lower frequencies (which I also do not know how to solve).

I trained without mean-var normalization and I would like to try with it. Also, I trained with the OMP_NUM_THREADS=1 prefix to avoid CPU bottlenecking. If anyone knows of any multispeaker datasets with less quality variation than LibriTTS, I can try training a new vocoder on that (any language would do).
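
(If you launch from inside Python rather than prefixing the shell command, something like this should be equivalent; illustrative sketch:)

```python
import os
os.environ["OMP_NUM_THREADS"] = "1"   # must be set before numpy/torch are imported

import torch  # noqa: E402
torch.set_num_threads(1)              # also cap torch's own intra-op thread pool
```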

These are the only changes I made (and batch size 8):

// OPTIMIZER
    "epochs": 10000,                // total number of epochs to train.
    "wd": 0.0,                // Weight decay weight.
    "gen_clip_grad": -1,      // Generator gradient clipping threshold. Apply gradient clipping if > 0
    "disc_clip_grad": -1,     // Discriminator gradient clipping threshold.
    "lr_scheduler_gen": "MultiStepLR",   // one of the schedulers from https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
    "lr_scheduler_gen_params": {
        "gamma": 0.5,
        //"milestones": [100000, 200000, 300000, 400000, 500000, 600000]
        "milestones": [200000, 400000, 600000]
    },
    "lr_scheduler_disc": "MultiStepLR",   // one of the schedulers from https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
    "lr_scheduler_disc_params": {
        "gamma": 0.5,
        //"milestones": [100000, 200000, 300000, 400000, 500000, 600000]
        "milestones": [200000, 400000, 600000]
    },
    "lr_gen": 0.0001,                  // Initial learning rate. If Noam decay is active, maximum learning rate.
    "lr_disc": 0.00005,
thorstenMueller commented 3 years ago

Thanks for your detailed reply. Since we don't have issues with shakiness, but more with a metallic sound (*GAN), maybe we'll try WaveRNN to get rid of it.

george-roussos commented 3 years ago

No problem! You can also try WaveGrad. I got some nice results with it and convergence only needs 2 days 😀 I really do not understand how all these MelGAN papers are able to achieve such nice, non-metallic results when everything I have tried has not worked this well.

lexkoro commented 3 years ago

@george-roussos I've started a PWGAN run on a German multispeaker dataset. Will report back once the first results come in.

OswaldoBornemann commented 3 years ago

May I ask whether the number of dataset hours affects the PWGAN results?

george-roussos commented 3 years ago

> @george-roussos I've started a PWGAN run on a German multispeaker dataset. Will report back once the first results come in.

Awesome!

@tsungruihon:

> May I ask whether the number of dataset hours affects the PWGAN results?

Personally I always train with material that is around 25 hours 😀

thorstenMueller commented 3 years ago

> May I ask whether the number of dataset hours affects the PWGAN results?

We are experimenting with my publicly contributed German dataset (around 23 hours).

https://github.com/thorstenMueller/deep-learning-german-tts#download-information

shahruk10 commented 3 years ago

@erogol Is the noise you mentioned concentrated at a particular frequency? When I tried adapting the pretrained Multiband MelGAN model (starting from the 1.4M-step checkpoint) I encountered such noise.

I am currently training the MelGAN vocoder from scratch with my own data (outputs from Tacotron2). For me the high freq noise appeared again at the beginning of training and stayed around for 100k steps. It was quite narrowband and of uniform magnitude; it was also always centered at a frequency of 1/4 the sample rate (22050 / 4 = 5512.5 Hz).

I wonder if it has something to do with the PQMF synthesis... one of the bands going awry and introducing the noise somehow?
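
One quick way to check where the noise sits (illustrative sketch; the file name is made up, and it assumes scipy and soundfile): for a 4-band PQMF the band edges fall at fs/8, fs/4 and 3*fs/8, so a leaky band boundary should show up as a narrow peak near fs/4.

```python
import numpy as np
import soundfile as sf
from scipy.signal import welch

wav, fs = sf.read("synth_sample.wav")     # e.g. fs = 22050, mono
freqs, psd = welch(wav, fs=fs, nperseg=4096)

# strongest component in a window around the suspected band edge at fs/4
window = (freqs > 4000) & (freqs < 7000)
peak = freqs[window][np.argmax(psd[window])]
print(f"strongest 4-7 kHz component: {peak:.0f} Hz (fs/4 = {fs / 4:.1f} Hz)")
```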

Here's a screenshot of the audio power spectrum of the synthesized audio, viewed in Audacity.

[screenshot: power spectrum]

I am still training - just reached 1M steps. It's sounding very good. Haven't seen the reemergence of the noise yet.

OswaldoBornemann commented 3 years ago

@shahruk10 May I ask whether training on GTA outputs generates better results than training on the original mel spectrograms?

OswaldoBornemann commented 3 years ago

@george-roussos My dataset is only about 7 hours; I don't know whether this will affect the results. :sweat:

lexkoro commented 3 years ago

@george-roussos Any results from your side so far?

george-roussos commented 3 years ago

> @george-roussos Any results from your side so far?

I stopped training at 500K since it wasn't getting any better and the loss stabilised around 0.49. So nothing new :-)

ysujiang commented 3 years ago

> @erogol Is the noise you mentioned concentrated at a particular frequency? When I tried adapting the pretrained Multiband MelGAN model (starting from the 1.4M-step checkpoint) I encountered such noise. [...]

@shahruk10 Hello, I got the same result as in the picture. Can you tell me how to solve it?

erogol commented 3 years ago

@shahruk10 I'd be interested to hear your feedback if you make any progress.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also look at our Discourse page for further help. https://discourse.mozilla.org/c/tts
