kan-bayashi / ParallelWaveGAN

Unofficial Parallel WaveGAN (+ MelGAN & Multi-band MelGAN & HiFi-GAN & StyleMelGAN) with Pytorch
https://kan-bayashi.github.io/ParallelWaveGAN/
MIT License

Generator exploded after ~138K iters. #61

Closed. erogol closed this issue 4 years ago.

erogol commented 4 years ago

I observed an interesting behaviour after ~138K iters: the discriminator dominated the training and the generator exploded in both train and validation losses. Do you have any idea why, and how to prevent it?

I am training on LJSpeech and I basically use the same learning schedule you released with the v2 config for LJSpeech (train the generator alone until 100K iterations, then enable the discriminator).

Here is the tensorboard screenshot.


kan-bayashi commented 4 years ago

Hi @erogol. I have also run into this problem and am trying to solve it now (#54). You can see our discussion in https://github.com/kan-bayashi/ParallelWaveGAN/issues/27#issuecomment-574457093. The things I tried are as follows:

erogol commented 4 years ago

A short literature review gave me these options, if they make any sense:

Have you tried anything similar to these, so I could try the rest?

dathudeptrai commented 4 years ago

Hi @erogol @kan-bayashi, using transposed convolutions rather than upsampling in the generator will help (use a large enough kernel_size and stride). I'm combining MelGAN and Parallel WaveGAN as in discussion #46, and the results seem very good to me.
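
For illustration, this is roughly what that swap looks like in PyTorch; the channel counts and layer sizes are made up for the example and are not the repo's actual generator modules:

```python
import torch
import torch.nn as nn

# Upsampling-based block (what the suggestion replaces): nearest-neighbor
# upsample followed by a regular convolution.
upsample_block = nn.Sequential(
    nn.Upsample(scale_factor=4, mode="nearest"),
    nn.Conv1d(256, 128, kernel_size=7, padding=3),
)

# Transposed-convolution block (the suggestion): stride sets the upsampling
# factor, and the kernel is made large enough (here 2x the stride) to cover
# adjacent output frames and reduce checkerboard artifacts.
transposed_block = nn.ConvTranspose1d(256, 128, kernel_size=8, stride=4, padding=2)

x = torch.randn(1, 256, 100)      # (batch, channels, frames)
print(upsample_block(x).shape)    # torch.Size([1, 128, 400])
print(transposed_block(x).shape)  # torch.Size([1, 128, 400])
```
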

erogol commented 4 years ago

@dathudeptrai are you merging the models only architecturally, or the training methods as well?

dathudeptrai commented 4 years ago

@erogol Both. I use the training method from this repo (v2, as you are using).

kan-bayashi commented 4 years ago

That is interesting. I will add the MelGAN generator this weekend.

dathudeptrai commented 4 years ago

Keep in mind that in every GAN architecture the authors have tuned the parameters and the architecture carefully so that the generator is not stronger than the discriminator and vice versa. In your config you don't use the original discriminator (you use the residual discriminator instead), which makes the discriminator stronger, so you need to modify the generator too (increase the number of layers, kernel size, ...) :D

kan-bayashi commented 4 years ago

I added the MelGAN generator in #62. Training is ongoing :D

dathudeptrai commented 4 years ago

@kan-bayashi wow, very fast :)) Training should be about 5x faster than Parallel WaveGAN :D, let's see.

kan-bayashi commented 4 years ago

I added initial MelGAN results in #65 (based on the v1 config). The results seem reasonable, and there is room for improvement if we continue training for more iterations (training is ongoing). I'm also trying a v2-based MelGAN (ongoing). Please look forward to the results :D

dathudeptrai commented 4 years ago

@kan-bayashi Let's see. I'm training MelGAN (almost the same as you are doing; I use the multi-scale discriminator loss that MelGAN introduces, plus some modifications) with 8-bit fake-quantization training. I also converted it successfully to TFLite; inference on a single mobile CPU thread is around 2x faster than real time.

kan-bayashi commented 4 years ago

@dathudeptrai You mean this feature? https://pytorch.org/docs/stable/quantization.html#torch.quantization.FakeQuantize That is amazing! Real-time generation on device has become a reality :D

dathudeptrai commented 4 years ago

@kan-bayashi No, I use https://github.com/Xilinx/brevitas (the inputs need to be expanded to 4D for training, since this framework only supports Conv2d, but that's enough :D). I also have a larger version of the MelGAN generator, which I convert to TensorFlow and optimize with TensorRT on the server side :D. It seems all we need to do now is improve the quality :)), speed isn't a problem anymore :D
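
For context, "expand the input to 4D" here means treating the 1D feature sequence as an image of height 1 so a Conv2d (the only conv type such a framework supports) can stand in for Conv1d. A minimal plain-PyTorch sketch of the idea, not brevitas's actual API:

```python
import torch
import torch.nn as nn

x = torch.randn(8, 80, 200)   # (batch, mel_channels, frames): a 3D Conv1d input

# Insert a dummy height dimension so the tensor becomes a 4D "image":
# (batch, channels, height=1, width=frames).
x4d = x.unsqueeze(2)

# A Conv2d with kernel (1, k) over the 4D tensor behaves like a Conv1d with
# kernel k over the original 3D tensor; a quantization-aware Conv2d from the
# framework would be dropped in here instead of nn.Conv2d.
conv2d = nn.Conv2d(80, 256, kernel_size=(1, 7), padding=(0, 3))
y = conv2d(x4d).squeeze(2)    # back to (batch, 256, frames)
print(y.shape)                # torch.Size([8, 256, 200])
```
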

erogol commented 4 years ago

I can tell that even though there is a jump after 100K, voice fidelity still improves as training goes on. It is interesting to see.

@dathudeptrai is there an official way to convert models to tf, or is it just loading weights with the same architecture?

dathudeptrai commented 4 years ago

@erogol ONNX is one way you can try, but in my experience, loading the weights into the same architecture works best :)). I can implement both the torch and TF versions very quickly :D, and the conversion is just one for loop :D. In addition, with TF you can easily convert to TFLite or TensorRT to optimize for inference. I don't really like intermediate representations like ONNX because of the limited operator support :D
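
A rough sketch of the "one for loop" conversion, assuming a tf.keras re-implementation whose weights enumerate in the same order as the PyTorch model and no weight-norm or batch-norm buffers; copy_torch_to_keras is a hypothetical helper, not code from either repo:

```python
import numpy as np

def copy_torch_to_keras(torch_model, keras_model):
    """Copy weights from a PyTorch model into an identically structured
    tf.keras model (hypothetical helper; assumes matching layer order)."""
    new_weights = []
    for name, param in torch_model.state_dict().items():
        w = param.detach().cpu().numpy()
        # torch Conv1d kernels are (out_ch, in_ch, k); Keras Conv1D expects (k, in_ch, out_ch)
        if w.ndim == 3:
            w = np.transpose(w, (2, 1, 0))
        new_weights.append(w)
    keras_model.set_weights(new_weights)
```
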

erogol commented 4 years ago

I successfully trained MelGAN (generator) in my fork. MelGAN requires the v1 (smaller) discriminator; I guess the larger D is too strong for MelGAN to learn against.

You can check the model with TTS here: https://colab.research.google.com/drive/1Zg9jR27Pr-ziVa0krjtdoy2dKv6whv7b

There is still a slight background noise, but it comes from the TTS model. Somehow, GAN models are not as robust as WaveRNN to the spectrogram representation. Maybe it is a good idea to inject some random noise during training.
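
A minimal sketch of that noise-injection idea, assuming the training loop feeds (already normalized) mel spectrograms to the generator; the noise scale is an arbitrary illustrative value:

```python
import torch

def augment_mel(mel, noise_std=0.1, training=True):
    """Add small Gaussian noise to the input mel spectrogram so the vocoder
    becomes less sensitive to imperfect TTS-predicted spectrograms."""
    if training and noise_std > 0:
        mel = mel + noise_std * torch.randn_like(mel)
    return mel

# usage inside the training loop (illustrative):
# y_hat = generator(augment_mel(mel_batch))
```
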

dathudeptrai commented 4 years ago

@erogol please try ratios = [4,4,4,2,2] and increase n_residual, maybe it can help you eliminate the background noise :D

kan-bayashi commented 4 years ago

Currently, I am trying to train the MelGAN generator with

The training curve is not stable, but it sounds more natural than v1. Once I finish the training, I will upload the config and results.

kan-bayashi commented 4 years ago

please try ratios = [4,4,4,2,2] and increase n_residual, maybe it can help you eliminate the background noise :D

That's nice. I will try the following config.

generator_type: "MelGANGenerator" # Generator type.
generator_params:
    in_channels: 80                  # Number of input channels.
    out_channels: 1                  # Number of output channels.
    kernel_size: 7                   # Kernel size of initial and final conv layers.
    channels: 1024                   # Initial number of channels for conv layers.
    upsample_scales: [4, 4, 4, 2, 2] # List of Upsampling scales.
    stack_kernel_size: 3             # Kernel size of dilated conv layers in residual stack.
    stacks: 4                        # Number of stacks in a single residual stack module.
    use_weight_norm: True            # Whether to use weight normalization.
    use_causal_conv: False           # Whether to use causal convolution.
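
For reference, this repo's training script typically turns a config like this into a model by loading the YAML and passing generator_params to the class named by generator_type; a minimal sketch under that assumption (the config path is illustrative):

```python
import yaml

import parallel_wavegan.models

# path is illustrative; point it at whichever config you are editing
with open("conf/melgan.yaml") as f:
    config = yaml.safe_load(f)

# look up the generator class by name and build it from the config params
generator_class = getattr(parallel_wavegan.models, config["generator_type"])
generator = generator_class(**config["generator_params"])
print(generator)
```
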
dathudeptrai commented 4 years ago

I think the MelGAN D is stronger than the v2 D :))) based on my knowledge :))), and its discriminator seems very similar to the GAN-TTS discriminator (Google, ICLR 2020 paper): https://arxiv.org/abs/1909.11646

erogol commented 4 years ago

@erogol please try ratios = [4,4,4,2,2] and increase n_residual, maybe it can help you eliminate the background noise :D

It wouldn't work since my hop_length is 275 (the upsampling scales have to multiply to the hop length, and 4 x 4 x 4 x 2 x 2 = 256). Now I am training a couple of new models with a hop_length of 256; there I can try those upsampling params.

kan-bayashi commented 4 years ago

I added several new results.

* `melgan.v1`: MelGAN G + PWG D #67 ([sample](https://drive.google.com/open?id=1ieLCKR_GYaYnbm0OwHv84nLWzRLwF3zr))
* `melgan_large.v1`: Large MelGAN G + PWG D #75 ([sample](https://drive.google.com/open?id=1HpdLiaflMRw60N6_lQkdEA4uBzPCNWGO))
* `melgan.v3`: MelGAN G + MelGAN D #77 ([sample](https://drive.google.com/open?id=1KeN5ojS7yuZEoOfLltbncP-KpftipM8X))

According to a few samples, it seems that melgan.v3 >> melgan_large.v1 >= melgan.v1 in terms of the naturalness.

But the training curve of v3 is not so stable, i.e., both the fake and real losses gradually decrease, and the feature-matching loss keeps increasing...

[screenshot: training curves, 2020-02-19]

And the STFT-based losses are much higher than v1, yet the naturalness is higher. This is interesting, but it is difficult to estimate the quality from the loss values :(

dathudeptrai commented 4 years ago

@kan-bayashi Same observation as mine :))). The STFT, spectral, and feature-matching losses keep increasing over the course of training :D, and it needs much longer training; I trained around 3M steps. I'm thinking about the discriminator-overpowering problem: maybe it is a problem for the loss curves, but in terms of naturalness it isn't :v. It seems we need another metric to measure sound quality (MOS score). I'm reading about the Fréchet DeepSpeech Distance and Kernel DeepSpeech Distance, which the authors find to be well correlated with MOS (see the GAN-TTS paper: https://arxiv.org/pdf/1909.11646.pdf) :D

kan-bayashi commented 4 years ago

@dathudeptrai Thank you for the useful information. I will check the metrics.

Let me continue training until 4M iters. About the discriminator (fake and real) losses, do you think it is better to search for hyperparameters that keep them stable around 0.25, or is gradual decreasing acceptable?

dathudeptrai commented 4 years ago

@kan-bayashi I still think keeping it stable around 0.25 is better. Now we can conclude that the STFT-based losses make training converge faster and give better sound quality, but somehow their values aren't as important as we thought before. How about weighting the STFT-based losses lower when we start training the discriminator, or eliminating them :)))? When we use the STFT-based losses, it seems to make it harder for the generator to compete with the discriminator :))
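
A sketch of what that re-weighting could look like in the generator update, with illustrative weights and argument names (not the repo's actual training code):

```python
def generator_loss(stft_loss, adv_loss, steps,
                   discriminator_start=100_000,
                   lambda_stft=0.5, lambda_adv=4.0):
    """Combine STFT and adversarial terms for the generator, shrinking the
    STFT term once the discriminator is enabled (illustrative weights)."""
    if steps < discriminator_start:
        # warm-up phase: train the generator with the STFT loss only
        return stft_loss
    # after the discriminator kicks in, down-weight the STFT term (or set
    # lambda_stft=0.0 to drop it) so the generator can compete with the D
    return lambda_stft * stft_loss + lambda_adv * adv_loss
```
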

kan-bayashi commented 4 years ago

@dathudeptrai That is a nice idea. I will try it.

kan-bayashi commented 4 years ago

Does anyone have the learning curve of the original MelGAN training? I want to check it.

erogol commented 4 years ago

I added several new results.

* `melgan.v1`: MelGAN G + PWG D #67 ([sample](https://drive.google.com/open?id=1ieLCKR_GYaYnbm0OwHv84nLWzRLwF3zr))

* `melgan_large.v1`: Large MelGAN G + PWG D #75 ([sample](https://drive.google.com/open?id=1HpdLiaflMRw60N6_lQkdEA4uBzPCNWGO))

* `melgan.v3`: MelGAN G + MelGAN D #77 ([sample](https://drive.google.com/open?id=1KeN5ojS7yuZEoOfLltbncP-KpftipM8X))

According to a few samples, it seems that melgan.v3 >> melgan_large.v1 >= melgan.v1 in terms of the naturalness.

But the training curve of v3 is not so stable, i.e., both the fake and real losses gradually decrease, and the feature-matching loss keeps increasing...

[screenshot: training curves, 2020-02-19]

And the STFT-based losses are much higher than v1, yet the naturalness is higher. This is interesting, but it is difficult to estimate the quality from the loss values :(

I don't agree about the naturalness. I believe the ones with the PWGAN D are better, even in your examples, for instance 050-0032.wav.

That is also the case in my runs, so the PWGAN D works better so far.

I also believe the spectral features are good indicators of performance. In my experience, a model with better STFT losses always performs better with respect to voice quality.

dathudeptrai commented 4 years ago

@erogol Maybe your training is not long enough to see the performance of the MelGAN D (I trained the MelGAN D for around 7M steps to obtain the best result). If the STFT losses indicated voice quality, why wouldn't we just train the generator with the STFT loss alone to minimize it? When we start training the discriminator, the STFT losses increase, but the audio quality still increases too.

dathudeptrai commented 4 years ago

@kan-bayashi I have the learning curve of the original MelGAN training, but it is after 7M steps :))). At convergence, the mel reconstruction loss is 3.2, the generator loss is 10, the feature-matching loss is 10, and the discriminator loss is 0.75 :D

kan-bayashi commented 4 years ago

I don't agree about the naturalness. I believe the ones with the PWGAN D are better, even in your examples, for instance 050-0032.wav.

@erogol Oh really? To me, the melgan.v1 sample contains signal-processing-vocoder-like noise (e.g., WORLD or STRAIGHT), but v3 has no such noise. As @dathudeptrai said, maybe the MelGAN D requires a lot of iters. Let's compare the samples after 4M iters.

I have the learning curve of the original MelGAN training, but it is after 7M steps :))). At convergence, the mel reconstruction loss is 3.2, the generator loss is 10, the feature-matching loss is 10, and the discriminator loss is 0.75 :D

@dathudeptrai Thank you for the info. I started training without the STFT-based loss after introducing the discriminator, but the discriminator loss is still gradually decreasing. So I want to confirm whether the loss in the original training stays at the same value or keeps decreasing.

dathudeptrai commented 4 years ago

@erogol @kan-bayashi This is an official implementation of the DeepSpeech distances from the GAN-TTS paper :D: https://github.com/mbinkowski/DeepSpeechDistances. Maybe we can try it to measure the quality of our models :D

kan-bayashi commented 4 years ago

I added 3M-iteration samples (#80) and updated the demo page: https://kan-bayashi.github.io/ParallelWaveGAN/ I think PWG and MelGAN are now comparable.

dathudeptrai commented 4 years ago

@kan-bayashi Hi, I'm not a native speaker, but I think in terms of naturalness they are comparable too. How about the noise? In my experiments, models trained with the MelGAN D don't have the noise that the PWG D ones do.

kan-bayashi commented 4 years ago

How about the noise? In my experiments, models trained with the MelGAN D don't have the noise that the PWG D ones do.

You mean the noise observed in the melgan.v1 model, i.e., MelGAN G trained with the PWG D? If we train with the MelGAN D (melgan.v3), such noise disappears.

dathudeptrai commented 4 years ago

@kan-bayashi Okay :D. Just curious, what is your next plan :))? What is the next version you will train, v2? :D

kan-bayashi commented 4 years ago

Currently, I'm training MelGAN G + MelGAN D based on your idea (https://github.com/kan-bayashi/ParallelWaveGAN/issues/61#issuecomment-588007905), and I'm curious about the results with PWG G + MelGAN D, so I want to try it. I've trained several models based on v2, i.e., the ResidualParallelWaveGANDiscriminator, but the results are not so good, so v2 is not a high priority.

kan-bayashi commented 4 years ago

I compared the melgan.v3.long samples with the samples trained without STFT loss after introducing the discriminator, both at 4M iters. melgan.v3.long is clearly better, while the training curve of the discriminator loss is almost the same.

Red: w/o STFT loss after introducing the discriminator
Blue + Orange: melgan.v3.long

[screenshots: discriminator loss curves, 2020-02-28]

The STFT loss is not used for backpropagation, just for monitoring.

So, I conclude MelGAN G + MelGAN D + STFT-loss can improve the quality, or at least the convergence speed.
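
"Not used for backpropagation, just monitoring" amounts to computing the STFT loss outside the gradient path and only logging it; a minimal sketch, assuming a multi-resolution STFT criterion that returns spectral-convergence and log-magnitude terms:

```python
import torch

def monitoring_stft_loss(stft_criterion, y_hat, y):
    """Compute the STFT loss purely for logging; detaching the prediction and
    using no_grad ensure nothing flows back into the generator update."""
    with torch.no_grad():
        sc_loss, mag_loss = stft_criterion(y_hat.detach(), y)
    return (sc_loss + mag_loss).item()
```
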

dathudeptrai commented 4 years ago

Okay :D, how about PWG G + MelGAN D + STFT loss?

kan-bayashi commented 4 years ago

Okay :D, how about PWG G + MelGAN D + STFT loss?

Ongoing. The quality at 1.1M iters is so-so.

nartes commented 4 years ago

You can check the model with TTS here: https://colab.research.google.com/drive/1Zg9jR27Pr-ziVa0krjtdoy2dKv6whv7b

waveform.flatten() is required; otherwise the output is a 3D tensor of shape 1 x 1 x sample_length.

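
For anyone hitting the same shape problem, a minimal sketch of the fix (the sample rate and output path are illustrative):

```python
import soundfile as sf
import torch

waveform = torch.randn(1, 1, 22050)        # stand-in for the vocoder output, shape (1, 1, T)
audio = waveform.flatten().cpu().numpy()   # flatten to a 1-D array of samples
sf.write("output.wav", audio, 22050)       # wav writers expect 1-D (or (T, channels)) data
```
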

nartes commented 4 years ago

First verse from "Milky Chance - The Game": 400K EN vocoder (LJSpeech, ParallelWaveGAN), plus spleeter and manual lyrics alignment in Audacity.

kan-bayashi commented 4 years ago

@nartes Thank you for sharing the interesting sample! Did you remove the vocals using spleeter and then mix the generated voice with the separated instrumental track?

nartes commented 4 years ago

Yeah; also, feeding the lyrics line by line into Tacotron synthesizes distinct sentences, and later I needed to stretch the audio by 20% with a default algorithm in Audacity to match the original verse by Milky Chance.

kan-bayashi commented 4 years ago

Yeah; also, feeding the lyrics line by line into Tacotron synthesizes distinct sentences, and later I needed to stretch the audio by 20% with a default algorithm in Audacity to match the original verse by Milky Chance.

Interesting! Great work :)

nartes commented 4 years ago

1. What happens if the 100K restriction for the discriminator is replaced with, say, 0K? I mean not pretraining the model before applying the adversary.
2. Another question: what options are available to reduce model capacity so as to support a higher batch size, like 32 or 64, at least at a 16K sampling rate?
3. What are the GPU memory requirements at the moment? I have about 12 GiB of GPU RAM; with vanilla WaveNet, 16K sampling rate, and batch size 6, that is about 4-6M weight parameters.
4. Is it correct that without applying the discriminator, the pure STFT loss produces artifacts in the synthesis? That was my conclusion after listening to intermediate ParallelWaveGAN examples before the 130K checkpoint.

kan-bayashi commented 4 years ago

What happens if the 100K restriction for the discriminator is replaced with, say, 0K? I mean not pretraining the model before applying the adversary.

I tried. The quality is OK but it contains a strange noise.
https://github.com/kan-bayashi/ParallelWaveGAN/issues/27#issue-517832548

Another question: what options are available to reduce model capacity so as to support a higher batch size, like 32 or 64, at least at a 16K sampling rate?

You can try MelGAN or reduce batch_max_steps.

What are the GPU memory requirements at the moment? I have about 12 GiB of GPU RAM; with vanilla WaveNet, 16K sampling rate, and batch size 6, that is about 4-6M weight parameters.

You mean for training? Please check the header comments in the configs: https://github.com/kan-bayashi/ParallelWaveGAN#results There is information about the required GPUs.

Is it correct that without applying the discriminator, the pure STFT loss produces artifacts in the synthesis? That was my conclusion after listening to intermediate ParallelWaveGAN examples before the 130K checkpoint.

If we use only the STFT loss, the sound will be robotic. Adversarial training is needed to improve the perceptual quality.

nartes commented 4 years ago

Q: Is there some place where the mel spectrogram is transformed? I didn't find it in the repo source code.

1. MelGAN vocoder.

2. It is trained to decode the mel spectrogram.

3. Input features are generated with https://github.com/kan-bayashi/ParallelWaveGAN/blob/53ae639e355591299a94a004c22c49a32b70cc1f/parallel_wavegan/bin/preprocess.py#L25

4. Yet when comparing the FastSpeech mel-spectrogram synthesizer with the output of logmelfilterbank, I get different magnitude ranges:

    • [-4.9, 0.34] (logmelfilterbank)
    • [-1.6, 2.35] (FastSpeech)

5. And different synthesized audio.

6. Below are examples of the magnitude distribution, using the public Google Colab notebook linked in the repo. I replaced the synthesized mel spectrogram c with mel_spec2. The output is audible but noisy, and it gets better if the spectrogram is shifted, e.g. mel_spec2 + 2.0. I used the 1000K MelGAN in the notebook. With the 400K MelGAN and without shifting the mel spectrogram, the output is too quiet and has a constant artificial-frequency noise throughout.

[plots: magnitude distributions of the two mel spectrograms]

kan-bayashi commented 4 years ago

The mel spectrogram is normalized to mean 0 and variance 1 using the statistics of the training data. https://github.com/kan-bayashi/ParallelWaveGAN/blob/53ae639e355591299a94a004c22c49a32b70cc1f/egs/ljspeech/voc1/run.sh#L97-L102

Yet when comparing the FastSpeech mel-spectrogram synthesizer with the output of logmelfilterbank, I get different magnitude ranges:

This is because I normalize to mean 0 and variance 1; explicit maximum and minimum values are not defined.

With the 400K MelGAN and without shifting the mel spectrogram, the output is too quiet and has a constant artificial-frequency noise throughout.

I'm not sure exactly what you are doing, but my pretrained models assume that the inputs are normalized using the statistics of the training data.
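
A minimal sketch of that input-side normalization, assuming you have the per-mel-bin mean and standard deviation computed over the training set (how they are stored depends on the recipe):

```python
import numpy as np

def normalize_mel(mel, mean, std):
    """Normalize a mel spectrogram of shape (frames, n_mels) to zero mean and
    unit variance per mel bin, as the pretrained vocoders expect."""
    return (mel - mean) / std

# illustrative shapes: 100 frames x 80 mel bins, stats are per-bin vectors
mel = np.random.randn(100, 80).astype(np.float32)
mean, std = np.zeros(80, np.float32), np.ones(80, np.float32)
mel_norm = normalize_mel(mel, mean, std)
```
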

nartes commented 4 years ago

Indeed, I forgot about that, thanks!