kan-bayashi closed this issue 4 years ago
I'm curious how much contribution training the discriminator helps with final audio quality. From the initial 75k step sample you showed, with only stft loss the generator seems to generate decent sounding audio already, and the loss curve seems to still go down. Perhaps it's worth investigating starting the adversarial training much later (300k etc) or not at all to see if the model can still generate good quality audio.
Hi @G-Wang. Thank you for your comments. Yes, I'm also curious about it. I will run with:

- `discriminator_train_start_step=300000`  # add discriminator later
- `discriminator_train_start_step=400000`  # no discriminator at all

I finished training four models:
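The effect of `discriminator_train_start_step` can be sketched as a simple gate in the training loop. This is an illustrative stand-in, not the repository's actual trainer code: `train_step` is a hypothetical helper that just reports which losses are active at a given step.

```python
# Sketch of a delayed discriminator start in a GAN-vocoder training
# loop. The multi-resolution STFT loss is always active; the
# adversarial/fake/real terms only turn on after the configured step.

def train_step(step, discriminator_train_start_step):
    """Return the list of loss terms active at a given training step."""
    losses = ["stft"]  # generator is pretrained with STFT loss only
    if step >= discriminator_train_start_step:
        # adversarial loss for the generator plus the real/fake
        # losses for the discriminator kick in from this step
        losses += ["adversarial", "fake", "real"]
    return losses

print(train_step(100_000, 300_000))  # ['stft']
print(train_step(350_000, 300_000))  # ['stft', 'adversarial', 'fake', 'real']
```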
Here is the training curve.
Here is the sample.
https://drive.google.com/open?id=1LL_A4ysUqKJ13YQBdQwzNBvGp8m8BhqY
Consequently, in terms of perceptual quality, `v1` is the best.
The `v1` loss value itself is higher than that of `v2` or `v4`, but its quality is the best.
`v4` (no discriminator) has the lowest loss, but it causes strange noise similar to `v2`.
`v3` also contains this kind of noise, but smaller than `v1`.
@kan-bayashi I wonder if training the model with mu-law audio would help with the noise. In DeepMind's GAN-TTS (https://openreview.net/forum?id=r1gfQgSFDr) they report getting better audio with mu-law.
Hi @G-Wang, this is interesting. I will try.
@G-Wang I started training (#37)! Please look forward to seeing the results!
So far, I could not confirm any improvement.
I tried several settings:
- `v1.mulaw`: `v1` config + apply mu-law as preprocessing
- `v2.mulaw`: `v2` config + apply mu-law as preprocessing
- `v1.mulaw.v2`: `v1` config + apply mu-law only to the discriminator inputs

The training curves are as follows, where `v1.single` is the baseline with the `v1` config.
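For reference, mu-law companding as a preprocessing step can be sketched as follows. This is an illustrative implementation of the standard mu-law formula (mu = 255, the usual 8-bit value); the function names are hypothetical, not the repository's actual API.

```python
import numpy as np

def mulaw_encode(x, mu=255):
    """Compress a waveform in [-1, 1] with the mu-law curve:
    sign(x) * ln(1 + mu*|x|) / ln(1 + mu)."""
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

def mulaw_decode(y, mu=255):
    """Invert mu-law compression: sign(y) * ((1 + mu)^|y| - 1) / mu."""
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(mu)) / mu

# Round trip should recover the original signal
x = np.linspace(-1, 1, 11)
assert np.allclose(mulaw_decode(mulaw_encode(x)), x)
```

The companding boosts low-amplitude regions before the model sees the signal, which is why a spectral loss computed on the mu-law signal behaves differently from one on the raw waveform.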
This is an example of `v1.mulaw` at 100k steps.
It seems that spectral convergence loss for the mu-law converted signal is not effective.
After introducing the adversarial loss, the overall outline became better, but the sound quality is still very bad.
If we start training the discriminator from the first steps (`v2.mulaw`), it becomes slightly better. But compared to the baseline, the quality is still bad at 160k steps.
I think one of the reasons is that the spectrogram of the mu-law converted signal is less meaningful.
So I tried applying mu-law only to the inputs of the discriminator (`v1.mulaw.v2`).
But the spectral loss became larger, as you can see in the training curve.
You can listen to the samples at https://drive.google.com/open?id=1BrbB3Dh0c8HYxCQ5YGmXqNO-qXB0rJIB
@kan-bayashi Hi, thanks for the great repo. I have a question about the graphs of the adversarial and fake losses. If I understood correctly, they should be reflections of each other rather than having similar values, yes?
I mean, the adversarial loss is computed for the generator, whereas the fake loss is computed for the discriminator.
That means `adv_loss = ||1 - D(G(z))||`, whereas `fake_loss = ||D(G(z))||`.
So, if `fake_loss` is around 0.25, `adv_loss` should be around 0.75, no?
Or, maybe I misunderstood about it.
Hi @patrickltobing.
Parallel WaveGAN uses a least-squares GAN, so we calculate the squared L2 norm, not the L1 norm.
So the discriminator outputs should be around 0.5, and then both the fake and real (adv) loss values will be around `0.5^2 = 0.25`.
Hello @kan-bayashi, great work!
About the strange noise caused by the discriminator, my guess is that the discriminator doesn't have a large enough receptive field to capture such artifacts and penalize them.
The discriminator with a linearly increasing dilation rate, with 8 dilated layers of kernel size 3 and two other layers of dilation 1, has a receptive field of 77 samples, or about 3.2 ms at 24 kHz. If that noise ticks, say, once every 240 samples (10 ms), the discriminator cannot learn that pattern and penalize it. Thus a wider receptive field for the discriminator should help.
Switching the 8 dilated layers to ascending powers-of-2 dilations yields a receptive field of 515 samples, or about 21.5 ms at 24 kHz, which might be better suited for this particular artifact.
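The receptive-field arithmetic behind these numbers can be checked with a few lines. Each conv layer with kernel size k and dilation d adds (k - 1) * d samples to the receptive field, starting from 1; the exact dilation schedules below reproduce the 77- and 515-sample figures quoted above.

```python
def receptive_field(dilations, kernel_size=3):
    """Receptive field (in samples) of stacked dilated 1-D convs."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# 8 linearly increasing dilations (1..8) plus two dilation-1 layers
linear = receptive_field(list(range(1, 9)) + [1, 1])
# 8 powers-of-2 dilations (1..128) plus two dilation-1 layers
powers = receptive_field([2 ** i for i in range(8)] + [1, 1])

sr = 24_000
print(linear, 1000 * linear / sr)  # 77 samples, ~3.2 ms
print(powers, 1000 * powers / sr)  # 515 samples, ~21.5 ms
```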
Again, great work! Looking forward to seeing what can be done with this vocoder!
Hi @Rayhane-mamah. I have learned many things from your great repository :) Thank you for your great suggestions! I've never investigated the effect of the discriminator network size. I will check it.
@audier Maybe this idea is also worthwhile to try in the case of the singing voice generation (#42).
Hi again @kan-bayashi. In theory, an adversary with the exact same receptive field as the generator should be able to detect all types of artifacts that the generator creates. Unfortunately, simply stacking 30 dilated conv layers in the discriminator will probably cause a vanishing gradient problem.
Thankfully, there is a solution to that, which also provides a pretty cool extra: a discriminator that has the same architecture as the generator.
With this idea in mind, I trained a model very similar to your Parallel WaveGAN and found these results: https://drive.google.com/drive/folders/1FwivBIwKqCSd4Pz8dKl7Mwa6EtEUKYlt?usp=sharing
The model is trained for 200k steps. I also tried an earlier checkpoint at 130k steps, but it had more low-frequency problems. I am assuming longer training will fix the remaining issues of the 200k checkpoint.
Finally, I can send a PR with this discriminator option if you want. Cheers!
PS: while using a deeper discriminator makes the sec/step slower during training, learning nevertheless seems to happen faster, i.e., the model might require fewer training steps (than 400k) to reach good quality.
Hi @Rayhane-mamah. Thank you for your valuable report! This is great. I'm looking forward to seeing the results with 400k steps. PR is always welcome! Could you make a PR for your great extension? :)
@Rayhane-mamah Thank you for your great PR! I'm now training with your new discriminator, currently around 150k steps. I found that the discriminator becomes very strong and the adversarial loss goes to almost 1.0. How about in your case?
Oh really? Hmm this never happened in my experiments.. did you do 100k steps of generator pretraining?
Right. I started training the discriminator from 100k steps. Here is the training curve. Until 110k the curve looks very nice, but the discriminator became too strong from 110k onwards.
What I changed from your PR is the batch size (6 -> 5) to train with a single GPU.
Hmm... Did you change other hyperparameters?
I will try with halved discriminator lr.
My model has slightly different details in the spectrogram representation, but everything else is kept as default. I used a batch size of 16 with 2 GPUs. Might be worth noting.
From my small experience with GANs, if the discriminator is overpowering the generator, you probably want to reduce the batch size actually
@Rayhane-mamah Thank you for your information. I will check the batch-size effect.
Moved on to #61.
I compared the following two models:
From the curve, the blue one is better than the red one in terms of log STFT magnitude loss.
However, the blue model produces strange noise.
You can listen to the samples. https://drive.google.com/open?id=1LL_A4ysUqKJ13YQBdQwzNBvGp8m8BhqY
I think this is caused by the discriminator (v1 is red and v2 is blue). If you have any idea or suggestion to avoid this issue, please share with me.