kan-bayashi closed this issue 4 years ago
I'm curious how much contribution training the discriminator helps with final audio quality. From the initial 75k step sample you showed, with only stft loss the generator seems to generate decent sounding audio already, and the loss curve seems to still go down. Perhaps it's worth investigating starting the adversarial training much later (300k etc) or not at all to see if the model can still generate good quality audio.
Hi @G-Wang. Thank you for your comments. Yes, I'm also curious about it. I will run with:

- `discriminator_train_start_step=300000`  # add discriminator later
- `discriminator_train_start_step=400000`  # no discriminator at all

I finished training four models:
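The effect of `discriminator_train_start_step` can be sketched as a simple gate in the training loop. This is an illustrative stand-in, not the repository's actual trainer code: `train_step` is a hypothetical helper that just reports which losses are active at a given step.

```python
# Sketch of a delayed discriminator start in a GAN-vocoder training
# loop. The multi-resolution STFT loss is always active; the
# adversarial/fake/real terms only turn on after the configured step.

def train_step(step, discriminator_train_start_step):
    """Return the list of loss terms active at a given training step."""
    losses = ["stft"]  # generator is pretrained with STFT loss only
    if step >= discriminator_train_start_step:
        # adversarial loss for the generator plus the real/fake
        # losses for the discriminator kick in from this step
        losses += ["adversarial", "fake", "real"]
    return losses

print(train_step(100_000, 300_000))  # ['stft']
print(train_step(350_000, 300_000))  # ['stft', 'adversarial', 'fake', 'real']
```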
Here is the training curve.
Here is the sample.
https://drive.google.com/open?id=1LL_A4ysUqKJ13YQBdQwzNBvGp8m8BhqY
Consequently, in terms of perceptual quality, `v1` is the best.
The `v1` loss value itself is higher than that of `v2` or `v4`, but its quality is the best.
`v4` (no discriminator) has the lowest loss, but it causes strange noise similar to `v2`.
`v3` also contains this kind of noise, but smaller than `v1`.
@kan-bayashi I wonder if training the model with mu-law audio would help with the noise. In DeepMind's GAN-TTS (https://openreview.net/forum?id=r1gfQgSFDr) they report getting better audio with mu-law.
Hi @G-Wang, this is interesting. I will try.
@G-Wang I started training (#37)! Please look forward to seeing the results!
So far, I could not confirm any improvement.
I tried several settings:
- `v1.mulaw`: `v1` config + apply mu-law as preprocessing
- `v2.mulaw`: `v2` config + apply mu-law as preprocessing
- `v1.mulaw.v2`: `v1` config + apply mu-law only to the discriminator inputs

The training curves are as follows, where `v1.single` is the baseline with the `v1` config.
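For reference, mu-law companding as a preprocessing step can be sketched as follows. This is an illustrative implementation of the standard mu-law formula (mu = 255, the usual 8-bit value); the function names are hypothetical, not the repository's actual API.

```python
import numpy as np

def mulaw_encode(x, mu=255):
    """Compress a waveform in [-1, 1] with the mu-law curve:
    sign(x) * ln(1 + mu*|x|) / ln(1 + mu)."""
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

def mulaw_decode(y, mu=255):
    """Invert mu-law compression: sign(y) * ((1 + mu)^|y| - 1) / mu."""
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(mu)) / mu

# Round trip should recover the original signal
x = np.linspace(-1, 1, 11)
assert np.allclose(mulaw_decode(mulaw_encode(x)), x)
```

The companding boosts low-amplitude regions before the model sees the signal, which is why a spectral loss computed on the mu-law signal behaves differently from one on the raw waveform.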
This is an example of `v1.mulaw` at 100k steps.
It seems that spectral convergence loss for the mu-law converted signal is not effective.
After introducing the adversarial loss, the overall outline became better, but the sound quality is still very bad.
If we start training the discriminator from the first steps (`v2.mulaw`), it becomes slightly better. But compared to the baseline, the quality is still bad at 160k steps.
I think one of the reasons is that the spectrogram of the mu-law converted signal is less meaningful.
So I tried applying mu-law only to the inputs of the discriminator (`v1.mulaw.v2`).
But the spectral loss became larger, as you can see in the training curve.
You can listen to the samples at https://drive.google.com/open?id=1BrbB3Dh0c8HYxCQ5YGmXqNO-qXB0rJIB
@kan-bayashi Hi, thanks for the great repo. I have a question about the graphs of the adversarial and fake losses. If I understood correctly, they should be reflections of each other rather than having similar values, yes?
I mean, the adversarial loss is computed for the generator, whereas the fake loss is computed for the discriminator.
That means `adv_loss = ||1 - D(G(z))||`, whereas `fake_loss = ||D(G(z))||`.
So, if `fake_loss` is around 0.25, `adv_loss` should be around 0.75, no?
Or, maybe I misunderstood about it.
Hi @patrickltobing.
Parallel WaveGAN uses a least-squares GAN, so we calculate the squared L2 norm, not the L1 norm.
So the discriminator outputs should be around 0.5, and then both the fake and real (adv) loss values will be around `0.5^2 = 0.25`.
Hello @kan-bayashi, great work!
About the strange noise caused by the discriminator, my guess is that the discriminator doesn't have a large enough receptive field to capture such artifacts and penalize them.
The discriminator with a linearly increasing dilation rate, with 8 dilated layers of kernel size 3 and two other layers of dilation 1, has a receptive field of 77 samples, or about 3.2 ms at 24 kHz. If that noise ticks, say, once every 240 samples (10 ms), the discriminator cannot learn that pattern and penalize it. Thus a wider receptive field for the discriminator should help.
Switching the 8 dilated layers to ascending powers-of-2 dilations yields a receptive field of 515 samples, or about 21.5 ms at 24 kHz, which might be better suited for this particular artifact.
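The receptive-field arithmetic behind these numbers can be checked with a few lines. Each conv layer with kernel size k and dilation d adds (k - 1) * d samples to the receptive field, starting from 1; the exact dilation schedules below reproduce the 77- and 515-sample figures quoted above.

```python
def receptive_field(dilations, kernel_size=3):
    """Receptive field (in samples) of stacked dilated 1-D convs."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# 8 linearly increasing dilations (1..8) plus two dilation-1 layers
linear = receptive_field(list(range(1, 9)) + [1, 1])
# 8 powers-of-2 dilations (1..128) plus two dilation-1 layers
powers = receptive_field([2 ** i for i in range(8)] + [1, 1])

sr = 24_000
print(linear, 1000 * linear / sr)  # 77 samples, ~3.2 ms
print(powers, 1000 * powers / sr)  # 515 samples, ~21.5 ms
```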
Again, great work! Looking forward to seeing what can be done with this vocoder!
Hi @Rayhane-mamah. I have learned many things from your great repository :) Thank you for your great suggestions! I've never investigated the effect of the discriminator network size. I will check it.
@audier Maybe this idea is also worthwhile to try in the case of the singing voice generation (#42).
Hi again @kan-bayashi. In theory, an adversary with the exact same receptive field as the generator should be able to detect all types of artifacts that the generator creates. Unfortunately, simply stacking 30 dilated conv layers in the discriminator will probably cause a vanishing gradient problem.
Thankfully, there is a solution to that, which also provides a pretty cool extra: a discriminator that has the same architecture as the generator.
With this idea in mind, I trained a model very similar to your Parallel WaveGAN and found these results: https://drive.google.com/drive/folders/1FwivBIwKqCSd4Pz8dKl7Mwa6EtEUKYlt?usp=sharing
The model is trained for 200k steps. I also tried an earlier checkpoint at 130k steps, but it had more low-frequency problems. I am assuming longer training will fix the remaining issues of the 200k checkpoint.
Finally, I can send a PR with this discriminator option if you want. Cheers!
PS: while using a deeper discriminator makes the sec/step slower during training, learning nevertheless seems to happen faster, i.e., the model might require fewer training steps (than 400k) to reach good quality.
Hi @Rayhane-mamah. Thank you for your valuable report! This is great. I'm looking forward to seeing the results with 400k steps. PR is always welcome! Could you make a PR for your great extension? :)
@Rayhane-mamah Thank you for your great PR! I'm now training with your new discriminator, currently around 150k steps. I found that the discriminator becomes very strong and the adversarial loss goes to almost 1.0. How about in your case?
Oh really? Hmm this never happened in my experiments.. did you do 100k steps of generator pretraining?
Right. I started training the discriminator from 100k steps. Here is the training curve. Until 110k the curve looks very nice, but the discriminator became too strong from 110k onwards.
What I changed from your PR is the batch size (6 -> 5) to train with a single GPU.
Hmm... Did you change other hyperparameters?
I will try with halved discriminator lr.
My model has slightly different details in the spectrogram representation, but everything else is kept as default. I used a batch size of 16 with 2 GPUs. Might be worth noting.
From my small experience with GANs, if the discriminator is overpowering the generator, you probably want to reduce the batch size actually
@Rayhane-mamah Thank you for your information. I will check the batch-size effect.
Moved on to #61.
I compared the following two models:
From the curve, the blue one is better than the red one in terms of log STFT magnitude loss.
However, the blue model produces strange noise.
You can listen to the samples. https://drive.google.com/open?id=1LL_A4ysUqKJ13YQBdQwzNBvGp8m8BhqY
I think this is caused by the discriminator (v1 is red and v2 is blue). If you have any idea or suggestion to avoid this issue, please share with me.