seungwonpark / melgan

MelGAN vocoder (compatible with NVIDIA/tacotron2)
http://swpark.me/melgan/
BSD 3-Clause "New" or "Revised" License

strange noises in your samples && error when running inference.py #30

Closed MorganCZY closed 4 years ago

MorganCZY commented 4 years ago

Your samples at epoch 3200 have strange noises in unvoiced segments, while there is no such phenomenon in the samples at epoch 1600.

[Image: noise in unvoiced segments]

Besides, when running inference.py, an error occurs, pointing to https://github.com/seungwonpark/melgan/blob/8af1e9c12e22f28186739664febb6b0285e3e212/model/generator.py#L68 — torch.cat() has a parameter "dim" rather than "axis".

[Image: error traceback from torch.cat]
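A minimal sketch of the change I'd expect (tensor shapes and the padding value here are just assumptions for illustration):

```python
import torch

# torch.cat in older PyTorch releases only accepts the `dim` keyword,
# so the NumPy-style `axis` raises a TypeError there.
mel = torch.randn(1, 80, 100)            # (batch, mel_channels, frames), shapes assumed
pad = torch.full((1, 80, 10), -11.5129)  # padding frames, value assumed from the repo

mel = torch.cat((mel, pad), dim=2)       # works on all PyTorch versions
# mel = torch.cat((mel, pad), axis=2)    # fails on older PyTorch
```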

seungwonpark commented 4 years ago

zeromel.zip

The spectrogram of the reconstruction from a zero-filled mel looks like this: the line noise appears every 4 frequency bins.

EDIT: there are 512 frequency bins in total, so the pattern appears every 4 bins, not 8. The y-axis of the figure below is wrong.

[Image: spectrogram of the zero-mel reconstruction]
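In case anyone wants to reproduce this, a rough sketch of the check (generator loading omitted; shapes and the "silence" value are assumptions):

```python
import torch
import numpy as np
import librosa

# Build a constant "silence" mel, run it through the generator, and inspect
# the spectrogram of the reconstruction for periodic line noise.
silent_mel = torch.full((1, 80, 200), -11.5129)   # 80 mel bins, 200 frames (assumed)

# with torch.no_grad():
#     audio = generator.inference(silent_mel).squeeze().cpu().numpy()
audio = np.zeros(200 * 256, dtype=np.float32)      # stand-in waveform for this sketch

spec = np.abs(librosa.stft(audio, n_fft=1024, hop_length=256))
print(spec.shape)  # (513, frames); look for horizontal lines repeating every 4 bins
```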

seungwonpark commented 4 years ago

I hope to fix it by matching the implementation details with the official implementation. See #17.

MorganCZY commented 4 years ago

I have trained and tested the official MelGAN repo. The synthesized samples have audible noises, and the overall quality is far worse than that of the official pretrained model.

seungwonpark commented 4 years ago

Oh, does that mean we need to use some tricks (that aren't shown in the paper) to properly train the model?

MorganCZY commented 4 years ago

I strongly suspect there are some training tricks that are not shown in the official repo code. I left an issue on their repo, but haven't received a reply yet.

bob80333 commented 4 years ago

Checkerboard artifacts have been an issue with image GANs before, see this article: https://distill.pub/2016/deconv-checkerboard/

I think some of these audio artifacts may be related. The main way to get rid of them was to replace strided conv layers with bilinear upsample/downsample + conv layers, or to ensure that kernels were exact multiples of their strides. The discriminator here appears to have kernels of 41 with strides of 4; I wonder what would happen if we stuck a bilinear 4x downsample before those convs and set the stride to 1.
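Roughly what I have in mind (toy channel sizes, not the repo's exact discriminator config; average pooling stands in here for a linear-interpolation downsample):

```python
import torch
import torch.nn as nn

# Swap a stride-4 conv for an explicit 4x downsample followed by a stride-1 conv,
# so the kernel size no longer needs to be an exact multiple of the stride.
strided = nn.Conv1d(64, 256, kernel_size=41, stride=4, padding=20, groups=4)

resampled = nn.Sequential(
    nn.AvgPool1d(kernel_size=4, stride=4),  # simple 4x downsample stand-in
    nn.Conv1d(64, 256, kernel_size=41, stride=1, padding=20, groups=4),
)

x = torch.randn(1, 64, 4096)
print(strided(x).shape, resampled(x).shape)  # both reduce the time axis by 4x
```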

I'm going to try this out myself, but first I'm waiting for a model I'm training on part of VoxCeleb2 (the full dataset doesn't fit on my SSD) to hit 1M training steps, so I can tell whether this change gives any improvement.

seungwonpark commented 4 years ago

Nice point, but isn't that a problem with the generator? The generator architecture doesn't seem to have that kind of issue; only the discriminator does.

bob80333 commented 4 years ago

At the end of that article, just before the conclusion, they found that discriminators with stride=2 in the first layer could also cause the generator to create the checkerboard artifacts. The explanation was that some of the neurons in the generator will get many times the gradient due to the striding in the discriminator, and that helps create the artifacts.
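A toy way to see that imbalance (made-up sizes, not the actual model):

```python
import torch
import torch.nn as nn

# With kernel_size=41 and stride=4, some input samples fall inside 11 output
# windows and others inside only 10, so the gradient flowing back through the
# discriminator is periodically uneven.
conv = nn.Conv1d(1, 1, kernel_size=41, stride=4, padding=20, bias=False)
nn.init.constant_(conv.weight, 1.0)

x = torch.ones(1, 1, 256, requires_grad=True)
conv(x).sum().backward()
print(x.grad.squeeze()[100:110])  # tensor([11., 10., 10., 10., 11., 10., 10., 10., 11., 10.])
```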

I don't know if that would apply to this audio GAN, but it seems like a fairly simple thing to check. I have modified the discriminator in my fork, and I will start a training run tonight to see if it helps.

bob80333 commented 4 years ago

Tested my fork out: the discriminator converges really fast, and the generator learns nothing.

Note the scales here

What the generator's output looks like:

[Screenshots: generator output waveforms]

Swapping from strided convolutions to downsampling appears to have made the discriminator much stronger; not sure how to fix that...

seungwonpark commented 4 years ago

I'm sorry to hear that.

Is using nn.Upsample for downsampling okay? The documentation says:

If you want downsampling/general resizing, you should use interpolate
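For reference, a minimal sketch of what the docs point to (shapes made up):

```python
import torch
import torch.nn.functional as F

# Call F.interpolate directly for downsampling instead of going through nn.Upsample.
x = torch.randn(1, 64, 4096)
y = F.interpolate(x, scale_factor=0.25, mode="linear", align_corners=False)
print(y.shape)  # torch.Size([1, 64, 1024])
```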

Thanks for sharing your results, by the way.

bob80333 commented 4 years ago

Oh! Nice catch, I missed that in the docs. I just fixed it in my fork; training is slightly better with this, but the discriminator still overpowers the generator quickly.

Discriminator converged in 2k steps rather than <500 steps.

[Screenshot: discriminator loss curve]

geekboood commented 4 years ago

@bob80333 Hi, I tried training MelGAN on the CSMSC dataset, a single-speaker dataset of about 20 hours. My understanding is that the discriminator should converge pretty fast, because at the very beginning the generator's output is very bad and therefore easy to discern. If you run for more epochs, you may find that the generator's results improve at some point. Here is my TensorBoard log.

[Screenshot: TensorBoard loss curves]

As you can see in the figure, the generator's loss was stuck at around 120 before 300k iterations, and after that it starts improving, while the discriminator's loss fluctuates a lot. I can hear something after 1.1M steps, but it still has some artifacts. Maybe I should wait for 2M iterations. I also found that at the end of each audio there is a peak that generates noise.

bob80333 commented 4 years ago

Hey, thanks for the information! I have trained on my dataset (part of VoxCeleb2) with the current master branch for 1M steps and got this training curve:

[Screenshot: training curve]

The results were intelligible, but the voices themselves had artifacts while speaking, which is why I commented on this issue with ideas to fix it. For the first modification I tried, I waited 80k steps, at which point the discriminator had reached a loss of 3.3e-5 and the generator was producing loud, high-pitched noises. I tried other approaches, but the discriminator converged really quickly again, and I didn't want to wait to see if it failed, especially since my original training curve looked very different from that.

seungwonpark commented 4 years ago

I've trained with the fix/17 branch for 14 days (more than 6400 epochs) on the LJSpeech-1.1 dataset, and the results don't have strange noise in unvoiced segments! I'll soon upload new audio samples (with a pre-trained model, if possible) and merge the fix/17 branch into master.

seungwonpark commented 4 years ago

The issues that were initially discussed here are now resolved, but I loved @bob80333's ideas and countless trials to improve the quality. Feel free to continue the discussion here, or open a new issue.