Closed erogol closed 4 years ago
Hi @erogol. I also ran into this problem and am trying to solve it now (#54). You can see our discussion here: https://github.com/kan-bayashi/ParallelWaveGAN/issues/27#issuecomment-574457093. The things I tried are as follows:
A short literature review gave me these options if they make any sense;
Have you tried anything similar to these, so I can try the rest maybe?
@erogol @kan-bayashi Hi, using transposed convolution rather than Upsampling in the generator will help (use a kernel_size and stride that are large enough). I'm combining MelGAN and Parallel WaveGAN as discussed in #46, and the result seems very good to me.
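A minimal sketch of how kernel size and stride determine the output length of transposed-convolution upsampling. It only evaluates the output-length formula documented for PyTorch's `ConvTranspose1d`; the helper names and the layer setup (kernel size `2 * scale`, padding chosen to match) are assumptions for illustration, not necessarily what this repo uses.

```python
def conv_transpose1d_out_len(length, kernel_size, stride, padding=0,
                             output_padding=0, dilation=1):
    """Output length of a 1D transposed convolution (ConvTranspose1d formula)."""
    return ((length - 1) * stride - 2 * padding
            + dilation * (kernel_size - 1) + output_padding + 1)


def upsampled_len(n_frames, scales):
    """Length after stacking one transposed conv per upsampling scale.

    Assumed layer setup: kernel_size = 2 * scale, stride = scale, and padding
    chosen so that each layer multiplies the length by exactly `scale`.
    """
    length = n_frames
    for s in scales:
        length = conv_transpose1d_out_len(
            length,
            kernel_size=2 * s,
            stride=s,
            padding=s // 2 + s % 2,
            output_padding=s % 2,
        )
    return length


n_frames = 100
print(upsampled_len(n_frames, [8, 8, 2, 2]))  # 25600, i.e. 100 frames * 256
```

With this padding choice, each layer scales the length by its stride, so the stack maps one mel frame to exactly `prod(scales)` samples.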
@dathudeptrai merging the models only architecturally or training methods as well?
@erogol Both. I use the training method from this repo (v2, as you are using).
That is interesting. I will add the MelGAN generator this weekend.
Keep in mind that for every GAN architecture, the authors have tuned the parameters and architecture carefully (so the generator won't be stronger than the discriminator, and vice versa). In your config, you don't use the original discriminator (you use the residual discriminator instead), which makes the discriminator stronger, so you need to modify the generator too (increase the number of layers, kernel size, ...) :D
I added MelGAN generator in #62. Training is on going :D
@kan-bayashi Wow, very fast :)) Training will be about 5x faster than Parallel WaveGAN :D Let's see.
I added initial MelGAN results #65 (v1 config based). The results seem reasonable and there is room for improvement if we continue training for more iterations (training is ongoing). I'm also trying a v2-based MelGAN (ongoing). Please look forward to seeing the results :D
@kan-bayashi Let's see. I'm training MelGAN (almost the same as what you are doing; I use the multi-scale discriminator loss that MelGAN introduced, plus some modifications) with 8-bit fake-quantization training. I also converted it successfully to TFLite; inference on a mobile CPU with 1 thread is around 2x faster than real time.
@dathudeptrai You mean this feature? https://pytorch.org/docs/stable/quantization.html#torch.quantization.FakeQuantize That is amazing! Real-time generation on the device comes true :D
@kan-bayashi No, I use https://github.com/Xilinx/brevitas (the input needs to be expanded to 4D for training, since this framework only supports conv2D :D, but that's enough :D). I also have a larger version of the MelGAN generator, which I convert to TensorFlow and optimize with TensorRT on the server side :D It seems all we need to do now is improve the quality :)) Speed isn't a problem anymore :D
I can tell, even though there is a jump after 100K, still voice fidelity improves as the training goes. It is interesting to see.
@dathudeptrai is there an official way to convert models to tf, or is it just loading weights with the same architecture?
@erogol ONNX is one way you can try, but in my experience, loading the weights into the same architecture works best :)) I can implement both the torch and TF versions very quickly :D and the conversion is just one for loop :D In addition, with TF it is easy to convert the model to TFLite and TensorRT to optimize inference. I don't really like using an intermediate representation like ONNX because of its limited operator support :D
I successfully trained MelGAN (generator) in my fork. MelGAN requires the v1 (smaller) discriminator. I guess the larger D is too strong for MelGAN to learn.
You can check the model with TTS here: https://colab.research.google.com/drive/1Zg9jR27Pr-ziVa0krjtdoy2dKv6whv7b
There is still a slight background noise but it is because of the TTS model. Somehow, GAN models are not as robust as WaveRNN to spectrogram representation. Maybe it is a good idea to induce some random noise in training.
@erogol Please try ratios = [4,4,4,2,2] and increase n_residual; maybe it can help you eliminate the background noise :D
Currently, I am trying to train the MelGAN generator with
The training curve is not stable but it sounds more natural than v1. Once I finish the training, I will upload the config and results.

> Please try ratios = [4,4,4,2,2] and increase n_residual; maybe it can help you eliminate the background noise :D

That's nice. I will try the following config.
generator_type: "MelGANGenerator" # Generator type.
generator_params:
    in_channels: 80                   # Number of input channels.
    out_channels: 1                   # Number of output channels.
    kernel_size: 7                    # Kernel size of initial and final conv layers.
    channels: 1024                    # Initial number of channels for conv layers.
    upsample_scales: [4, 4, 4, 2, 2]  # List of upsampling scales.
    stack_kernel_size: 3              # Kernel size of dilated conv layers in residual stack.
    stacks: 4                         # Number of stacks in a single residual stack module.
    use_weight_norm: True             # Whether to use weight normalization.
    use_causal_conv: False            # Whether to use causal convolution.
I think the MelGAN D is stronger than the v2 D :))) as far as I know :))) and its discriminator seems very similar to the GAN-TTS discriminator (Google's ICLR 2020 paper). https://arxiv.org/abs/1909.11646

> @erogol Please try ratios = [4,4,4,2,2] and increase n_residual; maybe it can help you eliminate the background noise :D

It would not work, since my hop_length is 275. I am now training a couple of new models with hop_length 256; there I can try these upsampling params.
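The constraint behind this reply can be sketched as follows: a generator that upsamples each mel frame into waveform samples needs the product of its upsampling scales to equal the hop length used for feature extraction. The helper name below is hypothetical.

```python
from math import prod


def check_upsample_scales(upsample_scales, hop_length):
    """Return True if the total upsampling factor equals the hop length."""
    return prod(upsample_scales) == hop_length


print(check_upsample_scales([4, 4, 4, 2, 2], 256))  # True: 4*4*4*2*2 = 256
print(check_upsample_scales([4, 4, 4, 2, 2], 275))  # False
print(prod([5, 5, 11]))  # 275, one possible factorization for hop_length 275
```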
I added several new results.

* `melgan.v1`: MelGAN G + PWG D #67 ([sample](https://drive.google.com/open?id=1ieLCKR_GYaYnbm0OwHv84nLWzRLwF3zr))
* `melgan_large.v1`: Large MelGAN G + PWG D #75 ([sample](https://drive.google.com/open?id=1HpdLiaflMRw60N6_lQkdEA4uBzPCNWGO))
* `melgan.v3`: MelGAN G + MelGAN D #77 ([sample](https://drive.google.com/open?id=1KeN5ojS7yuZEoOfLltbncP-KpftipM8X))

According to a few samples, it seems that `melgan.v3` >> `melgan_large.v1` >= `melgan.v1` in terms of naturalness. But the training curve of `v3` is not so stable, i.e., both the fake and real losses are gradually decreasing, and the feature matching loss kept increasing... Also, the STFT-based losses are much higher than in v1 despite the higher naturalness. This is interesting, but it makes it difficult to estimate the quality from the loss value :(
@kan-bayashi Same observation here :))) The STFT, spectral, and feature matching losses keep increasing over training :D and it needs much longer training; I trained for around 3M steps. I'm wondering about the discriminator overpowering problem: it may look like a problem in the loss curves, but in terms of naturalness it isn't :v It seems we need another metric to measure sound quality (MOS score). I'm reading about the Fréchet DeepSpeech Distance and Kernel DeepSpeech Distance, which the authors found to be well correlated with MOS. (See the GAN-TTS paper: https://arxiv.org/pdf/1909.11646.pdf) :D
@dathudeptrai Thank you for the useful information. I will check the metrics.
Let me continue training until 4M iters. Regarding the discriminator (fake and real) losses, do you think it is better to find hyperparameters that keep them stable around 0.25, or is a gradual decrease acceptable?
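A small worked example of where the value 0.25 may come from, assuming the least-squares GAN objective described in the Parallel WaveGAN paper and that the real and fake losses are logged separately. If the discriminator cannot tell real from generated audio and outputs 0.5 for both, each term equals 0.25. The function names are illustrative only.

```python
def real_loss(d_out_real):
    """Least-squares loss pushing D(real) toward 1, averaged over outputs."""
    return sum((1.0 - d) ** 2 for d in d_out_real) / len(d_out_real)


def fake_loss(d_out_fake):
    """Least-squares loss pushing D(fake) toward 0, averaged over outputs."""
    return sum(d ** 2 for d in d_out_fake) / len(d_out_fake)


# Discriminator that cannot separate real from fake: outputs 0.5 everywhere.
undecided = [0.5, 0.5, 0.5, 0.5]
print(real_loss(undecided), fake_loss(undecided))  # 0.25 0.25

# Discriminator that is winning: both losses drop below 0.25.
confident_real = [0.9, 0.95, 0.9, 0.85]
confident_fake = [0.1, 0.05, 0.1, 0.15]
print(real_loss(confident_real) < 0.25, fake_loss(confident_fake) < 0.25)  # True True
```

Under this reading, real and fake losses that both keep decreasing suggest the discriminator is gradually pulling ahead of the generator.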
@kan-bayashi I still think keeping it stable around 0.25 is better. For now we can conclude that the STFT-based losses help training converge faster and give better sound quality, but their values are not as important as we thought. How about giving the STFT-based losses a smaller weight once discriminator training starts, or removing them entirely? :))) When we use the STFT-based losses, it seems to make it harder for the generator to compete with the discriminator :))
@dathudeptrai That is a nice idea. I will try.
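The idea of down-weighting the STFT-based losses once the discriminator starts can be sketched as below. This is only an illustration: the function, its parameter names, and the default values are hypothetical, apart from the 100K-step discriminator start mentioned in this thread.

```python
def generator_loss(stft_loss, adv_loss, step,
                   disc_start_step=100_000,
                   lambda_adv=4.0,
                   stft_weight_after=0.5):
    """Combine STFT and adversarial terms for the generator.

    Before `disc_start_step` only the STFT loss is used. Afterwards the
    adversarial term is added and the STFT term is scaled by
    `stft_weight_after` (0.0 removes it entirely).
    """
    if step < disc_start_step:
        return stft_loss
    return stft_weight_after * stft_loss + lambda_adv * adv_loss


print(generator_loss(1.0, 0.5, step=50_000))    # 1.0 (STFT only)
print(generator_loss(1.0, 0.5, step=150_000))   # 2.5 (0.5 * 1.0 + 4.0 * 0.5)
print(generator_loss(1.0, 0.5, step=150_000, stft_weight_after=0.0))  # 2.0
```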
Does anyone have the learning curve of the original MelGAN training? I want to check it.
> I added several new results.
>
> * `melgan.v1`: MelGAN G + PWG D #67 ([sample](https://drive.google.com/open?id=1ieLCKR_GYaYnbm0OwHv84nLWzRLwF3zr))
> * `melgan_large.v1`: Large MelGAN G + PWG D #75 ([sample](https://drive.google.com/open?id=1HpdLiaflMRw60N6_lQkdEA4uBzPCNWGO))
> * `melgan.v3`: MelGAN G + MelGAN D #77 ([sample](https://drive.google.com/open?id=1KeN5ojS7yuZEoOfLltbncP-KpftipM8X))
>
> According to a few samples, it seems that `melgan.v3` >> `melgan_large.v1` >= `melgan.v1` in terms of the naturalness. But the training curve of `v3` is not so stable, i.e., gradually decreasing both fake and real losses. And feature matching loss kept increasing... And STFT-based losses are much higher than v1 but higher naturalness. This is interesting but difficult to estimate the quality from the loss value :(

I don't agree about the naturalness. I believe the ones with PWGAN-D are better, even in your examples, for instance 050-0032.wav.
This is also the case in my runs, so PWGAN-D works better so far.
I also believe spectral features are good indicators of performance. In my experience, a model with better STFT losses performs better in voice quality every time.
@erogol Maybe your training is not long enough to see the performance of the MelGAN D (I trained with the MelGAN D for around 7M steps to obtain the best result). If the STFT losses indicated voice quality, why wouldn't we just train the generator with the STFT loss alone to minimize it? Once we start training the discriminator, the STFT losses increase, but the audio quality still improves too.
@kan-bayashi I have the learning curve of the original MelGAN training, but only after 7M steps :))) When training has converged, the mels_construction loss is 3.2, the generator loss is 10, the feature matching loss is 10, and the discriminator loss is 0.75 :D
> I don't agree with naturalness. I believe ones with PWGAN-D are better, even in your examples. For instance 050-0032.wav

@erogol Oh really? To me, the `melgan.v1` sample contains the kind of noise typical of signal-processing-based vocoders (e.g. WORLD or STRAIGHT), but `v3` has no such noise. As @dathudeptrai said, maybe the MelGAN D requires a lot of iters. Let us compare the samples after 4M iters.

> i have the learning curve of the original MelGAN training, but it's after 7M steps :))). when the training convergence, the mels_construction loss is 3.2, generator loss is 10, feature matching loss is 10, discriminator loss is 0.75 :D.

@dathudeptrai Thank you for the info. I started to train without the STFT-based loss after introducing the discriminator, but the discriminator loss is still gradually decreasing. So I want to confirm whether the loss in the original training stays at the same value or keeps decreasing.
@erogol @kan-bayashi This is an official implementation of the DeepSpeech distances from the GAN-TTS paper :D https://github.com/mbinkowski/DeepSpeechDistances. Maybe we can try it to measure the quality of our models :D
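For intuition only, here is a rough sketch of the Fréchet distance between Gaussians fitted to two sets of feature vectors, which is the general formula that Fréchet-style metrics build on. It is not the official DeepSpeech distance implementation: feature extraction is not shown, the function name is hypothetical, and the inputs are assumed to be plain NumPy arrays with at least two feature dimensions.

```python
import numpy as np


def frechet_distance(feats_a, feats_b):
    """Squared Fréchet distance between Gaussians fitted to two feature sets.

    feats_a, feats_b: arrays of shape (n_samples, n_features), n_features >= 2.
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a cov_b)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative in theory.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)


rng = np.random.default_rng(0)
reference = rng.normal(size=(200, 3))           # stand-in for real-speech features
shifted = reference + np.array([3.0, 4.0, 0.0])  # same covariance, shifted mean
print(round(frechet_distance(reference, shifted), 6))  # 25.0 (= 3^2 + 4^2)
```

For real evaluation, the linked repository should be used instead, since the metric depends on the specific feature extractor.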
I added 3M samples (#80) and updated the demo page. https://kan-bayashi.github.io/ParallelWaveGAN/ I think PWG and MelGAN are now comparable.
@kan-bayashi Hi, I'm not a native speaker, but I think they are comparable in terms of naturalness too. How about the noise? In my experiments, the MelGAN D has no such noise as PWG.

> How about the noise?, in my experiment, melgan D has no such noise as PWG.

You mean the noise observed in the `melgan.v1` model, i.e., MelGAN G trained with PWG D? If we train with MelGAN D (`melgan.v3`), such noise will disappear.
@kan-bayashi Okay :D Just curious, what is your next plan :)) Which version will you train next, v2? :D
Currently, I'm training MelGAN G + MelGAN D based on your idea (https://github.com/kan-bayashi/ParallelWaveGAN/issues/61#issuecomment-588007905), and I'm curious about the results with PWG G + MelGAN D. I want to try it.
I've trained several models based on `v2`, i.e., `ResidualParallelWaveGANDiscriminator`, but the results are not so good. So `v2` has no high priority.
I compared `melgan.v3.long` samples with the samples w/o STFT-loss after introducing the discriminator @ 4M iters. `melgan.v3.long` is clearly better, while the training curves of the discriminator loss are almost the same.
Red: w/o STFT-loss after introducing the discriminator
Blue+Orange: melgan.v3.long
STFT-loss is not used for backpropagation, just monitoring.
So, I conclude MelGAN G + MelGAN D + STFT-loss can improve the quality, or at least the convergence speed.
Okay :D How about PWG G + MelGAN D + STFT-loss?

> okay :D, how about PWG G + MelganD + STFT-loss ?

Ongoing. The quality @ 1.1M iters is so-so.

> You can check the model with TTS here: https://colab.research.google.com/drive/1Zg9jR27Pr-ziVa0krjtdoy2dKv6whv7b

`waveform.flatten()` is required; otherwise the output is a 3D tensor of shape 1x1xsample_length.
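A tiny illustration of the shape issue, using a NumPy array as a stand-in for the vocoder output (the notebook's actual tensor type and variable names may differ; `sample_length` here is an arbitrary value):

```python
import numpy as np

sample_length = 22050
# Stand-in for a vocoder output with batch and channel dimensions of size 1.
waveform = np.zeros((1, 1, sample_length), dtype=np.float32)
print(waveform.shape)  # (1, 1, 22050)

# Most audio players and writers expect a 1D array of samples.
audio = waveform.flatten()
print(audio.shape)  # (22050,)
```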
First verse from "Milky Chance -The Game", vocoder 400K EN, LJSpeech, ParallelWaveGan, +spleeter, +manual lyrics alignment in audacity
@nartes Thank you for sharing interesting sample! Did you remove vocal using spleeter and then mix generated voice with separated instrument music?
Yeah. Also, feeding the lyrics line by line into Tacotron synthesizes distinct sentences, and later I needed to stretch the audio by 20% with a default algorithm in Audacity to match the original verse by Milky Chance.

> yeah, also feeding lyrics line by line into Tacotron synthesizes distinctive sentences, and later i needed to stretch an audio with some default algo in audacity by 20% to match an original verse by Milky Chance.

Interesting! Great work:)
What happens if the 100K-step restriction for the discriminator is replaced with, say, 0K? I mean, not pretraining the model before applying the adversary.
I tried. The quality is OK but it contains a strange noise.
https://github.com/kan-bayashi/ParallelWaveGAN/issues/27#issue-517832548
Another question: what options are available to reduce model capacity so that it supports a higher batch size, like 32 or 64, with at least a 16K sampling rate?
You can try MelGAN or reduce batch_max_steps.
What are GPU memory requirements at the moment? I have about 12GiB of GPU RAM with vanilla WaveNet, 16K sampling rate, 6 batch size. It's about 4-6M of weight parameters.
You mean training? Please check the header comments in config. https://github.com/kan-bayashi/ParallelWaveGAN#results There is information about the required gpu.
Is it correct that without the discriminator, pure STFT loss produces artifacts in the synthesis? That was my conclusion after listening to intermediate ParallelWaveGAN examples from checkpoints before 130K.
If we use only STFT-loss, the sound will be like a robot. The adversarial training is needed to improve the perceptual quality.
Q: Is there some place where the melspectrogram is transformed? I didn't find it in the repo source code.
It is being trained to decode melspectrogram.
Input features are generated with https://github.com/kan-bayashi/ParallelWaveGAN/blob/53ae639e355591299a94a004c22c49a32b70cc1f/parallel_wavegan/bin/preprocess.py#L25
Yet when comparing the FastSpeech melspectrogram synthesizer with the output from logmelfilterbank, I get different boundaries for the magnitude:
Different synthesized audio:
Below are examples of the magnitude distribution using a public Google Colab notebook linked in the repo.
I replaced the synthesized melspectrogram `c` with `mel_spec2`.
The output is audible but noisy, and it gets better if the spectrogram is shifted, like `mel_spec2 + 2.0`. I used the 1000K MelGAN in the notebook.
With the 400K MelGAN, without shifting the melspectrogram, the output is too quiet and has a constant artificial frequency noise throughout.
The melspectrogram is normalized to be mean=0, variance=1 using the statistics of training data. https://github.com/kan-bayashi/ParallelWaveGAN/blob/53ae639e355591299a94a004c22c49a32b70cc1f/egs/ljspeech/voc1/run.sh#L97-L102

> Yet comparing FastSpeech melspectrogram synthesizer with output from logmelfilterbank. I get different boundaries for magnitude:

This is because I normalize the features to mean=0, variance=1. Explicit maximum and minimum values are not defined.

> With 400K MelGAN, without shifting the melspectrogram, an output is too quiet and has a constant artificial frequency noise along.

I'm not sure what you are doing, but my pretrained models assume that the inputs are normalized using the statistics of training data.
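A minimal sketch of the normalization step described here, assuming per-mel-bin mean and standard deviation computed over the training features. The data below is synthetic, and the names are hypothetical; the repo's own statistics file should be used with the pretrained models.

```python
import numpy as np


def normalize_mel(mel, mean, scale):
    """Standardize a (n_frames, n_mels) log-mel spectrogram per mel bin."""
    return (mel - mean) / scale


rng = np.random.default_rng(0)

# Synthetic stand-in for log-mel features of the training set.
train_mels = rng.normal(loc=-4.0, scale=2.0, size=(1000, 80))
mean = train_mels.mean(axis=0)
scale = train_mels.std(axis=0)

# Unnormalized features, e.g. from another extractor or a TTS model.
mel_from_tts = rng.normal(loc=-4.0, scale=2.0, size=(50, 80))
vocoder_input = normalize_mel(mel_from_tts, mean, scale)
print(vocoder_input.shape)  # (50, 80)
```

Feeding unnormalized features, or shifting them by a constant such as `+ 2.0`, moves the input away from the distribution the vocoder was trained on, which is consistent with the quiet or noisy output reported above.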
Indeed, forgot about it, thanks!
I observed an interesting behaviour after 138K iters, where the discriminator dominated the training and the generator's train and validation losses both exploded. Do you have any idea why this happens and how to prevent it?
I am training on LJSpeech and basically use the same learning schedule you released with the v2 config for LJSpeech (train the generator until 100K, then enable the discriminator).
Here is the tensorboard screenshot.