kan-bayashi / ParallelWaveGAN

Unofficial Parallel WaveGAN (+ MelGAN & Multi-band MelGAN & HiFi-GAN & StyleMelGAN) with PyTorch
https://kan-bayashi.github.io/ParallelWaveGAN/
MIT License

Mismatched energy levels #164

Closed · ahmed-fau closed this 4 years ago

ahmed-fau commented 4 years ago

Hi, thanks for sharing this great work. I have just one question:

During the training of all the models, did you face a problem of mismatched energy levels between the target and generated speech waveforms? That is, the training evolves properly and the quality improves over epochs, but the generated signal has lower loudness than the original one.

If so, what would you propose to tackle this? At least in the shared samples from your trained models, I cannot find that problem.

Many thanks in advance

kan-bayashi commented 4 years ago

I did not face the problem (or I was not aware of it), but I have heard of a similar issue from several people. Could you provide more information?

kan-bayashi commented 4 years ago

I checked the samples of LJSpeech. They seem to have no problem. Maybe the training data is related?

[Screenshot: LJSpeech samples, 2020-06-05]
dathudeptrai commented 4 years ago

@kan-bayashi I got the same issue with my Vietnamese dataset on an old version of this repo :(. After re-computing y_hat, the problem was solved. But I don't think it is important, because it is a systematic problem; you just need to multiply the mel spectrogram by 1.2-1.5.

kan-bayashi commented 4 years ago

@dathudeptrai Did you check both melgan and PWG?

@ahmed-fau I confirmed that if the utterances include long silence in the middle, it affects the quality. To deal with this problem, I trimmed the silence using forced-alignment results, which worked well. If you have the text and an aligner, it is worth trying.
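
A minimal sketch of this kind of alignment-based trimming, assuming the forced-alignment output is available as a list of (start, end) timestamps in seconds; the function name `trim_long_silence` and the 0.2 s padding are illustrative, not taken from this repo:

```python
import numpy as np


def trim_long_silence(wav, sr, segments, max_pause=0.2):
    """Keep only the aligned speech regions plus a short pause around them.

    wav      : 1-D numpy array containing the waveform.
    sr       : sampling rate in Hz.
    segments : list of (start_sec, end_sec) tuples for voiced regions,
               e.g. taken from forced-alignment results.
    max_pause: maximum silence (in seconds) kept on each side of a segment.
    """
    pad = int(max_pause * sr)
    pieces, prev_end = [], 0
    for start_sec, end_sec in segments:
        start = max(prev_end, int(start_sec * sr) - pad)
        end = min(len(wav), int(end_sec * sr) + pad)
        pieces.append(wav[start:end])
        prev_end = end
    return np.concatenate(pieces) if pieces else wav


# Usage (hypothetical alignment output):
# wav, sr = soundfile.read("utt0001.wav")
# segments = [(0.35, 1.80), (2.95, 4.10)]  # voiced regions from the aligner
# wav = trim_long_silence(wav, sr, segments)
```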

ahmed-fau commented 4 years ago

@kan-bayashi Thanks for your prompt reply. Actually, I am training a custom model on the LJSpeech dataset and found that this issue occurs regardless of the model/training settings I use. I thought it was due to the normalization of the mel spectrogram (as I use a different normalization method instead of the offline approach), but it seems that normalization is not the critical factor in solving this.

I confirmed that if the utterances include long silence in the middle, it affects the quality. To deal with this problem, I trimmed the silence using forced-alignment results, which worked well. If you have the text and an aligner, it is worth trying.

That's interesting to try. I am training my models using segments of ~1 sec length. Do you think this is too long, so that silence periods would have an effect there?

Have you used any waveform reconstruction loss (e.g. L1 norm) between the generated and original signals just to track the waveform matching/convergence?

ahmed-fau commented 4 years ago

@kan-bayashi In the official MelGAN repo, the reconstructed signals for the LJ dataset have the same problem. I am using their approach for calculating the mel spectrograms (online during training): I first cut a random slice of a ~1 sec window from the audio clip (16384 samples at a 16 kHz sampling rate) and then calculate the corresponding mel spectrogram online (with the same parameters as yours except the sampling rate). Do you think this way of data preparation is less accurate than yours?
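
A minimal sketch of the online preparation described above (random ~1 s slice, then mel spectrogram computed on the fly); the STFT/mel parameters are illustrative, not taken from either repo:

```python
import numpy as np
import librosa


def random_slice_and_mel(wav, sr=16000, slice_len=16384,
                         n_fft=1024, hop_length=256, n_mels=80):
    """Cut a random ~1 s window and compute its log-mel spectrogram online."""
    if len(wav) > slice_len:
        start = np.random.randint(0, len(wav) - slice_len)
        wav = wav[start:start + slice_len]
    else:
        wav = np.pad(wav, (0, slice_len - len(wav)))
    mel = librosa.feature.melspectrogram(
        y=wav, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
    log_mel = np.log(np.maximum(mel, 1e-10))
    return wav, log_mel  # shapes: (slice_len,), (n_mels, n_frames)
```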

kan-bayashi commented 4 years ago

OK. So you did not use my repository, and your question is about the general topic of MelGAN training, right?

I am training my models using segments of ~1 sec length. Do you think this is too long, so that silence periods would have an effect there?

I also use randomly trimmed segments of ~1 sec for PWG and 8196 points for MelGAN, so the length is OK. The point is the ratio of the voiced part to the silent part in the training data.

Have you used any waveform reconstruction loss (e.g. L1 norm) between the generated and original signals just to track the waveform matching/convergence?

In my repository, I always use multi-resolution STFT loss.
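
A condensed sketch of such a multi-resolution STFT loss (spectral convergence plus log STFT magnitude, averaged over several FFT settings); the resolution values below are illustrative rather than the repo's exact configuration:

```python
import torch
import torch.nn.functional as F


def stft_loss(x, y, fft_size, hop_size, win_length):
    """Spectral convergence + log STFT magnitude loss at one resolution."""
    window = torch.hann_window(win_length, device=x.device)
    x_mag = torch.stft(x, fft_size, hop_size, win_length,
                       window=window, return_complex=True).abs().clamp(min=1e-7)
    y_mag = torch.stft(y, fft_size, hop_size, win_length,
                       window=window, return_complex=True).abs().clamp(min=1e-7)
    sc = torch.norm(y_mag - x_mag, p="fro") / torch.norm(y_mag, p="fro")
    mag = F.l1_loss(torch.log(x_mag), torch.log(y_mag))
    return sc + mag


def multi_resolution_stft_loss(x, y, resolutions=((1024, 120, 600),
                                                  (2048, 240, 1200),
                                                  (512, 50, 240))):
    """Average the single-resolution loss over several (fft, hop, win) settings.

    x, y: generated and target waveforms, both shaped (batch, samples).
    """
    return sum(stft_loss(x, y, *r) for r in resolutions) / len(resolutions)
```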

Do you think this way of data preparation is less accurate than yours?

I also use randomly trimmed audio for the batch (the length is ~1 sec), so the difference is the normalization. I always use mean-var normalization. What is your normalization method?

ahmed-fau commented 4 years ago

OK. So you did not use my repository, and your question is about the general topic of MelGAN training, right?

Yes. Actually, I find that the energy levels of your samples look the same as the original ones, in contrast to other MelGAN/PWGAN repos whose samples have mismatched energy levels. That's why I asked here whether you have some tips about that. Sorry for any unintentional confusion.

I also use randomly trimmed audio for the batch (the length is ~1 sec), so the difference is the normalization. I always use mean-var normalization. What is your normalization method?

I use instance normalization without affine parameter learning, so it just normalizes the mel spectrogram to zero mean and unit variance, but along the channel dimension instead of the batch dimension.

Another difference is that my implementation first slices the 1 sec audio segment and then calculates its corresponding mel spectrogram during training, whereas yours first slices a mel spectrogram and then gets the corresponding audio segment from the dataset.
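
A minimal sketch of this kind of per-slice (instance) normalization, computed independently for every sliced mel spectrogram; the function name is illustrative:

```python
import torch


def instance_norm_mel(log_mel, eps=1e-5):
    """Normalize one sliced log-mel spectrogram to zero mean / unit variance,
    with statistics computed per mel channel over this slice only.

    log_mel: tensor of shape (n_mels, n_frames) for a single segment.
    """
    mean = log_mel.mean(dim=-1, keepdim=True)  # per-channel mean of this slice
    std = log_mel.std(dim=-1, keepdim=True)    # per-channel std of this slice
    return (log_mel - mean) / (std + eps)
```

Because the statistics are recomputed for every random slice, a loud slice and a quiet slice end up with similarly scaled inputs, so the absolute level information is not available to the vocoder; this is essentially the mismatch described in the next comment.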

kan-bayashi commented 4 years ago

I use instance normalization without affine parameter learning, so it just normalizes the mel spectrogram to zero mean and unit variance, but along the channel dimension instead of the batch dimension. Another difference is that my implementation first slices the 1 sec audio segment and then calculates its corresponding mel spectrogram during training.

In that case, the procedure is:

  1. randomly cut the audio (~1 sec)
  2. calculate the mel spectrogram
  3. perform mean-var normalization on each sliced mel spectrogram while keeping the audio waveform as the original

Is my understanding correct?

If you perform the above procedure, I think the relationship between the power of the waveform and that of the mel spectrogram is different for each item in the batch. Maybe this causes the mismatch in the power level.

In my implementation, I first calculate the mean and variance over the training data and then perform mean-var normalization. Therefore, the relationship is consistent across the training data.
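
A condensed sketch of this corpus-level approach (statistics computed once over the whole training set, then the same transform applied to every utterance); the function names are illustrative, not the repo's actual scripts:

```python
import numpy as np


def compute_corpus_stats(mel_iter):
    """Accumulate per-channel mean and standard deviation over all training
    log-mel spectrograms (each shaped (n_mels, n_frames))."""
    total = total_sq = 0.0
    n_frames = 0
    for mel in mel_iter:
        total = total + mel.sum(axis=1)
        total_sq = total_sq + (mel ** 2).sum(axis=1)
        n_frames += mel.shape[1]
    mean = total / n_frames
    var = total_sq / n_frames - mean ** 2
    return mean, np.sqrt(np.maximum(var, 1e-10))


def normalize_mel(mel, mean, std):
    """Apply the same global mean-var transform to every utterance, keeping the
    mapping between waveform power and mel values consistent across the corpus."""
    return (mel - mean[:, None]) / std[:, None]
```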

kan-bayashi commented 4 years ago

I will close this issue. If you report your progress, it will be helpful for other users.

ahmed-fau commented 4 years ago

@kan-bayashi Thanks, I was just waiting for the training to finish.

I can confirm that your normalization approach has fixed this problem seamlessly, even without trimming the silent parts. After ~100 epochs, my custom model can track the original loudness/energy levels, in contrast to the old normalization approach, where the levels were bounded to fixed values throughout training even though the overall quality kept improving.

The main reason I didn't use your normalization method is that the official MelGAN repo doesn't apply it, and I was biased toward following their preprocessing approach. Their trained models have this problem even at a higher number of epochs (~6000).

The convergence of the spectral reconstruction loss is clearly faster after normalization. The model reaches a total spectral loss (spectral convergence + STFT magnitude) of ~0.94 at 130 epochs, whereas the old one reaches a loss value of ~1.05 after 960 epochs on the LJSpeech dataset with the same hyper-parameters.

Many thanks for your support.