kan-bayashi / ParallelWaveGAN

Unofficial Parallel WaveGAN (+ MelGAN & Multi-band MelGAN & HiFi-GAN & StyleMelGAN) with Pytorch
https://kan-bayashi.github.io/ParallelWaveGAN/
MIT License

TTS + ParallelWaveGAN progress #36

Closed · erogol closed this issue 4 years ago

erogol commented 5 years ago

If you don't mind, I'd like to share my progress with PWGAN + TTS.

Here is the first try results: https://soundcloud.com/user-565970875/sets/ljspeech_tacotron_5233_paralle

The results are not better than what we have with WaveRNN, but I should say it is much faster.

There is a hissing noise in the background. If you have any idea how to get rid of it, please let me know.

The only difference in training (I guess) is that I don't apply mean normalization to the mel spectrograms; instead I normalize them to the [-4, 4] range.

kan-bayashi commented 5 years ago

Hi @erogol. Thank you for sharing your progress! We have also tried to combine our TTS with Parallel WaveGAN. You can check our results on Google Drive: https://drive.google.com/open?id=1HvB0_LDf1PVinJdehiuCt5gWmXGguqtx (phn_train_no_dev_pytorch_train_pytorch_tacotron2.v3 may be the best one to compare with yours.) The basic quality is almost the same as yours, but our samples have little or no hissing noise. I think this noise comes from the normalization difference.

How did you perform normalization? Did you do generate -> denormalize -> normalize with our stats -> synthesize?
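For concreteness, here is a minimal sketch of that pipeline in numpy; the names tts_model_output, vocoder_mean, vocoder_std, and vocoder are hypothetical placeholders for each model's own outputs and statistics, and the dB range assumes the [-4, 4] scheme quoted below:

import numpy as np

max_norm, min_level_db = 4.0, -100.0

# 1. generate: mel from the TTS model, shape (frames, num_mels), in [-4, 4]
mel_norm = tts_model_output

# 2. denormalize back to the raw log-magnitude (dB) scale
mel_db = (mel_norm + max_norm) / (2 * max_norm) * -min_level_db + min_level_db

# 3. renormalize with the vocoder's own training statistics
mel_for_vocoder = (mel_db - vocoder_mean) / vocoder_std

# 4. synthesize
# waveform = vocoder(mel_for_vocoder)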

erogol commented 5 years ago

I use the same normalization scheme for TTS and PWGAN, which is:

import numpy as np

self.min_level_db = -100
self.max_norm = 4
# map log-magnitude (dB) values from [min_level_db, 0] into [0, 1]
S_norm = (S - self.min_level_db) / -self.min_level_db
# rescale [0, 1] into [-max_norm, max_norm] and clip outliers
S_norm = (2 * self.max_norm) * S_norm - self.max_norm
S_norm = np.clip(S_norm, -self.max_norm, self.max_norm)

Did you also observe hissing without mean normalization, if you've tried it?

And the above results are with ESPnet TTS, right?

kan-bayashi commented 5 years ago

I see. I've never tried it without mean-var normalization. How about the analysis-synthesis sample? Is there hissing noise there as well?

Yes, the samples are generated by ESPnet-TTS.

kan-bayashi commented 5 years ago

I came up with one idea: in the case of WaveGlow, reducing the standard deviation of the input noise can reduce high-frequency noise. It is worth trying to change it at inference time.
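A rough sketch of that trick (the call signature here is hypothetical, assuming a generator that maps an input noise tensor z plus mel conditioning c to a waveform, with z drawn from N(0, 1) during training; hop_size is the frames-to-samples upsampling factor):

import torch

noise_std = 0.6  # try values below 1.0 and listen for the trade-off
z = torch.randn(1, 1, c.size(-1) * hop_size) * noise_std  # scaled input noise
with torch.no_grad():
    waveform = generator(z, c)  # hypothetical generator call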

erogol commented 5 years ago

@kan-bayashi could you tell me the intuition behind mean-var normalization in this case? Or is it just the way GANs are generally trained?

What is analysis-synthesis? Is it just testing the trained model with the ground-truth training spectrograms?

Yeah, makes sense. I'll try the std trick and let you know the result. Thanks for the suggestion!

kan-bayashi commented 5 years ago

Analysis-synthesis means synthesis from natural (ground-truth) features. I've never compared normalization methods. Actually, the reason I used mean-var normalization is that I come from ASR; I just followed the convention there, so I have no technical reason... Sorry...
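A minimal sketch of that ASR-style mean-var normalization, with per-dimension statistics computed over the whole training set (mels is assumed to be a list of (frames, num_mels) log-mel arrays):

import numpy as np

all_frames = np.concatenate(mels, axis=0)  # stack every training utterance
mean = all_frames.mean(axis=0)             # per-dimension mean
std = all_frames.std(axis=0)               # per-dimension std

def normalize(mel):
    return (mel - mean) / std

def denormalize(mel):
    return mel * std + mean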

erogol commented 5 years ago

Alas, the std trick does not work: a lower std causes voice loss, and a higher one makes the noise worse. I'll try mean-var normalization and see if it does the trick.

It might be about the weak TTS model as well. The first one uses r=2 and is a much smaller Tacotron model. I'm now also training a Tacotron2 model.

I should also say that more training helps reduce the noise.

kan-bayashi commented 5 years ago

Thanks for your report, @erogol. I also tried combining r = 2 and r = 3 Tacotron2 / Transformer models. They seem to work without hissing noise. (You can try them in the Colab of the Japanese demo.) I'm looking forward to seeing the difference between the normalization methods.

> I should also say that more training helps reduce the noise.

You mean more iterations of Parallel WaveGAN improve the quality?

erogol commented 5 years ago

@kan-bayashi yes, more iterations with PWGAN improved the quality.

Then we have only two possible reasons:

  1. normalization
  2. tacotron vs tacotron2

erogol commented 5 years ago

@kan-bayashi are the examples on Google Drive with tacotron2.v3 trained with the config train_pytorch_tacotron2.v3.yaml?

And which attention mechanism for Tacotron2 has worked best for you so far? (I'm just trying to replicate one-to-one experiments with Mozilla TTS.)

kan-bayashi commented 5 years ago

@erogol Yes, the model is based on that config. https://github.com/espnet/espnet/blob/master/egs/ljspeech/tts1/conf/tuning/train_pytorch_tacotron2.v3.yaml

I compared forward attention with a transition agent against location-sensitive + guided attention. The MOS and CER are almost the same on the LJSpeech dataset. (Note that for location-sensitive attention we use attention accumulation.)

rishikksh20 commented 5 years ago

@erogol I can confirm that the hissing noise is related to the normalization method. I am using the ESPnet FastSpeech architecture trained with my own preprocessing ([-4, 4] normalization instead of mean-var), and I also get a constant hiss-like noise with both the Griffin-Lim vocoder and WaveRNN. Though I haven't trained WaveRNN on GTA features generated by FastSpeech; I just used a pre-trained LJSpeech model with the same preprocessing. I'm not yet familiar with the FastSpeech architecture, but I plan to do the same thing with ESPnet's Taco2 (I'm very comfortable with that architecture and its training); if I still get the same result, then I'd definitely like to dig deep into the architecture and training implemented in ESPnet and ParallelWaveGAN.

erogol commented 5 years ago

@rishikksh20 as far as I understand, you did not actually try Parallel WaveGAN but WaveRNN, is that right? Did you also observe better results with mean-var normalization for the same models above?

erogol commented 5 years ago

I forgot to trim silences. Retraining with silences trimmed made it better. There is still a slight background noise, but it is there in the original recordings too.
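For reference, a minimal silence-trimming sketch with librosa (the top_db threshold is just an example value and usually needs tuning per dataset):

import librosa

wav, sr = librosa.load("sample.wav", sr=22050)
trimmed, _ = librosa.effects.trim(wav, top_db=40)  # drop leading/trailing silence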

kan-bayashi commented 5 years ago

@erogol Do you mean that the hissing noise is caused by the silence parts? If so, the difference in normalization methods did not affect the quality, right?

erogol commented 5 years ago

@kan-bayashi I guess so, but before being sure I need to run a model with mean-var normalization too. But if I compare the latest Mozilla TTS results with the ESPnet examples above, they are on par.

erogol commented 4 years ago

Actually, I realized that training longer exacerbated the problem and made the noise more apparent. I'll try mean-var normalization next time.

erogol commented 4 years ago

The problem above seems to have been entirely about silence trimming. After fixing that, things started to work well.

I can also tell that preemphasis helps the model converge more stably. [attached image]
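For reference, preemphasis is typically applied as a first-order high-pass filter on the waveform before feature extraction (0.97 is a common coefficient; the helper names here are just for illustration):

from scipy.signal import lfilter

def preemphasis(wav, coef=0.97):
    # y[n] = x[n] - coef * x[n-1]
    return lfilter([1.0, -coef], [1.0], wav)

def inv_preemphasis(wav, coef=0.97):
    # inverse filter, applied to the synthesized output
    return lfilter([1.0], [1.0, -coef], wav)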