Hi @erogol. Thank you for sharing your progress!
We have also tried combining our TTS with Parallel WaveGAN.
You can check our results on Google Drive:
https://drive.google.com/open?id=1HvB0_LDf1PVinJdehiuCt5gWmXGguqtx
(Maybe phn_train_no_dev_pytorch_train_pytorch_tacotron2.v3 is nice to compare with yours.)
The basic quality is almost the same as yours, but our samples have little or no hissing noise compared with yours. I think this noise comes from the normalization difference.
How did you perform normalization? Did you do generate -> denormalize -> normalize with our stats -> synthesis?
I use the same normalization scheme for TTS and PWGAN, which is:
self.min_level_db = -100
self.max_norm = 4
# scale dB features from [min_level_db, 0] to [0, 1]
S_norm = (S - self.min_level_db) / -self.min_level_db
# rescale to [-max_norm, max_norm] and clip outliers
S_norm = (2 * self.max_norm) * S_norm - self.max_norm
S_norm = np.clip(S_norm, -self.max_norm, self.max_norm)
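For reference, the generate -> denormalize -> re-normalize step asked about above could look roughly like the following sketch; mean and std are hypothetical per-bin statistics computed from the vocoder's training features:

import numpy as np

# Minimal sketch: invert the [-4, 4] scaling above, then apply hypothetical
# mean-var statistics so the features match the vocoder's training domain.
def renormalize(S_norm, mean, std, max_norm=4.0, min_level_db=-100.0):
    S = (S_norm + max_norm) / (2 * max_norm)  # back to [0, 1]
    S = S * -min_level_db + min_level_db      # back to [min_level_db, 0] dB
    return (S - mean) / (std + 1e-8)          # mean-var normalize for the vocoder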
Did you also observe hissing without mean normalization, if you've tried that?
And the above results are with ESPnet TTS, right?
I see. I've never tried it without mean-var normalization. How about the analysis-synthesis samples? Is there hissing noise there too?
Yes, the samples are generated by ESPnet-TTS.
I came up with one idea: in the case of WaveGlow, changing the standard deviation of the input noise can reduce high-frequency noise. It is worth trying to change it at inference time.
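For reference, a minimal sketch of that trick, assuming NVIDIA's reference WaveGlow implementation, where the noise scale is the sigma argument of infer(); the checkpoint path and mel tensor are placeholders:

import torch

waveglow = torch.load('waveglow_checkpoint.pt')['model'].cuda().eval()  # placeholder path
mel = torch.randn(1, 80, 500).cuda()  # placeholder mel-spectrogram
with torch.no_grad():
    # Lower sigma shrinks the Gaussian noise fed to the flow; values
    # around 0.6-0.9 are commonly tried against the 1.0 default.
    audio = waveglow.infer(mel, sigma=0.6)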
@kan-bayashi could you tell me the intuition behind mean-var normalization in this case? Or is it just the way GANs are trained in general?
What is analysis-synthesis? Is it just testing the trained model with the training specs?
Yeah makes sense. I'll try the std trick and let you know the result. Thx for the suggestion!
Analysis-synthesis means synthesis with natural (ground-truth) features. I've never compared normalization methods. Actually, the reason I used mean-var normalization is that I come from ASR; I just followed the convention there, so I have no technical reason... Sorry...
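For reference, ASR-style mean-variance normalization amounts to standardizing each mel bin with statistics gathered over the training set; a minimal numpy sketch with placeholder data:

import numpy as np

mels = [np.random.randn(100, 80), np.random.randn(120, 80)]  # placeholder (T, num_mels) log-mels
stacked = np.concatenate(mels, axis=0)
mean = stacked.mean(axis=0)  # per-bin mean over all training frames
std = stacked.std(axis=0)    # per-bin standard deviation
S_norm = (mels[0] - mean) / (std + 1e-8)  # standardize one utterance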
Alas, the std trick does not work: lower values cause voice loss and higher values make the noise worse. I'll try mean-var normalization and see if it does the trick.
It might also be about a weak TTS model. The first one used r=2 and a much smaller Tacotron model; now I am also training a Tacotron2 model.
I should also say that more training helps reduce the noise.
Thanks for your report, @erogol.
I also tried combining r=2 or r=3 Tacotron2 / Transformer. It seems to work without hissing noise. (You can try it in the Colab of the Japanese demo.)
I'm looking forward to seeing the difference between the normalization methods.
> I should also say that more training helps reduce the noise.
You mean more iterations of Parallel WaveGAN improve the quality?
@kan-bayashi yes, more iterations of PWGAN improved the quality.
Then we have only two possible reasons.
@kan-bayashi are the examples on Google Drive with tacotron2.v3, trained with the config train_pytorch_tacotron2.v3.yaml?
And which attention for Tacotron2 has worked best for you so far? (I'm just trying to replicate one-to-one experiments with Mozilla TTS.)
@erogol Yes, the model is based on that config. https://github.com/espnet/espnet/blob/master/egs/ljspeech/tts1/conf/tuning/train_pytorch_tacotron2.v3.yaml
I compared forward attention with transition agent against location-sensitive + guided attention. The MOS and CER are almost the same on the LJSpeech dataset. (Note that in location-sensitive attention we use attention accumulation.)
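For context, the attention accumulation mentioned above feeds the running sum of past alignments to the location convolution alongside the latest alignment; a minimal sketch of that bookkeeping, with placeholder shapes:

import torch

B, T_enc = 2, 100  # placeholder batch size and encoder length
attn_weights = torch.softmax(torch.randn(B, T_enc), dim=-1)  # current alignment
attn_cum = torch.zeros(B, T_enc)  # running sum, updated at every decoder step
attn_cum = attn_cum + attn_weights
# Location-sensitive attention then conditions on both, stacked as channels.
location_feats = torch.stack([attn_weights, attn_cum], dim=1)  # (B, 2, T_enc)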
@erogol I can confirm that the hissing noise is related to the normalization. I am using the ESPnet FastSpeech architecture trained with my own preprocessing (-4, 4 normalization), and I also get a constant hiss-like noise with both the Griffin-Lim vocoder and WaveRNN. Though I haven't trained WaveRNN on GTA features generated by FastSpeech; I just used a pre-trained LJSpeech model with the same preprocessing. I'm not that familiar with the FastSpeech architecture, but I plan to do the same thing with ESPnet's Tacotron2 (which I'm very comfortable with, both architecture and training). If I still get the same result, I'd definitely like to dig deeper into the architecture and training implemented in ESPnet and ParallelWaveGAN.
@rishikksh20 as far as I understand, you did not actually try Parallel WaveGAN but WaveRNN, is that right? Did you also observe better results with mean-var normalization for the same models above?
I forgot to trim silences. Retraining with silences trimmed made it better. There is still a slight background noise, but it is there in the original recordings too.
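For reference, a minimal silence-trimming sketch with librosa; the file path is a placeholder and top_db needs tuning per dataset:

import librosa

y, sr = librosa.load('sample.wav', sr=22050)  # placeholder path
# Drop leading/trailing frames quieter than top_db below the signal peak.
y_trimmed, _ = librosa.effects.trim(y, top_db=40)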
@erogol Do you mean that the hissing noise is caused by the silence parts? If so, the difference between normalization methods did not affect the quality, right?
@kan-bayashi I guess yes, but before being sure I need to run a model with mean-var normalization too. But if I compare the latest Mozilla TTS results with the ESPnet examples above, they are on par.
Actually, I realized that training more exacerbated the problem and made the noise more apparent. I'll try mean-var normalization next time.
The problems above look like they were all about silence trimming. After that, things started to work well.
I can also say that preemphasis helps the model converge more stably.
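For reference, preemphasis is a first-order high-pass filter applied to the waveform before feature extraction; a minimal scipy sketch with a common coefficient and a placeholder signal:

import numpy as np
from scipy import signal

coef = 0.97  # common preemphasis coefficient
y = np.random.randn(22050).astype(np.float32)  # placeholder waveform
y_pre = signal.lfilter([1.0, -coef], [1.0], y)      # preemphasis
y_rec = signal.lfilter([1.0], [1.0, -coef], y_pre)  # de-emphasis (inverse)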
If you don't mind, I'd like to share my progress with PWGAN and TTS.
Here are the first results: https://soundcloud.com/user-565970875/sets/ljspeech_tacotron_5233_paralle
The results are not better than what we have with WaveRNN, but I should say it is much faster.
There is a hissing noise in the background. If you have any idea how to get rid of it, please let me know.
The only difference in training (I guess) is that I don't apply mean normalization to the mel spectrograms; I normalize to the [-4, 4] range.