ming024 / FastSpeech2

An implementation of Microsoft's "FastSpeech 2: Fast and High-Quality End-to-End Text to Speech"
MIT License

Why is the phoneme-level predictor placed before the length regulator? #116

Open jerryuhoo opened 2 years ago

jerryuhoo commented 2 years ago

Why are the phoneme-level pitch and energy predictors applied before the length regulator? If anyone knows, please help me. Thank you!

lordzuko commented 1 year ago

I think the following comment helps explain this. Its second part still doesn't give the exact reason, though; @ming024 might be able to shed more light on it.

> @Liujingxiu23 @WuMing757 When I tried frame-level pitch and energy prediction, the results were not good: at test time the model tended to predict a constant value for every frame within a phoneme, because the frame-level hidden features are copied from the same phoneme-level feature. At training time, however, the ground-truth pitch and energy values can vary within a phoneme, which differs from what happens at test time.
> 
> This problem does not exist with phoneme-level pitch and energy modeling, so the model performs much better. You may wonder how the model can know about intra-phoneme variation given only a phoneme-level pitch/energy value. I have to say, the decoder is much more powerful than you think, and it nails it.

https://github.com/ming024/FastSpeech2/issues/52#issuecomment-827337503
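
To make the mismatch in the quoted comment concrete, here is a minimal sketch in plain Python (the repository itself uses PyTorch). Everything in it is an assumption for illustration: the function names `length_regulate`, `pointwise_predictor`, and `phoneme_average` are hypothetical, the toy numbers are made up, and the predictor is a bare dot product rather than the repository's variance predictor. Averaging frame values per phoneme is shown as one plausible way to get a phoneme-level target, not necessarily what the repository does.

```python
def length_regulate(phoneme_feats, durations):
    """Copy each phoneme-level feature vector once per frame of that phoneme."""
    frames = []
    for feat, dur in zip(phoneme_feats, durations):
        frames.extend([feat] * dur)
    return frames


def pointwise_predictor(feat, weights):
    """Hypothetical predictor: a dot product applied to one frame at a time."""
    return sum(f * w for f, w in zip(feat, weights))


def phoneme_average(frame_values, durations):
    """Average frame-level values over each phoneme (assumes durations >= 1)."""
    averages, start = [], 0
    for dur in durations:
        segment = frame_values[start:start + dur]
        averages.append(sum(segment) / dur)
        start += dur
    return averages


# Toy data (made up): 3 phonemes, hidden size 2, durations in frames.
phoneme_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
durations = [2, 3, 1]
weights = [100.0, 200.0]

# After length regulation, all frames of a phoneme share the same input,
# so a frame-wise predictor outputs the same value for each of them.
frame_feats = length_regulate(phoneme_feats, durations)
frame_pred = [pointwise_predictor(f, weights) for f in frame_feats]

# Ground-truth frame-level pitch, by contrast, varies inside a phoneme.
frame_pitch_true = [100.0, 120.0, 200.0, 210.0, 190.0, 300.0]

# A phoneme-level target has one value per phoneme, matching what a
# predictor placed before the length regulator can actually express.
phoneme_pitch_true = phoneme_average(frame_pitch_true, durations)

print(frame_pred)
print(phoneme_pitch_true)
```

In this toy setup the frame-level predictions are piecewise constant while the frame-level ground truth is not, which is the train/test gap the comment describes. A real predictor with convolutions over neighbouring frames would not be exactly constant, so treat this only as an idealised picture.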