Why is phoneme level predictor put before length regulator?

I think the following comment helps explain this. However, its second part still doesn't give the exact reason, so @ming024 may be able to shed more light on it.

> @Liujingxiu23 @WuMing757 When I did the frame-level pitch and energy prediction, the results were not so good and the model tended to predict a constant value for every frame in a phoneme at the testing time since the frame-level hidden features are copied from the same phoneme-level feature. But at training time, the ground-truth pitch and energy values can vary within a phoneme, which differs from the testing time case.
> 
> This problem does not exist in the phoneme-level pitch and energy modeling scenario, so the model performs much better. You may think that how the model knows about the intra-phoneme variation given only a phoneme-level pitch/energy value. I have to say, the decoder is much more powerful than you think and it nails it.

https://github.com/ming024/FastSpeech2/issues/52#issuecomment-827337503
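
The train/test mismatch described in the quoted comment can be illustrated with a small sketch. This is not the repository's code: `length_regulate` and `toy_predictor` are hypothetical names, and the predictor is assumed to be a simple position-wise linear map. After length regulation, every frame inside a phoneme carries the same copied feature, so a frame-level predictor of this kind can only output one value per phoneme. A real convolutional predictor also sees neighbouring frames, so its output would probably be only roughly constant within a phoneme rather than exactly so.

```python
import random

def length_regulate(phoneme_feats, durations):
    """Repeat each phoneme-level feature vector once per frame of its duration."""
    frames = []
    for feat, d in zip(phoneme_feats, durations):
        frames.extend([feat] * d)
    return frames

def toy_predictor(feat, weights):
    """Hypothetical stand-in for a pitch/energy predictor (position-wise linear map)."""
    return sum(w * x for w, x in zip(weights, feat))

rng = random.Random(0)
# Assumed toy sizes: 3 phonemes, 4-dimensional hidden features.
phoneme_feats = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(3)]
durations = [2, 3, 1]  # frames per phoneme
weights = [rng.uniform(-1, 1) for _ in range(4)]

# Frame-level prediction: expand first, then predict on every frame.
frame_feats = length_regulate(phoneme_feats, durations)
frame_pitch = [toy_predictor(f, weights) for f in frame_feats]

# Split the frame-level predictions back into per-phoneme segments.
segments = []
start = 0
for d in durations:
    segments.append(frame_pitch[start:start + d])
    start += d

for i, seg in enumerate(segments):
    # Every frame in a phoneme receives the same predicted value.
    print(f"phoneme {i}: {len(seg)} frame(s), constant = {len(set(seg)) == 1}")
```

Ground-truth frame-level pitch, by contrast, can vary inside a phoneme, so such a predictor cannot fit its training targets. Predicting one value per phoneme before the length regulator avoids this, leaving the intra-phoneme variation to the decoder, as the comment suggests.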

Why is phoneme level predictor put before length regulator? #116