ming024 / FastSpeech2

An implementation of Microsoft's "FastSpeech 2: Fast and High-Quality End-to-End Text to Speech"
MIT License

About pitch_predictor of different resolutions #162

Open JohnHerry opened 2 years ago

JohnHerry commented 2 years ago

Thanks for the great work.

While reading the code, one question kept me from understanding it fully: why can the pitch_predictor predict pitch at different resolutions?

I see no difference between predicting pitch at the "phoneme level" and at the "frame level", except for the mask argument: in the former case the mask has the length of the input character sequence, while in the latter it has the length of the mel frames. So why does this work? Does the frame-level pitch prediction produce a result with the same length as the input characters, which is then filled with zeros via masked_fill?
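
To make the question concrete, here is a minimal pure-Python sketch of the behaviour being asked about. It is not the repository's code: `toy_pitch_predictor` and `length_regulate` are hypothetical stand-ins, and the sketch assumes that the predictor is position-wise (one output per input step) and that, in frame-level mode, the hidden sequence is expanded by the durations before it reaches the predictor. Under those assumptions the output length simply follows the input length, and the mask only zeroes padded positions.

```python
# Toy illustration only; names and values are hypothetical, not from the repo.

def length_regulate(hidden, durations):
    """Repeat each phoneme-level vector durations[i] times (frame expansion)."""
    expanded = []
    for vec, d in zip(hidden, durations):
        expanded.extend([vec] * d)
    return expanded

def toy_pitch_predictor(seq, mask):
    """Stand-in for a position-wise predictor: one scalar per input step.

    Positions where mask is True are padding and are set to 0.0,
    mimicking what masked_fill(mask, 0.0) does on a tensor.
    """
    raw = [sum(vec) / len(vec) for vec in seq]
    return [0.0 if m else p for p, m in zip(raw, mask)]

# Two real phonemes plus one padding position.
hidden = [[0.1, 0.3], [0.5, 0.7], [0.0, 0.0]]
src_mask = [False, False, True]
durations = [2, 3, 0]

# Phoneme level: the predictor sees the unexpanded sequence and the source mask.
phoneme_pitch = toy_pitch_predictor(hidden, src_mask)

# Frame level: the sequence is expanded first, then predicted with a mel-length mask.
frames = length_regulate(hidden, durations)
mel_mask = [False] * len(frames)
frame_pitch = toy_pitch_predictor(frames, mel_mask)

print(len(phoneme_pitch))  # 3, one value per input position
print(len(frame_pitch))    # 5, one value per mel frame
```

If this assumption holds for the real model, the same predictor module can serve both resolutions, because only the sequence it receives (and the matching mask) changes.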

hhm853610070 commented 1 year ago

Have you tried the pitch and energy features with "frame_level"? I have tried that configuration, but the result is terrible. There is a lot of noise and many mispronunciations in the audio at inference time, yet the audio synthesized for validation during training is good. Do you know why there is such a big difference?