About pitch_predictor of different resolutions

ming024 / FastSpeech2

An implementation of Microsoft's "FastSpeech 2: Fast and High-Quality End-to-End Text to Speech"

MIT License

1.75k stars, 525 forks

Thanks for the great work.

While reading the code, one question kept me from understanding it fully: why can the pitch_predictor predict pitch at different resolutions?

I see no difference between predicting pitch at the "phoneme level" and at the "frame level", except for the mask argument: in the former case the mask has the length of the input characters, while in the latter it has the length of the mel frames. So why does this work? Does frame-level pitch prediction produce a pitch result with the same length as the input characters, which is then passed through masked_fill with zero values?
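To make the question concrete, here is a minimal sketch of one possible reading, written with plain Python lists rather than the repository's actual code. It assumes the predictor acts position-wise, so its output length simply follows its input length, and that in the frame-level case the hidden sequence is expanded to mel-frame length before the predictor is called. The names `toy_pitch_predictor` and `length_regulate` are hypothetical, and whether the repository orders the steps this way is an assumption.

```python
# Toy stand-ins (hypothetical, not the repository's code) written with plain lists.

def toy_pitch_predictor(hidden, mask):
    """Map each position of `hidden` to one scalar and zero out padded positions.

    The output always has the same length as `hidden`, whatever that length is.
    """
    assert len(hidden) == len(mask)
    out = [sum(vec) / len(vec) for vec in hidden]  # position-wise "prediction"
    # Mimics the effect of masked_fill(mask, 0.0): padded positions become 0.0.
    return [0.0 if is_pad else value for value, is_pad in zip(out, mask)]


def length_regulate(hidden, durations):
    """Repeat each phoneme vector `d` times to reach frame resolution."""
    expanded = []
    for vec, d in zip(hidden, durations):
        expanded.extend([vec] * d)
    return expanded


# Three phoneme positions; the last one is padding.
phoneme_hidden = [[1.0, 3.0], [2.0, 4.0], [0.0, 0.0]]
src_mask = [False, False, True]
durations = [2, 3, 0]

# Phoneme level: predict before expansion; the mask has phoneme length.
phoneme_pitch = toy_pitch_predictor(phoneme_hidden, src_mask)

# Frame level: expand first, then predict; the mask has mel-frame length.
frame_hidden = length_regulate(phoneme_hidden, durations)
mel_mask = [False] * len(frame_hidden)
frame_pitch = toy_pitch_predictor(frame_hidden, mel_mask)

print(len(phoneme_pitch), len(frame_pitch))
```

Under these assumptions the same predictor yields 3 values in the phoneme-level call and 5 in the frame-level call, because only the input sequence and the mask differ between the two.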

About pitch_predictor of different resolutions #162