Open begeekmyfriend opened 6 years ago
It really depends on the vocal range of the speaker. It's very rare to find men who speak below 80 Hz (low E on the bass) or women who speak below 125 Hz.
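Before picking a cutoff, a speaker's rough fundamental frequency can be checked directly from audio. Here is a minimal sketch using plain NumPy autocorrelation on a synthetic 100 Hz "voice"; a real pitch tracker (e.g. pYIN) would be more robust on actual speech, and none of these names come from the repo:

```python
import numpy as np

sr = 22050
n = 4096
t = np.arange(n) / sr
# Synthetic "voice": 100 Hz fundamental plus two weaker harmonics.
x = sum(np.sin(2 * np.pi * 100 * k * t) / k for k in (1, 2, 3))

# Autocorrelation peaks at the pitch period; search lags covering 50-400 Hz.
ac = np.correlate(x, x, mode="full")[n - 1:]
lo, hi = sr // 400, sr // 50
lag = lo + int(np.argmax(ac[lo:hi]))
f0 = sr / lag
print(f"estimated F0 = {f0:.1f} Hz")
```

Running something like this over a dataset gives the lower bound of the speaker's F0, which is the quantity that should drive the choice of minimum mel frequency.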
If you don't plan on using WaveNet or another neural vocoder to drive synthesis (i.e. you use Tacotron alone for synthesis), you could set `min_mel_freq` to a low value, or back to 0 as in Keith's original repo. If you change it, don't forget to rerun the data preprocessing step (i.e. `preprocess.py`) so your training data reflects the change.
@shaunmayberry It is still possible to use WaveNet with `min_mel_freq`, and this is exactly what Google does in their Tacotron 2 paper. For the WORLD vocoder, a model can learn to predict fundamental frequencies that are not present in the features, i.e. the missing fundamental. As a matter of fact, humans are able to do this as well...
What really matters is that one only removes frequencies that do not matter. At the low end of the spectrum, that limit depends on the lower bound of the speaker's fundamental frequency.
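To see concretely what a nonzero minimum frequency removes, here is a small hand-rolled mel filterbank sketch (HTK-style mel scale; `fmin` plays the role of `min_mel_freq`, and all names are illustrative rather than the repo's actual API). Any FFT bin below `fmin` receives zero weight in every filter, so energy there never reaches the features:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels, fmin, fmax):
    # Triangular filters whose edges are equally spaced on the mel scale.
    mel_pts = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        lo, mid, hi = bins[i], bins[i + 1], bins[i + 2]
        for k in range(lo + 1, mid):        # rising slope
            fb[i, k] = (k - lo) / (mid - lo)
        for k in range(mid, hi):            # falling slope
            fb[i, k] = (hi - k) / (hi - mid)
    return fb

sr, n_fft = 22050, 1024
freqs = np.arange(n_fft // 2 + 1) * sr / n_fft

for fmin in (0.0, 125.0):
    fb = mel_filterbank(sr, n_fft, 80, fmin, 8000.0)
    # First FFT bin that gets any nonzero weight across all filters.
    lowest = freqs[np.flatnonzero(fb.sum(axis=0))[0]]
    print(f"fmin={fmin}: lowest covered bin = {lowest:.1f} Hz")
```

On these settings the `fmin=125` bank first covers a bin near 129 Hz, so a 100 Hz fundamental contributes nothing to the features, while the `fmin=0` bank covers bins from roughly 21 Hz upward.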
I have noticed that on the tacotron2-work-in-progress branch, `min_mel_freq` in `hparams.py` is set to 125 Hz, which is above the typical pitch value of 100 Hz. My colleague says the pitch will be lost in the mel filter banks if the frequency range does not include 100 Hz. He suggests that `min_mel_freq` be set to 50 or 75.