keithito / tacotron

A TensorFlow implementation of Google's Tacotron speech synthesis with pre-trained model (unofficial)

The pitch will be lost since it is below the threshold of min_mel_freq in hparam #157

Open begeekmyfriend opened 6 years ago

begeekmyfriend commented 6 years ago

I have noticed that on the tacotron2-work-in-progress branch, min_mel_freq in hparams.py is set to 125 Hz, which is above a typical pitch of around 100 Hz. My colleague says the pitch will be lost in the mel filter banks if the frequency range does not include 100 Hz. He suggests setting min_mel_freq to 50 or 75.
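
For context, here is a minimal sketch of the effect being described (it uses librosa directly; the sample rate, FFT size, and mel count are assumed for illustration, not taken from this repo's defaults). A mel filterbank built with fmin=125 assigns zero weight to the FFT bin nearest 100 Hz, so a 100 Hz fundamental never reaches the mel spectrogram:

```python
# Hypothetical illustration: how much weight a mel filterbank gives the 100 Hz bin.
import numpy as np
import librosa

sr, n_fft, n_mels = 20000, 2048, 80            # assumed values, not repo defaults
freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)     # center frequency of each FFT bin
bin_100hz = int(np.argmin(np.abs(freqs - 100.0)))

for fmin in (0.0, 125.0):
    mel_basis = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels, fmin=fmin)
    weight = mel_basis[:, bin_100hz].sum()     # total filter weight at ~100 Hz
    print(f"fmin={fmin:5.1f} Hz -> weight at the 100 Hz bin: {weight:.6f}")
# Expected: nonzero weight with fmin=0, exactly zero with fmin=125.
```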

rafaelvalle commented 6 years ago

It really depends on the vocal range of the speaker. It's very rare to find men who speak below 80 Hz (low E on the bass) or women who speak below 125 Hz.

SynthAether commented 6 years ago

If you don't plan on using WaveNet or another neural vocoder to drive synthesis (i.e. you just use Tacotron for synthesis), you could set min_mel_freq to a low value, or back to 0 as in Keith's original repo. If you change it, don't forget to rerun the data preprocessing step (preprocess.py) so your training data reflects the change.
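
To make concrete what re-extracting features with a lower fmin changes, here is a minimal sketch (a synthetic 100 Hz tone pushed through librosa's mel spectrogram, not this repo's preprocess.py; all parameter values are assumed):

```python
# Hypothetical illustration: the same 100 Hz tone under two fmin settings.
import numpy as np
import librosa

sr = 20000                                     # assumed sample rate
t = np.arange(sr) / sr                         # one second of audio
tone = np.sin(2 * np.pi * 100.0 * t)           # pure 100 Hz fundamental

for fmin in (125.0, 0.0):
    S = librosa.feature.melspectrogram(y=tone, sr=sr, n_fft=2048,
                                       n_mels=80, fmin=fmin)
    print(f"fmin={fmin:5.1f} Hz -> peak mel energy: {S.max():.6f}")
# With fmin=125 the tone falls below every filter and only window leakage
# remains; with fmin=0 the lowest bands capture it at full strength.
```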

rafaelvalle commented 6 years ago

@shaunmayberry it is still possible to use WaveNet with min_mel_freq set, and this is exactly what Google does in their Tacotron 2 paper. As for the WORLD vocoder, a model can learn to predict fundamental frequencies that are not present, i.e. the missing fundamental. As a matter of fact, humans are able to do the same...

What really matters is that one only removes frequencies that do not matter. On the low end, that depends on the lower bound of the speaker's fundamental frequency.
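
For what it's worth, the missing fundamental is easy to demonstrate numerically. The sketch below (pure numpy, synthetic signal, all values chosen for illustration) builds a signal from the 200/300/400 Hz harmonics of a 100 Hz fundamental, with no energy at 100 Hz itself; its autocorrelation still peaks at a 10 ms lag, i.e. a 100 Hz period, which is what a periodicity-based pitch tracker (or a learned model) can recover:

```python
# Hypothetical illustration: recovering a fundamental that is absent from the spectrum.
import numpy as np

sr, n = 20000, 4000                            # assumed rate; 0.2 s of signal
t = np.arange(n) / sr
sig = sum(np.sin(2 * np.pi * f * t) for f in (200.0, 300.0, 400.0))

ac = np.correlate(sig, sig, mode="full")[n - 1:]   # autocorrelation, lags 0..n-1
lag = int(np.argmax(ac[50:])) + 50             # skip the lag-0 peak and its skirt
print(f"estimated F0: {sr / lag:.1f} Hz")      # ~100.0 Hz, with no 100 Hz partial
```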