athenasaurav opened this issue 1 year ago
Hiya, I do some work in TTS and signal processing, and my experience has been like yours. I tried training 8kHz models on many architectures and they have a very, very hard time learning a good correspondence between text and spec or text and wav. I can't really say why for sure, but my personal theory is that the human voice has many resonant/turbulent frequencies above 4kHz which are important to phone identity. For example, the difference between "s", "th" and "f" is almost entirely high-frequency spectra (see https://home.cc.umanitoba.ca/~robh/howto.html). When we use data sampled at 8kHz, we lose everything above 4kHz (the Nyquist frequency). I think without these "clues" from the high frequencies, the model is sort of lost, scared and confused... :)

You can see in this paper on LibriTTS (https://www.arxiv-vanity.com/papers/1904.02882/) that they discuss why they use a higher sampling rate (24kHz versus 16kHz for LibriSpeech) and say that 16kHz is "too low" to achieve "high quality" TTS. I assume this is a qualitative judgment (they go on to measure WER on 16 versus 24 kHz TTS) rather than a claim about what is literally possible (we know 16kHz is definitely possible), but it still seems to me like another clue that 8kHz is "too low".
I wonder if the problem is magnified by the use of mel spectrograms?
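If you want to see what gets thrown away on your own recordings, sox makes the comparison easy (assuming sox is installed; the file names below are just placeholders for a clip containing an "s"/"f"/"th" sound):

```sh
# Spectrogram at the native rate (e.g. 22050 Hz) - fricative energy extends well above 4 kHz
sox clip_22050.wav -n spectrogram -o spec_22050.png

# Downsample to 8 kHz and plot again - everything above the 4 kHz Nyquist limit is gone
sox clip_22050.wav clip_8000.wav rate 8000
sox clip_8000.wav -n spectrogram -o spec_8000.png
```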
Thank you @rachel-beeson-connex and @synesthesiam for the reply. So it seems there is no way other than downsampling a higher-sample-rate output, but that will increase the latency. Any other suggestions?
I think even if you downsample the parameter size of the model remains the same - maybe synesthesiam can confirm? However, if you want a smaller parameter size model, you can use the --quality flag when you're training and set it to 'x-low'. I did try this setting once for training and the results were decent but not perfect. I think it benefits from a longer training time.
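If it helps, the flag just goes on the normal training command, something like this (I'm going from memory of Piper's training docs, so double-check the exact flag names against your checkout; paths are placeholders):

```sh
python3 -m piper_train \
  --dataset-dir /path/to/preprocessed/ \
  --quality x-low \
  --batch-size 32 \
  --max_epochs 10000
```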
Thanks @rachel-beeson-connex for giving it a thought. Yes, I did try that, and it doesn't work like that.
Hello Everyone,
I am trying to train an 8000 Hz model using Piper, but the voices are not clear after 4353 epochs. It sounds like the model is mumbling.
I have my own dataset recorded at 8000 Hz.
Here is a sample of the original recording:
Also, the generated audio sounds like this:
The preprocessing was done like this:
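(Roughly the standard command from Piper's TRAINING.md, just with the sample rate set to 8000; the paths, language and dataset format below are placeholders for my actual settings.)

```sh
python3 -m piper_train.preprocess \
  --language en-us \
  --input-dir /path/to/my_8khz_dataset/ \
  --output-dir /path/to/train_8khz/ \
  --dataset-format ljspeech \
  --single-speaker \
  --sample-rate 8000
```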
And for training, I am running this script:
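(Essentially the stock training invocation from TRAINING.md pointed at the preprocessed 8000 Hz data; paths are placeholders.)

```sh
python3 -m piper_train \
  --dataset-dir /path/to/train_8khz/ \
  --accelerator gpu \
  --devices 1 \
  --batch-size 32 \
  --validation-split 0.0 \
  --num-test-examples 0 \
  --max_epochs 10000 \
  --checkpoint-epochs 1 \
  --precision 32 \
  --quality medium
```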
Please note that I am not fine-tuning an existing model, but training from scratch. Should I use a pre-trained 22050 Hz model as a starting point for my 8000 Hz model?
Thank you for your help.