rhasspy / piper

A fast, local neural text to speech system
https://rhasspy.github.io/piper-samples/
MIT License

Issue with training at 8000Hz #170

Open athenasaurav opened 1 year ago

athenasaurav commented 1 year ago

Hello Everyone,

I am trying to train a model at 8000 Hz using Piper, but the voices are not clear even after 4353 epochs; the output sounds like mumbling.

I have my own dataset recorded at 8000Hz.

Here is a sample of the original recording.

The generated audio sounds like this.

The preprocessing was done like this:

python3 -m piper_train.preprocess \
  --language en-us \
  --input-dir path/to/data \
  --output-dir path/to/data/out-dir \
  --dataset-format ljspeech \
  --single-speaker \
  --sample-rate 8000

And for training, I'm running this script:

python3 -m piper_train \
    --dataset-dir path/to/data/out-dir \
    --accelerator 'gpu' \
    --devices 1 \
    --batch-size 32 \
    --validation-split 0.05 \
    --num-test-examples 0 \
    --max_epochs 10000 \
    --checkpoint-epochs 1 \
    --precision 32 \
    --quality high

Please remember I'm not fine-tuning an existing model, but training from scratch. Should I use a pre-trained model at 22050 Hz as a starting point for my 8000 Hz model?

Thank you for your help.

rachel-beeson-connex commented 1 year ago

Hiya, I do some work in TTS and signal processing, and my experience has been like yours. I tried training 8 kHz models on many architectures, and they have a very, very hard time learning a good correspondence between text and spectrogram, or between text and waveform.

I can't really say why for sure, but my personal theory is that the human voice has many resonant/turbulent frequencies above 4 kHz which are important to phone identity. For example, the difference between "s", "th", and "f" is almost entirely high-frequency spectra (see https://home.cc.umanitoba.ca/~robh/howto.html). When we use data sampled at 8 kHz, we lose everything above 4 kHz (the Nyquist frequency). I think without these "clues" from the high frequencies, the model is sort of lost, scared and confused... :)

You can see in this paper on LibriTTS (https://www.arxiv-vanity.com/papers/1904.02882/) that they discuss why they use a higher sampling rate (24 kHz versus 16 kHz for LibriSpeech), and they say that 16 kHz is "too low" to achieve "high quality" TTS. I assume this is a qualitative judgment (they go on to measure WER performance on 16 versus 24 kHz TTS) rather than a claim about whether it's literally possible (we know 16 kHz is definitely possible), but it still seems to me like another clue that 8 kHz is "too low".
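
As a rough sanity check of this, here is a small sketch (assuming scipy is installed; "my_recording.wav" is a placeholder for one of your original higher-rate recordings) that estimates how much of a recording's spectral energy lies above 4 kHz, i.e. everything an 8 kHz sample rate throws away:

import numpy as np
from scipy.io import wavfile
from scipy.signal import welch

# Load the recording (use one sampled above 8 kHz) and mix to mono if needed
rate, audio = wavfile.read("my_recording.wav")
audio = audio.astype(np.float64)
if audio.ndim > 1:
    audio = audio.mean(axis=1)

# Welch power spectral density, then the fraction of power above 4 kHz
freqs, psd = welch(audio, fs=rate, nperseg=2048)
high = psd[freqs >= 4000].sum() / psd.sum()
print(f"fraction of spectral energy above 4 kHz: {high:.1%}")

For fricative-heavy speech you should see a noticeable fraction up there, which is exactly the part an 8 kHz model never gets to see.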

synesthesiam commented 1 year ago

I wonder if the problem is magnified by the use of mel spectrograms?
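
One way to make that concrete (illustrative numbers only; these are not Piper's actual spectrogram settings): at an 8 kHz sample rate, every mel band center sits at or below 4 kHz, so the mel spectrogram carries none of the high-frequency detail mentioned above. A minimal sketch with librosa:

import librosa

# Count how many of 80 mel band centers lie above 4 kHz at each rate
# (80 bands and fmax = Nyquist are assumptions for illustration)
for sr in (8000, 22050):
    freqs = librosa.mel_frequencies(n_mels=80, fmin=0.0, fmax=sr / 2)
    above = (freqs > 4000).sum()
    print(f"sr={sr}: {above} of 80 mel bands above 4 kHz")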

athenasaurav commented 1 year ago

Thank you @rachel-beeson-connex and @synesthesiam for the replies. It sounds like there is no way around this other than synthesizing at a higher sample rate and downsampling the output, but that will increase latency. Any other suggestions?
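
For reference, the downsampling step itself is cheap; a minimal sketch with scipy (filenames are placeholders) that converts a 22,050 Hz Piper output to 8,000 Hz:

import numpy as np
from scipy.io import wavfile
from scipy.signal import resample_poly

rate, audio = wavfile.read("piper_output_22050.wav")
assert rate == 22050

# 22050 Hz * 160 / 441 = 8000 Hz
audio_8k = resample_poly(audio.astype(np.float64), up=160, down=441)
audio_8k = np.clip(audio_8k, -32768, 32767).astype(np.int16)
wavfile.write("piper_output_8000.wav", 8000, audio_8k)

Polyphase resampling like this usually takes a small fraction of the synthesis time, so most of the added latency would come from synthesizing at 22,050 Hz rather than from the conversion itself.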

rachel-beeson-connex commented 1 year ago

I think even if you downsample, the parameter count of the model stays the same - maybe @synesthesiam can confirm? However, if you want a model with fewer parameters, you can use the --quality flag when you're training and set it to 'x-low'. I did try this setting once for training and the results were decent but not perfect. I think it benefits from a longer training time.
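
For instance, the training command from the original post with only the quality setting changed:

python3 -m piper_train \
    --dataset-dir path/to/data/out-dir \
    --accelerator 'gpu' \
    --devices 1 \
    --batch-size 32 \
    --validation-split 0.05 \
    --num-test-examples 0 \
    --max_epochs 10000 \
    --checkpoint-epochs 1 \
    --precision 32 \
    --quality x-low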

athenasaurav commented 1 year ago

Thanks @rachel-beeson-connex for giving it a thought. Yes, I did try that, and it didn't work out.