Train voice having 44Khz sampling rate

rhasspy / piper

A fast, local neural text to speech system

https://rhasspy.github.io/piper-samples/

MIT License

Train voice having 44Khz sampling rate #604

Open donlk opened 2 months ago

donlk commented 2 months ago

Hi! I have approximately 1.5 hours of voice audio at 44 kHz and would like to train a usable model from it. I don't want to fine-tune from the pre-trained checkpoints, as they are all 22 kHz and sound muddy. I tried training from scratch with sampling_rate set to 44100. Training reached 2000 epochs, but the inferred audio was far too fast and skipped words.

What should I modify or patch in to make this work?

Thanks!

agonzalezd commented 2 months ago

I suggest resampling your data to 22050 Hz. You can use ffmpeg to do so.
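A minimal sketch of that resampling step, assuming ffmpeg is installed and on PATH. The helper name `ffmpeg_resample_cmd`, the directory names, and the mono downmix (`-ac 1`) are illustrative assumptions, not something taken from the Piper docs; `-i`, `-ar` and `-ac` are standard ffmpeg options.

```python
from pathlib import Path


def ffmpeg_resample_cmd(src: Path, dst: Path, rate: int = 22050) -> list[str]:
    """Build an ffmpeg command that resamples `src` to `rate` Hz (hypothetical helper)."""
    return [
        "ffmpeg",
        "-i", str(src),     # input file
        "-ar", str(rate),   # output sample rate in Hz
        "-ac", "1",         # downmix to mono (assumption: single-channel training data)
        str(dst),           # output file
    ]


# Example paths are placeholders.
cmd = ffmpeg_resample_cmd(Path("wavs_44k/0001.wav"), Path("wavs_22k/0001.wav"))
print(" ".join(cmd))

# To actually run the conversion (requires ffmpeg):
# import subprocess
# subprocess.run(cmd, check=True)
```

Building the argument list separately from running it makes it easy to inspect the command before converting a whole dataset.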

donlk commented 2 months ago

I would rather avoid that if possible, because of the large quality loss.

Luke100000 commented 1 month ago

Make sure the sample rate is set correctly everywhere, not just for training but also for inference: https://github.com/search?q=repo%3Arhasspy%2Fpiper%2022050&type=code
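One part of that check can be done on the data side. The sketch below uses only the Python standard library to list WAV files whose header sample rate differs from the rate you intend to train and infer at. The function name `mismatched_wavs` and the directory name are hypothetical, and this does not inspect any Piper config or source file; those still have to be checked by hand via the search above.

```python
import wave
from pathlib import Path


def mismatched_wavs(wav_dir: Path, expected_rate: int) -> list[tuple[str, int]]:
    """Return (file name, header sample rate) for each WAV not at `expected_rate`."""
    bad = []
    for path in sorted(wav_dir.glob("*.wav")):
        with wave.open(str(path), "rb") as wav:
            rate = wav.getframerate()  # sample rate stored in the WAV header
        if rate != expected_rate:
            bad.append((path.name, rate))
    return bad


# Usage (placeholder directory name):
# for name, rate in mismatched_wavs(Path("wavs"), 44100):
#     print(f"{name}: {rate} Hz")
```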

Other than that, my guess is that you would need to adapt the decoder parameters here: https://github.com/rhasspy/piper/blob/master/src/python/piper_train/vits/config.py#L30
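To illustrate what adapting the decoder could involve: in VITS-style models the decoder upsamples each spectrogram frame by the product of its `upsample_rates`, and that product has to equal the STFT `hop_length` for the generated waveform to line up with the audio. The sketch below only does that arithmetic. The 22.05 kHz values (`hop_length=256`, `upsample_rates=(8, 8, 2, 2)`) are assumed to match the defaults in the linked config and should be verified there; the 44.1 kHz values are a hypothetical choice that keeps the frame duration the same, not a tested recipe.

```python
import math


def decoder_matches_hop(upsample_rates: tuple[int, ...], hop_length: int) -> bool:
    """True if the decoder's total upsampling factor equals the STFT hop length."""
    return math.prod(upsample_rates) == hop_length


def frame_ms(hop_length: int, sample_rate: int) -> float:
    """Duration of one spectrogram frame in milliseconds."""
    return 1000.0 * hop_length / sample_rate


# Assumed 22.05 kHz defaults (verify against config.py).
print(decoder_matches_hop((8, 8, 2, 2), 256), frame_ms(256, 22050))

# Same hop at 44.1 kHz: still consistent, but each frame covers half the time.
print(decoder_matches_hop((8, 8, 2, 2), 256), frame_ms(256, 44100))

# Hypothetical 44.1 kHz setting: double the hop and the total upsampling factor.
print(decoder_matches_hop((8, 8, 4, 2), 512), frame_ms(512, 44100))
```

If the upsample rates change, the matching upsample kernel sizes and the STFT window and filter lengths would probably need to change with them; that part is a guess and is not covered by this check.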