rhasspy / piper

A fast, local neural text to speech system
https://rhasspy.github.io/piper-samples/
MIT License

Questions about training and quality #420

Open bzp83 opened 3 months ago

Could someone help clarify a few questions for me?

What exactly is the difference between medium and high quality?

I have a fairly extensive dataset: approximately 100 hours of high-quality stereo WAV audio at 44100 Hz. I also have a second dataset with only 3 hours of audio, also in high quality. What would be the benefit of training on these datasets at high quality instead of medium? Training at high quality would take significantly longer and would also slow down inference, so could I instead train at medium quality while keeping the audio at 44100 Hz?

Either way, I need to create synthetic voices from these 2 datasets. What would be the best method? Should I train on the larger dataset first and then on the smaller one, starting from the larger dataset's checkpoint? Or should I do it the other way around?

Would it be possible to take a checkpoint from a training run on the 100 hours of audio and use it to fine-tune on smaller datasets while still maintaining quality? In other words, could I simply change the "sound of the voice" using a smaller dataset?
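
To make the question concrete, here is a minimal sketch of the two-stage workflow I have in mind: train a base model on the 100-hour dataset, then resume from its checkpoint on the 3-hour dataset. It only builds and prints the commands. The flags are the ones I understand from Piper's training guide (`--dataset-dir`, `--quality`, `--max_epochs`, `--resume_from_checkpoint`), and the directory names, checkpoint path, and epoch counts are hypothetical placeholders, so please correct me if any of this is wrong.

```python
import shlex


def piper_train_cmd(dataset_dir, quality="medium", max_epochs=1000, resume_from=None):
    """Build (but do not run) a piper_train command line.

    Flag names are taken from my reading of Piper's training guide and
    should be double-checked against the installed version.
    """
    cmd = [
        "python3", "-m", "piper_train",
        "--dataset-dir", str(dataset_dir),
        "--accelerator", "gpu",
        "--devices", "1",
        "--max_epochs", str(max_epochs),
        "--quality", quality,
    ]
    if resume_from is not None:
        # Fine-tuning: start from an existing checkpoint instead of from scratch.
        cmd += ["--resume_from_checkpoint", str(resume_from)]
    return cmd


# Stage 1: base model on the large dataset (hypothetical path and epoch count).
base_cmd = piper_train_cmd("data/voice_100h", quality="medium", max_epochs=1000)

# Stage 2: fine-tune on the small dataset from the stage 1 checkpoint.
# The checkpoint path is a placeholder. My assumption is that the quality
# must match the checkpoint and that max_epochs must exceed the epoch the
# checkpoint stopped at, otherwise no further training happens.
finetune_cmd = piper_train_cmd(
    "data/voice_3h",
    quality="medium",
    max_epochs=1500,
    resume_from="checkpoints/voice_100h_last.ckpt",
)

print(shlex.join(base_cmd))
print(shlex.join(finetune_cmd))
```

The question is whether stage 2 keeps the quality learned in stage 1 while taking on the voice of the smaller dataset, or whether the order should be reversed.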

Thank you very much for the help and thank you for this fantastic project.