To train a new voice for English, how many hours of audio do you recommend?

myshell-ai / MeloTTS

High-quality multi-lingual text-to-speech library by MyShell.ai. Support English, Spanish, French, Chinese, Japanese and Korean.

MIT License

4.84k stars, 631 forks

To train a new voice for English, how many hours of audio do you recommend? #194

Open xiao1ongbao opened 1 month ago

xiao1ongbao commented 1 month ago

To train a new voice for English, how many hours of audio do you recommend? Does the training script train from scratch, or does it fine-tune the existing model? Thanks!

iv2985 commented 3 weeks ago

If one takes G_0.pth (the first checkpoint saved during training) and uses it for inference, it speaks English in a young female voice that doesn't match the audio clips being trained on. So it seems the training script fine-tunes from an existing pretrained model rather than training from scratch.

As for the duration of audio, I have gotten reasonable results with only 5 minutes of audio and 1k epochs, using 48 kHz WAV files. Most people use 1+ hours, however.
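To check where a dataset falls between the 5 minute and 1+ hour figures above, something like the following may help. It is only a rough sketch, not part of MeloTTS: the function name `summarize_wav_dir` is made up for illustration, and it assumes the clips are uncompressed PCM WAV files, which is all that Python's standard `wave` module reads.

```python
import wave
from pathlib import Path


def summarize_wav_dir(directory):
    """Return (total_seconds, sample_rates) for all .wav files under directory.

    Hypothetical helper for illustration; assumes plain PCM WAV files.
    """
    total_seconds = 0.0
    sample_rates = set()
    for path in sorted(Path(directory).rglob("*.wav")):
        with wave.open(str(path), "rb") as wav_file:
            rate = wav_file.getframerate()
            # Duration of one clip = number of frames / frames per second.
            total_seconds += wav_file.getnframes() / rate
            sample_rates.add(rate)
    return total_seconds, sample_rates


if __name__ == "__main__":
    import sys

    target = sys.argv[1] if len(sys.argv) > 1 else "."
    seconds, rates = summarize_wav_dir(target)
    print(f"total audio: {seconds / 60:.1f} min, sample rates: {sorted(rates)}")
```

Run it with the dataset folder as the only argument. If more than one sample rate is reported, the clips are not all in the same format (for example, not all 48 kHz), which is worth knowing before training.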