rhasspy / piper

A fast, local neural text to speech system
https://rhasspy.github.io/piper-samples/
MIT License

Fine-tuning an English checkpoint (ckpt) using a Chinese speech dataset #613

Open JeffTSaoO opened 1 month ago

JeffTSaoO commented 1 month ago

Hi, I have about 85 hours of Chinese speech audio at 44,100 Hz that I am using to fine-tune the en-us/lessac/medium .ckpt, but the results are not good. My loss_gen_all looks very high, while loss_disc_all looks normal.

Questions:

Sample Rate Conversion: Is it advisable to convert the sample rate from 44,100 Hz to 22,050 Hz before fine-tuning? Could this conversion be contributing to the high loss_gen_all?

Language Adaptation: Since I am fine-tuning an English model with Chinese data, are there specific configurations or adjustments you recommend to improve performance?

Model Compatibility: Are there any known issues or limitations when fine-tuning the en-us/lessac/medium .ckpt model with a non-English dataset?

Any guidance or suggestions you could provide would be greatly appreciated.

Thank you for your time and assistance.
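
For reference on the sample-rate question, here is a minimal sketch of a batch downsampling step. It assumes the checkpoint expects 22,050 Hz mono audio (worth confirming against the checkpoint's own config) and that ffmpeg is installed. The function names and the directory layout are hypothetical, not part of Piper.

```python
import subprocess
from pathlib import Path

# Assumption: the target checkpoint was trained on 22,050 Hz mono audio.
TARGET_RATE = 22050


def build_resample_cmd(src, dst, rate=TARGET_RATE):
    """Return an ffmpeg command that writes a mono copy of src at the given rate."""
    # -y overwrites dst, -ar sets the output sample rate, -ac 1 downmixes to mono.
    return ["ffmpeg", "-y", "-i", str(src), "-ar", str(rate), "-ac", "1", str(dst)]


def resample_dir(src_dir, dst_dir, rate=TARGET_RATE, dry_run=True):
    """Resample every .wav in src_dir into dst_dir.

    With dry_run=True nothing is executed; the commands are only returned,
    so they can be inspected before touching the dataset.
    """
    dst = Path(dst_dir)
    cmds = []
    for wav in sorted(Path(src_dir).glob("*.wav")):
        cmd = build_resample_cmd(wav, dst / wav.name, rate)
        cmds.append(cmd)
        if not dry_run:
            dst.mkdir(parents=True, exist_ok=True)
            subprocess.run(cmd, check=True)
    return cmds
```

Calling `resample_dir("wavs_44k", "wavs_22k", dry_run=False)` would write the converted files to a separate directory and leave the 44,100 Hz originals untouched.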

Kracozebr commented 3 days ago

What config are you using? Maybe you are using English phonemes instead of Chinese ones?
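
One way to check this is to print the language-related fields from the training config. The sketch below assumes the config is a JSON file with `audio.sample_rate`, `espeak.voice`, `phoneme_type`, and `num_symbols` keys, as in published Piper voice configs. The helper names are hypothetical, and the expected voice prefix (for example `cmn` for Mandarin in eSpeak NG) is something you supply yourself.

```python
import json


def summarize_voice_config(config):
    """Pull out the fields that matter when reusing a checkpoint for another language."""
    return {
        "sample_rate": config.get("audio", {}).get("sample_rate"),
        "espeak_voice": config.get("espeak", {}).get("voice"),
        "phoneme_type": config.get("phoneme_type"),
        "num_symbols": config.get("num_symbols"),
    }


def load_and_summarize(path):
    """Read a config JSON file from disk and summarize it."""
    with open(path, encoding="utf-8") as f:
        return summarize_voice_config(json.load(f))


def language_mismatch(summary, expected_prefix):
    """True if the configured eSpeak voice does not start with the expected prefix."""
    voice = summary.get("espeak_voice") or ""
    return not voice.lower().startswith(expected_prefix.lower())
```

If `language_mismatch(load_and_summarize("config.json"), "cmn")` returns `True`, the dataset was probably phonemized with a non-Mandarin voice, which would fit the symptoms described.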