tuanh123789 / Train_Hifigan_XTTS

This is an implementation for training the HiFi-GAN part of the XTTSv2 model using Coqui/TTS.

can finetuning hifigan reduce "hoarseness"? #12

Open kunibald413 opened 1 week ago

kunibald413 commented 1 week ago

The issue I'm experiencing is very similar to the one described in coqui-ai/TTS issue #3774 ("hoarseness" when reaching for higher notes). You can listen to samples there: https://github.com/coqui-ai/TTS/issues/3774

I fine-tuned XTTS v2 (GPT) on a seemingly clean dataset for a single female speaker (~15 minutes of total audio). The results are mostly clean, but sometimes the model struggles on what seems like "higher notes" or "higher pitch"; I'm not exactly sure how to describe it.

I've also noticed this issue with the official checkpoint, even without any fine-tuning, but it happens far more often with female speakers than with male.

I'm unsure how to label this issue. Do you have any ideas on how to debug it? Could fine-tuning the HiFi-GAN on the voice data help resolve it? What's your take on this?

Thank you for your time!

tuanh123789 commented 1 week ago

Not sure, but I think 15 minutes is not enough to fine-tune HiFi-GAN. You can try, though.
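
Since the amount of audio is the sticking point, it may help to confirm how much audio the dataset actually contains before trying. A minimal sketch, assuming the clips are PCM WAV files in one folder; the function name `total_wav_minutes` and the example path are made up for illustration and are not part of this repo or Coqui/TTS:

```python
import wave
from pathlib import Path


def total_wav_minutes(dataset_dir):
    """Sum the duration of every .wav file under dataset_dir, in minutes.

    Assumption: files are PCM WAV. The stdlib `wave` module raises
    wave.Error on other encodings (e.g. 32-bit float).
    """
    total_seconds = 0.0
    for path in sorted(Path(dataset_dir).rglob("*.wav")):
        with wave.open(str(path), "rb") as wav:
            # frames / sample rate = clip length in seconds
            total_seconds += wav.getnframes() / wav.getframerate()
    return total_seconds / 60.0


# Example call (hypothetical path):
# print(f"{total_wav_minutes('my_dataset/wavs'):.1f} minutes of audio")
```

This only reports the total duration; it does not tell you how much audio HiFi-GAN fine-tuning needs.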