The issue I'm experiencing is very similar to the one described in the issue linked below ("hoarseness" when reaching for higher notes). You can listen to samples there:
https://github.com/coqui-ai/TTS/issues/3774
I fine-tuned XTTS v2 (the GPT part) on a seemingly clean dataset for a single female speaker (~15 minutes of total audio). The results are mostly clean, but sometimes the model struggles on what seems like "higher notes" or "higher pitch". I'm not exactly sure how to describe it.
I've also noticed this issue with the official checkpoint, even without any fine-tuning, but it happens far more often with female speakers than with male speakers.
I'm unsure how to label this issue. Do you have any ideas on how to debug it? Could fine-tuning the HiFi-GAN on the voice data help resolve it? What's your take on this?
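To make "higher pitch" more concrete, one check I could run is to estimate the fundamental frequency (F0) per frame and see whether the hoarse segments line up with the highest F0 values. Below is a minimal sketch, assuming plain autocorrelation on mono float samples. The function name `estimate_f0` and the 75-500 Hz search range are my own illustrative choices and are not part of XTTS or the TTS library.

```python
import math


def estimate_f0(frame, sample_rate, fmin=75.0, fmax=500.0):
    """Rough F0 estimate (Hz) for one frame via autocorrelation.

    Hypothetical helper for illustration only.
    Returns None if the frame is silent or too short for the search range.
    """
    n = len(frame)
    min_lag = int(sample_rate / fmax)  # shortest period considered
    max_lag = int(sample_rate / fmin)  # longest period considered
    if min_lag < 1 or n <= max_lag:
        return None

    # Remove the DC offset so it does not dominate the correlation.
    mean = sum(frame) / n
    x = [s - mean for s in frame]
    energy = sum(s * s for s in x)
    if energy == 0.0:
        return None

    # Pick the lag with the strongest positive correlation.
    best_lag, best_score = None, 0.0
    for lag in range(min_lag, max_lag + 1):
        corr = sum(x[i] * x[i + lag] for i in range(n - lag))
        score = corr / energy
        if score > best_score:
            best_lag, best_score = lag, score

    if best_lag is None:
        return None
    return sample_rate / best_lag


if __name__ == "__main__":
    # Synthetic 220 Hz tone as a sanity check (not real speech).
    sr = 16000
    tone = [math.sin(2 * math.pi * 220 * i / sr) for i in range(1024)]
    print(estimate_f0(tone, sr))
```

This is only a rough estimator (the lag resolution is coarse and octave errors are possible on real speech), so I would treat it as a way to flag suspicious frames rather than as an accurate pitch tracker.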
Thank you for your time!