EmreOzkose closed this issue 11 months ago
Hi,
If you wish to implement other languages in the current version, you will need to restart the process outlined in the paper from the beginning.
Assuming you want to expand to Korean:
If you are curious about the training method, we followed the Grad-TTS training approach. The main difference is that while Grad-TTS used a diffusion prior for the encoder output, we used a zero mean prior.
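To make the prior difference concrete, here is a minimal sketch (not the repository's actual code): in Grad-TTS the reverse diffusion starts from a Gaussian centered on the aligned encoder output, whereas a zero-mean prior starts the reverse process from a standard Gaussian.

```python
import torch

def grad_tts_prior_sample(mu: torch.Tensor) -> torch.Tensor:
    # Grad-TTS style: reverse diffusion starts from N(mu, I),
    # where mu is the (aligned) text-encoder output.
    return mu + torch.randn_like(mu)

def zero_mean_prior_sample(shape) -> torch.Tensor:
    # Zero-mean prior (as described in the comment above):
    # reverse diffusion starts from N(0, I), independent of the encoder.
    return torch.randn(shape)
```

With the zero-mean prior the encoder output no longer has to match the mel spectrogram directly, so there is no Grad-TTS-style prior loss tying the encoder to the target distribution.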
The unit encoder directly replaces the text encoder and is trained alongside the decoder as part of the TTS model training objective. During this phase, only the unit encoder is trained while the decoder is frozen.
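A minimal PyTorch sketch of this training setup, with hypothetical stand-in modules (the real encoder/decoder classes in the repo are different): the decoder's parameters are frozen, so only the unit encoder is updated by the TTS objective.

```python
import torch
import torch.nn as nn

# Stand-in modules for illustration only.
unit_encoder = nn.Sequential(nn.Embedding(200, 192), nn.Linear(192, 80))
decoder = nn.Linear(80, 80)  # stands in for the pretrained diffusion decoder

# Freeze the decoder: only the unit encoder receives gradient updates.
for p in decoder.parameters():
    p.requires_grad = False

optimizer = torch.optim.Adam(
    (p for p in unit_encoder.parameters() if p.requires_grad), lr=1e-4
)

units = torch.randint(0, 200, (4, 50))   # discrete unit IDs, e.g. from HuBERT
target = torch.randn(4, 50, 80)          # mel-spectrogram target
pred = decoder(unit_encoder(units))
loss = nn.functional.mse_loss(pred, target)
loss.backward()
optimizer.step()
```

Because `requires_grad` is False on the decoder, no gradients are accumulated there and the optimizer only sees the encoder's parameters.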
Afterwards, you can follow the fine-tuning and inference steps in our repository. Note, however, that while units are suitable for fine-tuning, they may cause pronunciation loss in actual voice conversion. If you want to improve pronunciation in voice conversion, we therefore recommend training a ContentVec encoder with the same method used for the unit encoder. ContentVec is a self-supervised representation that carries more pronunciation information than units.
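The main structural change when swapping in a ContentVec encoder is the input frontend: discrete unit IDs go through an embedding table, while ContentVec features are continuous vectors and go through a projection. A sketch, with the feature dimension (768) assumed from typical HuBERT-base-style checkpoints rather than taken from the repo:

```python
import torch
import torch.nn as nn

HIDDEN = 192  # hypothetical encoder hidden size

# Discrete units -> hidden: lookup table over the unit vocabulary.
unit_frontend = nn.Embedding(200, HIDDEN)
# Continuous ContentVec features -> hidden: linear projection.
contentvec_frontend = nn.Linear(768, HIDDEN)

units = torch.randint(0, 200, (1, 50))   # (batch, frames) of unit IDs
feats = torch.randn(1, 50, 768)          # (batch, frames, dim) SSL features

# Both frontends produce the same (batch, frames, HIDDEN) shape,
# so the rest of the encoder/decoder stack can stay unchanged.
h_units = unit_frontend(units)
h_cv = contentvec_frontend(feats)
```

Everything downstream of the frontend can then be trained exactly as described above for the unit encoder, with the decoder frozen.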
Thank you for the detailed explanation.
Hi, did you try training a different language for voice conversion from the pretrained models? Can you give some hints on this issue: which modules should be re-trained?