gmltmd789 / UnitSpeech

An official implementation of "UnitSpeech: Speaker-adaptive Speech Synthesis with Untranscribed Data"
https://unitspeech.github.io/

Other languages #6

Closed EmreOzkose closed 7 months ago

EmreOzkose commented 7 months ago

Hi, did you try training a different language for voice conversion, starting from the pretrained models? Can you give some hints on which modules should be re-trained?

gmltmd789 commented 7 months ago

Hi,

If you wish to support other languages in the current version, you will need to repeat the process outlined in the paper from the beginning.

Assuming you want to expand to Korean:

  1. First, train a backbone TTS model using Korean multi-speaker data. As in the paper, the encoder should be speaker-independent, with the speaker condition only entering the decoder. For the speaker condition, you can use an existing speaker encoder or WavLM, which we used.

If you are curious about the training method, we followed the Grad-TTS training approach. The main difference is that while Grad-TTS uses the encoder output as the mean of the diffusion prior, we used a zero-mean prior.
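The difference between the two priors can be written out as follows. This is a sketch that assumes the identity-covariance form of the Grad-TTS forward SDE, with $\mu$ denoting the aligned encoder output and $\beta_t$ the noise schedule; it is not copied from the repository.

```latex
% Grad-TTS: data-dependent prior, centered on the aligned encoder output \mu
dX_t = \tfrac{1}{2}\,(\mu - X_t)\,\beta_t\,dt + \sqrt{\beta_t}\,dW_t,
\qquad X_T \approx \mathcal{N}(\mu, I)

% Zero-mean prior: standard normal, independent of the encoder output
dX_t = -\tfrac{1}{2}\,X_t\,\beta_t\,dt + \sqrt{\beta_t}\,dW_t,
\qquad X_T \approx \mathcal{N}(0, I)
```

With the zero-mean prior, the encoder output no longer sets the starting point of the reverse process and is passed to the decoder only as a condition.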

  2. Then, train a unit encoder that can replace the text encoder for voice conversion and fine-tuning using untranscribed speech. Essentially, units are extracted using HuBERT and the k-means clustering algorithm. For extraction, you can refer to the usage of unit_extractor in the repository's scripts/finetune.py.
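As a rough illustration of what "units" means here, the sketch below assigns each frame-level feature vector to its nearest k-means centroid and then merges consecutive repeats into (unit, duration) pairs. The toy 2-D features and centroids are made up for illustration, all function names are hypothetical, and merging repeats into durations is an assumption about how the units are consumed downstream; in practice the features would come from HuBERT and the centroids from a trained k-means model, as done by unit_extractor in scripts/finetune.py.

```python
from itertools import groupby


def nearest_centroid(frame, centroids):
    """Return the index of the centroid closest to `frame` (squared Euclidean distance)."""
    def sq_dist(centroid):
        return sum((f - c) ** 2 for f, c in zip(frame, centroid))
    return min(range(len(centroids)), key=lambda k: sq_dist(centroids[k]))


def frames_to_units(frames, centroids):
    """Quantize every frame to the index of its nearest centroid (one unit per frame)."""
    return [nearest_centroid(frame, centroids) for frame in frames]


def collapse_repeats(units):
    """Merge runs of identical consecutive units into (unit, duration) pairs."""
    return [(unit, len(list(run))) for unit, run in groupby(units)]


# Toy stand-ins: real features would be HuBERT frame embeddings,
# and real centroids would come from k-means fitted on those embeddings.
centroids = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
frames = [[0.1, -0.1], [0.2, 0.0], [0.9, 1.2], [3.8, 0.1], [4.1, -0.2]]

units = frames_to_units(frames, centroids)
unit_durations = collapse_repeats(units)
print(units)           # [0, 0, 1, 2, 2]
print(unit_durations)  # [(0, 2), (1, 1), (2, 2)]
```

For a new language, the k-means centroids would presumably need to be fitted on features extracted from speech in that language.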

The unit encoder directly replaces the text encoder and is trained with the same objective as the TTS model, with its output passed through the decoder. During this phase, only the unit encoder's parameters are updated; the decoder is frozen.
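A minimal PyTorch sketch of this freezing setup is shown below. The two modules are tiny hypothetical stand-ins (an embedding table and a linear layer), not the repository's actual unit encoder and diffusion decoder, and the MSE loss is only a placeholder for the real diffusion training objective.

```python
import torch
from torch import nn

# Hypothetical stand-ins for the real modules; sizes are arbitrary.
unit_encoder = nn.Embedding(1000, 16)  # maps unit indices to hidden vectors
decoder = nn.Linear(16, 80)            # pretend this is the pretrained decoder

# Freeze the decoder: no gradients are stored for it and it stays in eval mode.
for param in decoder.parameters():
    param.requires_grad_(False)
decoder.eval()

# Only the unit encoder's parameters are handed to the optimizer.
optimizer = torch.optim.Adam(unit_encoder.parameters(), lr=1e-4)

# Snapshots used to check which weights actually change.
decoder_before = [p.clone() for p in decoder.parameters()]
encoder_before = unit_encoder.weight.detach().clone()

# One training step on random toy data (placeholder loss, not the real objective).
units = torch.randint(0, 1000, (2, 50))
target = torch.randn(2, 50, 80)
loss = nn.functional.mse_loss(decoder(unit_encoder(units)), target)

optimizer.zero_grad()
loss.backward()  # gradients flow through the frozen decoder into the encoder
optimizer.step()
```

Gradients still pass through the frozen decoder during the backward pass, so the unit encoder learns to produce outputs the existing decoder can use, while the decoder weights stay unchanged.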

Afterwards, you can follow the fine-tuning and inference steps in our repository. However, note that while units are suitable for fine-tuning, they may degrade pronunciation when used for actual voice conversion. Therefore, if you want to improve pronunciation in voice conversion, we recommend training a ContentVec encoder using the same method as for the unit encoder. ContentVec is a self-supervised representation that contains more pronunciation information than units.

EmreOzkose commented 7 months ago

Thank you for the detailed explanation.