gmltmd789 / UnitSpeech

An official implementation of "UnitSpeech: Speaker-adaptive Speech Synthesis with Untranscribed Data"
https://unitspeech.github.io/

How to retrain the provided pts, like the text encoder? #1

Closed zdj97 closed 1 year ago

zdj97 commented 1 year ago

Hi, thanks for your work on TTS! I would like to know how to train all the pts from scratch, such as the text encoder and the pretrained decoder. By the way, do these scripts support multi-speaker TTS and VC?

gmltmd789 commented 1 year ago

Hello,

Currently, the repository provides the following items:

  1. Code that allows for inference.
  2. Checkpoints that enable inference.
  3. A Google Colaboratory link that facilitates easy inference.

As of now, the training code required for each checkpoint is not available. (We are planning to make it public, but the exact timing is yet to be determined.)

Furthermore, the current code supports multi-speaker TTS and VC inference for English, so I recommend taking a look at the Colab link for a quick exploration.

If you need any further assistance, feel free to ask!

zdj97 commented 1 year ago

Thanks! I will follow this repo.

zdj97 commented 1 year ago

And another question. The package s3prl needs omegaconf>=2.1.1, but fairseq 0.12.2 needs omegaconf<2.1. I cannot resolve this conflict.

gmltmd789 commented 1 year ago

There is a dependency issue with the s3prl package.

Please add "--no-deps" to the installation command to ignore the dependencies.

pip install --no-deps s3prl==0.4.10
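
As a quick sanity check afterwards (just an illustrative snippet, not part of the repo), something like the following should run cleanly even though pip's resolver prints a conflict warning:

    # Confirm the packages are present and importable despite pip's
    # complaint about the omegaconf version conflict.
    from importlib.metadata import version, PackageNotFoundError

    for pkg in ("s3prl", "fairseq", "omegaconf"):
        try:
            print(pkg, version(pkg))
        except PackageNotFoundError:
            print(pkg, "not installed")

    import s3prl    # should import without errors
    import fairseq  # should import without errors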

zdj97 commented 1 year ago

Yes, I followed the README, but it seems the incompatibility warning still exists. However, it does not affect inference. And thanks for your help!

zdj97 commented 1 year ago

Hi, I tried this repo with several English speakers. The performance is excellent, not only for TTS but also for VC! Very cool! However, I still want to know how to train the pts from scratch, since I see great potential for this repo in many other fields. Sorry for the questions below:

  1. This repo uses Grad-TTS as a backbone, but I could not find the text_encoder and duration_predictor in the Grad-TTS repo (https://github.com/huawei-noah/Speech-Backbones/blob/main/Grad-TTS/train_multi_speaker.py), as I have not used Grad-TTS before.
  2. Could the unit extractor be fine-tuned on other languages or trained from scratch?
  3. Could you offer some steps for training the pts so that I could train them myself?

zdj97 commented 1 year ago

I studied Grad-TTS and found the duration predictor.

gmltmd789 commented 1 year ago

Hi,

  1. We separated the text encoder, duration predictor, and decoder to make inference more convenient in our model. Grad-TTS does not split the model this way, so its text encoder and duration predictor are contained inside the GradTTS class. Please refer to the model/tts.py file in the Grad-TTS repository for more information (see the schematic sketch after this list)!

  2. As for the unit extractor, we used the multilingual HuBERT introduced by Facebook. It was trained on English, Spanish, and French speech, and we expect it to work reasonably well for other languages too. So you can either fine-tune it on your target language or train it from scratch. However, we recommend first trying the pre-trained unit extractor checkpoint without any additional training on the unseen target language; if you are not satisfied with the results, then consider fine-tuning (or training from scratch) the unit extractor for that language (see the extraction sketch after this list).

  3. The backbone TTS model was trained for approximately 1.8M iterations with batch size 32, while the unit encoder and ContentVec encoder were trained for about 0.8M iterations with batch size 32. For these two encoders, however, there was no significant performance difference between models trained with fewer iterations and those trained for the full 0.8M.
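
Regarding question 1, roughly this is the module layout (a simplified illustration from memory, not the actual Grad-TTS code; the real classes in model/text_encoder.py and model/tts.py take many more arguments):

    import torch
    import torch.nn as nn

    class DurationPredictor(nn.Module):
        def __init__(self, channels):
            super().__init__()
            self.proj = nn.Conv1d(channels, 1, kernel_size=1)

        def forward(self, x):
            # x: (batch, channels, text_len) -> one log-duration per token
            return self.proj(x)

    class TextEncoder(nn.Module):
        # In Grad-TTS the duration predictor is a submodule of the text encoder.
        def __init__(self, n_symbols=148, channels=192):  # sizes are illustrative
            super().__init__()
            self.embed = nn.Embedding(n_symbols, channels)
            self.duration_predictor = DurationPredictor(channels)

        def forward(self, tokens):
            h = self.embed(tokens).transpose(1, 2)  # (batch, channels, text_len)
            return h, self.duration_predictor(h)

    class GradTTS(nn.Module):
        # One class bundles everything; UnitSpeech instead ships the encoder,
        # duration predictor, and decoder as separate checkpoints.
        def __init__(self):
            super().__init__()
            self.encoder = TextEncoder()
            self.decoder = nn.Identity()  # stand-in for the diffusion decoder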
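
Regarding question 2, unit extraction with mHuBERT typically follows the k-means recipe from Facebook's textless-NLP release. The sketch below is illustrative: the checkpoint/k-means file names and the feature layer come from that public release, not from this repo, so please verify them before use:

    import joblib
    import torch
    import torchaudio
    from fairseq import checkpoint_utils

    # Placeholder file names following Facebook's public mHuBERT (En/Es/Fr) release.
    HUBERT_CKPT = "mhubert_base_vp_en_es_fr_it3.pt"
    KMEANS_PATH = "mhubert_base_vp_en_es_fr_it3_L11_km1000.bin"

    models, _, _ = checkpoint_utils.load_model_ensemble_and_task([HUBERT_CKPT])
    hubert = models[0].eval()
    kmeans = joblib.load(KMEANS_PATH)

    wav, sr = torchaudio.load("sample.wav")  # expects 16 kHz mono audio
    with torch.no_grad():
        feats, _ = hubert.extract_features(
            source=wav, padding_mask=None, mask=False, output_layer=11
        )
    # One discrete unit per frame; these are the units the unit encoder consumes.
    units = kmeans.predict(feats.squeeze(0).numpy())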

zdj97 commented 1 year ago

Yes! Thanks for your help.