Tested on Ubuntu 20.04.5 LTS, Python 3.8, Anaconda (2023.03-1) environment
First, install the packages required by the IPA phonemizer.
sudo apt-get install espeak=1.48.04+dfsg-8build1 espeak-ng=1.50+dfsg-6
If you cannot install these specific versions of espeak and espeak-ng (e.g., on Ubuntu 18.04 or earlier), install the versions available for your distribution.
Note: If you have a different version of espeak-ng, the output of phonemizing text may vary, which can affect pronunciation accuracy.
After that, create a conda environment and install the unitspeech package along with the package required for extracting speaker embeddings.
conda create -n unitspeech python=3.8
conda activate unitspeech
git clone https://github.com/gmltmd789/UnitSpeech.git
cd UnitSpeech
pip install -e .
pip install --no-deps s3prl==0.4.10
We provide the pretrained models.

| File Name | Usage |
|---|---|
| contentvec_encoder.pt | Used for any-to-any voice conversion tasks. |
| unit_encoder.pt | Used for fine-tuning and unit-based speech synthesis tasks (e.g., adaptive speech synthesis for speech-to-unit translation). |
| text_encoder.pt | Used for adaptive text-to-speech tasks. |
| duration_predictor.pt | Used for adaptive text-to-speech tasks. |
| pretrained_decoder.pt | Used for all adaptive speech synthesis tasks. |
| speaker_encoder.pt | Used for extracting speaker embeddings. |
| bigvgan.pt | Vocoder checkpoint. |
| bigvgan-config.json | Configuration file for the vocoder. |
After downloading the files, please arrange them in the following structure.
UnitSpeech/
  unitspeech/
    checkpoints/
      contentvec_encoder.pt
      duration_predictor.pt
      pretrained_decoder.pt
      text_encoder.pt
      unit_encoder.pt
      ...
    speaker_encoder/
      checkpts/
        speaker_encoder.pt
        ...
    vocoder/
      checkpts/
        bigvgan.pt
        bigvgan-config.json
        ...
    ...
  ...
The decoder is fine-tuned on the target speaker's voice using the unit encoder. We recommend a reference English speech sample of at least 5–10 seconds.
python scripts/finetune.py \
--reference_path REFERENCE_SPEECH_PATH \
--output_decoder_path FILEPATH1/FINETUNED_DECODER.pt
Running this script saves your personalized decoder as "FILEPATH1/FINETUNED_DECODER.pt".
With the fine-tuned decoder, you can perform adaptive text-to-speech and any-to-any voice conversion, as described below.
By default, fine-tuning runs in fp32 using the Adam optimizer with a learning rate of 2e-5 for 500 iterations.
You can adjust these settings through the provided arguments (--fp16_run, --learning_rate, --n_iters).
For speakers with distinctive voices, increasing the number of fine-tuning iterations can help achieve better results.
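For intuition about what the defaults above mean, here is a minimal pure-Python sketch of an Adam optimization loop with the same learning rate (2e-5) and iteration count (500). The scalar toy objective and all names are hypothetical stand-ins; the actual fine-tuning optimizes the decoder's diffusion loss, not this function.

```python
# Toy Adam loop mirroring the default fine-tuning settings (lr=2e-5, 500 iters).

def adam_update(w, grad, m, v, t, lr=2e-5, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam step for a single scalar parameter."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad * grad
    m_hat = m / (1 - b1 ** t)      # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)      # bias-corrected second moment
    return w - lr * m_hat / (v_hat ** 0.5 + eps), m, v

target = 1.0                       # toy optimum (hypothetical)
w, m, v = 0.0, 0.0, 0.0
losses = []
for t in range(1, 501):            # 500 iterations, matching the --n_iters default
    losses.append((w - target) ** 2)
    grad = 2.0 * (w - target)
    w, m, v = adam_update(w, grad, m, v, t)

print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

Note how slowly the loss moves at lr=2e-5: this is why speakers with distinctive voices may benefit from more iterations rather than a larger learning rate.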
# script for adaptive text-to-speech
python scripts/text_to_speech.py \
--text "TEXT_TO_GENERATE" \
--decoder_path FILEPATH1/FINETUNED_DECODER.pt \
--generated_sample_path FILEPATH2/PATH_TO_SAVE_SYNTHESIZED_SPEECH.wav
# script for any-to-any voice conversion
python scripts/voice_conversion.py \
--source_path SOURCE_SPEECH_PATH_TO_CONVERT.wav \
--decoder_path FILEPATH1/FINETUNED_DECODER.pt \
--generated_sample_path FILEPATH2/PATH_TO_SAVE_SYNTHESIZED_SPEECH.wav
You can adjust the number of diffusion steps, the text gradient scale, and the speaker gradient scale via arguments.
By default, both the text gradient scale and the speaker gradient scale are set to 1.0.
For better pronunciation and audio quality, increase "text_gradient_scale"; this will slightly reduce speaker similarity.
For better speaker similarity, increase "spk_gradient_scale"; this will slightly degrade pronunciation accuracy and audio quality.
You can also adjust the speaking speed via an argument (default: 1.0).
Note: Using excessively large gradient scales can degrade the audio quality.
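To illustrate how two separate guidance scales can trade off against each other, here is a classifier-free-guidance-style sketch in pure Python. All names are hypothetical and the values are scalars for clarity; the actual sampler mixes score tensors at every diffusion step, so this is a simplification, not the UnitSpeech implementation.

```python
# Sketch: mixing unconditional and conditional score estimates with separate
# text and speaker guidance scales (classifier-free-guidance style).

def guided_score(uncond, text_cond, spk_cond, text_scale=1.0, spk_scale=1.0):
    # Start from the unconditional score and add each conditional correction,
    # weighted by its gradient scale.
    return (uncond
            + text_scale * (text_cond - uncond)
            + spk_scale * (spk_cond - uncond))

# With both scales at the default 1.0, each correction is applied as-is.
base = guided_score(0.0, 0.4, -0.2)
# Raising text_scale pushes the sample further toward the text-conditional
# estimate (better pronunciation) and proportionally away from the others.
text_heavy = guided_score(0.0, 0.4, -0.2, text_scale=2.0)
print(base, text_heavy)
```

This also shows why excessively large scales hurt quality: the update moves far outside the region spanned by the model's own score estimates.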
The code and model weights of UnitSpeech are released under the CC BY-NC-SA 4.0 license.
@misc{kim2023unitspeech,
title={UnitSpeech: Speaker-adaptive Speech Synthesis with Untranscribed Data},
author={Heeseung Kim and Sungwon Kim and Jiheum Yeom and Sungroh Yoon},
year={2023},
eprint={2306.16083},
archivePrefix={arXiv},
primaryClass={cs.SD}
}