
An official implementation of "UnitSpeech: Speaker-adaptive Speech Synthesis with Untranscribed Data"
https://unitspeech.github.io/

UnitSpeech: Speaker-adaptive Speech Synthesis with Untranscribed Data (INTERSPEECH 2023, Oral)

Heeseung Kim, Sungwon Kim, Jiheum Yeom, Sungroh Yoon


Open In Colab

Paper

Audio demo

Updates

2023.07.04 : We changed the normalization method for better speaker similarity.

2023.06.29 : We updated our code and checkpoints for better pronunciation.

2023.06.28 : We updated components relative to the INTERSPEECH version.

Warning: Ethical & Legal Considerations

  1. UnitSpeech was created primarily to facilitate research.
  2. When using samples generated with this model, you must clearly disclose that they were generated with AI technology, and you must credit the sources of the audio used in the generation process.
  3. Users take full responsibility for any negative outcomes and legal or ethical issues arising from misuse of the model.
  4. As a precaution against misuse, we plan to release a classification model that can detect samples generated with this model.

TO DO

Installation

Tested on Ubuntu 20.04.5 LTS, Python 3.8, Anaconda (2023.03-1) environment
First, install the packages required by the IPA phonemizer.

sudo apt-get install espeak=1.48.04+dfsg-8build1 espeak-ng=1.50+dfsg-6

If you cannot install these specific versions of espeak and espeak-ng (e.g., on Ubuntu 18.04 or earlier), please install the versions that are available.
Note: A different version of espeak-ng may phonemize text differently, which can affect pronunciation accuracy.
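Since the espeak-ng version affects phonemization, it can be useful to record which version your samples were generated with. A small stdlib-only helper sketch (it assumes espeak-ng is on PATH when installed, and returns None otherwise):

```python
import shutil
import subprocess

def espeak_ng_version():
    """Return the installed espeak-ng version banner, or None if absent."""
    exe = shutil.which("espeak-ng")
    if exe is None:
        return None
    # `espeak-ng --version` prints a one-line banner to stdout.
    out = subprocess.run([exe, "--version"], capture_output=True, text=True)
    return out.stdout.strip()

if __name__ == "__main__":
    print(espeak_ng_version())
```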

After that, create a conda environment and install the unitspeech package and the package required for extracting speaker embeddings.

conda create -n unitspeech python=3.8
conda activate unitspeech
git clone https://github.com/gmltmd789/UnitSpeech.git
cd UnitSpeech
pip install -e .
pip install --no-deps s3prl==0.4.10

Pretrained Models

We provide the pretrained models.

| File Name | Usage |
| --- | --- |
| contentvec_encoder.pt | Used for any-to-any voice conversion tasks. |
| unit_encoder.pt | Used for fine-tuning and unit-based speech synthesis tasks (e.g., adaptive speech synthesis for speech-to-unit translation). |
| text_encoder.pt | Used for adaptive text-to-speech tasks. |
| duration_predictor.pt | Used for adaptive text-to-speech tasks. |
| pretrained_decoder.pt | Used for all adaptive speech synthesis tasks. |
| speaker_encoder.pt | Used for extracting speaker embeddings. |
| bigvgan.pt | Vocoder checkpoint. |
| bigvgan-config.json | Configuration for the vocoder. |

After downloading the files, please arrange them in the following structure.

UnitSpeech/...
    unitspeech/...
        checkpoints/...
            contentvec_encoder.pt
            duration_predictor.pt
            pretrained_decoder.pt
            text_encoder.pt
            unit_encoder.pt
            ...
        speaker_encoder/...
            checkpts/...
                speaker_encoder.pt
            ...
        vocoder/...
            checkpts/...
                bigvgan.pt
                bigvgan-config.json
            ...
        ...
    ...
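Before running fine-tuning or inference, you can confirm the checkpoints are where the scripts expect them. A small stdlib-only sketch of such a check (the file list mirrors the layout above; this helper is not part of the repository):

```python
from pathlib import Path

# Expected checkpoint locations, relative to the repository root.
REQUIRED_FILES = [
    "unitspeech/checkpoints/contentvec_encoder.pt",
    "unitspeech/checkpoints/duration_predictor.pt",
    "unitspeech/checkpoints/pretrained_decoder.pt",
    "unitspeech/checkpoints/text_encoder.pt",
    "unitspeech/checkpoints/unit_encoder.pt",
    "unitspeech/speaker_encoder/checkpts/speaker_encoder.pt",
    "unitspeech/vocoder/checkpts/bigvgan.pt",
    "unitspeech/vocoder/checkpts/bigvgan-config.json",
]

def missing_checkpoints(repo_root="."):
    """Return the required checkpoint files that are not present."""
    root = Path(repo_root)
    return [f for f in REQUIRED_FILES if not (root / f).is_file()]

if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        print("Missing files:")
        for f in missing:
            print("  " + f)
    else:
        print("All checkpoints in place.")
```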

Fine-tuning

The decoder is fine-tuned on the target speaker's voice using the unit encoder. We recommend an English reference speech of at least 5-10 seconds.

python scripts/finetune.py \
--reference_path REFERENCE_SPEECH_PATH \
--output_decoder_path FILEPATH1/FINETUNED_DECODER.pt

Running this script saves your personalized decoder to "FILEPATH1/FINETUNED_DECODER.pt".
With the fine-tuned decoder, you can perform adaptive text-to-speech and any-to-any voice conversion, as described below.

By default, fine-tuning runs in fp32 with the Adam optimizer at a learning rate of 2e-5 for 500 iterations.
You can adjust these settings via the --fp16_run, --learning_rate, and --n_iters arguments.
For speakers with distinctive voices, increasing the number of fine-tuning iterations can yield better results.

Inference

# script for adaptive text-to-speech
python scripts/text_to_speech.py \
--text "TEXT_TO_GENERATE" \
--decoder_path FILEPATH1/FINETUNED_DECODER.pt \
--generated_sample_path FILEPATH2/PATH_TO_SAVE_SYNTHESIZED_SPEECH.wav

# script for any-to-any voice conversion
python scripts/voice_conversion.py \
--source_path SOURCE_SPEECH_PATH_TO_CONVERT.wav \
--decoder_path FILEPATH1/FINETUNED_DECODER.pt \
--generated_sample_path FILEPATH2/PATH_TO_SAVE_SYNTHESIZED_SPEECH.wav

You can adjust the number of diffusion steps, the text gradient scale, and the speaker gradient scale via arguments.

By default, both the text gradient scale and the speaker gradient scale are set to 1.0.
For better pronunciation and audio quality, increase "text_gradient_scale"; this slightly reduces speaker similarity.
For better speaker similarity, increase "spk_gradient_scale"; this slightly degrades pronunciation accuracy and audio quality.

You can also adjust the speaking speed via an argument (default: 1.0).

Note: Excessively large gradient scales can degrade the audio quality.
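The two scales behave like classifier-free guidance weights: each one extrapolates the diffusion model's prediction away from the unconditional prediction and toward the corresponding condition, and pushing either too far moves the result off the data manifold, which is why extreme scales hurt quality. A minimal scalar sketch of this general pattern (illustrative only, not the repository's actual implementation):

```python
def guided_prediction(uncond, text_cond, spk_cond,
                      text_scale=1.0, spk_scale=1.0):
    """Combine unconditional and conditional diffusion predictions.

    A higher text_scale pushes the output toward the text condition
    (better pronunciation); a higher spk_scale pushes it toward the
    speaker condition (better similarity). Both trade off against
    each other and, at extreme values, against audio quality.
    """
    return (uncond
            + text_scale * (text_cond - uncond)
            + spk_scale * (spk_cond - uncond))

# With both scales at 1.0, the two conditional offsets simply add up.
print(guided_prediction(0.0, 0.5, 0.3))  # 0.8
```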

License

The code and model weights of UnitSpeech are released under the CC BY-NC-SA 4.0 license.

References

Citation

@misc{kim2023unitspeech,
      title={UnitSpeech: Speaker-adaptive Speech Synthesis with Untranscribed Data}, 
      author={Heeseung Kim and Sungwon Kim and Jiheum Yeom and Sungroh Yoon},
      year={2023},
      eprint={2306.16083},
      archivePrefix={arXiv},
      primaryClass={cs.SD}
}