
Cross-Speaker-Emotion-Transfer - PyTorch Implementation

PyTorch Implementation of ByteDance's Cross-speaker Emotion Transfer Based on Speaker Condition Layer Normalization and Semi-Supervised Training in Text-To-Speech.

Audio Samples

Audio samples are available at /demo.

Quickstart

Throughout this document, DATASET refers to the name of a dataset, e.g., RAVDESS.

Dependencies

You can install the Python dependencies with

pip3 install -r requirements.txt

Also, install fairseq (official document, github) to utilize LConvBlock. Please check here for help resolving any issues when installing it. Note that a Dockerfile is provided for Docker users, but you still have to install fairseq manually.
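If the pip package gives you trouble, an editable source install is one known-working route (these commands follow fairseq's own instructions; adjust the clone path to your setup):

git clone https://github.com/pytorch/fairseq
cd fairseq
pip3 install --editable ./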

Inference

You have to download the pretrained models and put them in output/ckpt/DATASET/.

To extract soft emotion tokens from a reference audio, run

python3 synthesize.py --text "YOUR_DESIRED_TEXT" --speaker_id SPEAKER_ID --ref_audio REF_AUDIO_PATH --restore_step RESTORE_STEP --mode single --dataset DATASET
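For instance, a hypothetical invocation might look like the following (the text, speaker id, reference path, and step are illustrative only):

python3 synthesize.py --text "Hello world, this is a test." --speaker_id Actor_01 --ref_audio ./ref_audio/angry.wav --restore_step 450000 --mode single --dataset RAVDESS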

Or, to use hard emotion tokens from an emotion id, run

python3 synthesize.py --text "YOUR_DESIRED_TEXT" --speaker_id SPEAKER_ID --emotion_id EMOTION_ID --restore_step RESTORE_STEP --mode single --dataset DATASET
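Similarly, a hypothetical hard-token invocation (the emotion id value and its format are illustrative only):

python3 synthesize.py --text "Hello world, this is a test." --speaker_id Actor_01 --emotion_id angry --restore_step 450000 --mode single --dataset RAVDESS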

The dictionary of learned speakers can be found at preprocessed_data/DATASET/speakers.json, and the generated utterances will be put in output/result/.
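To see which speaker ids are valid, you can inspect that file; a minimal Python sketch, assuming speakers.json maps speaker names to integer ids as in FastSpeech 2-style pipelines:

import json

# speakers.json is written by the preprocessing step; replace DATASET accordingly.
with open("preprocessed_data/DATASET/speakers.json") as f:
    speakers = json.load(f)

# List every learned speaker name together with its integer id.
for name, idx in sorted(speakers.items(), key=lambda kv: kv[1]):
    print(f"{idx}: {name}")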

Batch Inference

Batch inference is also supported; try

python3 synthesize.py --source preprocessed_data/DATASET/val.txt --restore_step RESTORE_STEP --mode batch --dataset DATASET

to synthesize all utterances in preprocessed_data/DATASET/val.txt. Note that only hard emotion tokens from a given emotion id are supported in this mode.

Training

Datasets

The supported dataset is RAVDESS (The Ryerson Audio-Visual Database of Emotional Speech and Song): 24 professional actors (12 female, 12 male) vocalizing two lexically-matched statements in a neutral North American accent across several emotions.

You can adapt your own language and dataset by following the guide here.

Preprocessing
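The preprocessing flow in this codebase's FastSpeech 2-style lineage is typically the following two scripts (the script names are an assumption if this repo diverges): prepare_align.py readies the corpus for forced alignment, and preprocess.py then computes the training features (mel-spectrograms, durations, and the speakers.json dictionary mentioned above).

python3 prepare_align.py --dataset DATASET
python3 preprocess.py --dataset DATASET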

Training

Train your model with

python3 train.py --dataset DATASET

Useful options:
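For example, training can usually be resumed from a saved checkpoint, assuming train.py accepts the same --restore_step flag as synthesize.py above:

python3 train.py --dataset DATASET --restore_step RESTORE_STEP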

TensorBoard

Use

tensorboard --logdir output/log

to serve TensorBoard on your localhost. It shows the loss curves, synthesized mel-spectrograms, and audio samples.
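If training runs on a remote machine, TensorBoard's standard flags let you expose the dashboard to other hosts (the port is illustrative):

tensorboard --logdir output/log --port 6006 --bind_all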

Notes

Citation

Please cite this repository via the "Cite this repository" link in the About section (top right of the repository's main page).

References