PyTorch Implementation of ByteDance's Cross-speaker Emotion Transfer Based on Speaker Condition Layer Normalization and Semi-Supervised Training in Text-To-Speech.
Audio samples are available at /demo.
In the commands below, DATASET stands for the name of a dataset, such as RAVDESS.
You can install the Python dependencies with
pip3 install -r requirements.txt
Also install fairseq (official document, github) to use LConvBlock. Please check here to resolve any issues with installing it.
Note that a Dockerfile is provided for Docker users, but you have to install fairseq manually.
You have to download the pretrained models and put them in output/ckpt/DATASET/.
To extract soft emotion tokens from a reference audio, run
python3 synthesize.py --text "YOUR_DESIRED_TEXT" --speaker_id SPEAKER_ID --ref_audio REF_AUDIO_PATH --restore_step RESTORE_STEP --mode single --dataset DATASET
Or, to use hard emotion tokens from an emotion id, run
python3 synthesize.py --text "YOUR_DESIRED_TEXT" --speaker_id SPEAKER_ID --emotion_id EMOTION_ID --restore_step RESTORE_STEP --mode single --dataset DATASET
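The two single-mode invocations differ only in where the emotion comes from: a reference audio (soft tokens) or an emotion id (hard tokens). As a minimal sketch, the command line could be assembled from Python as follows. It assumes only the flags shown above; the helper name `build_synthesize_command` is hypothetical and not part of the repository, and the argument values in the demo are the same placeholders used in this README.

```python
import shlex
from typing import List, Optional


def build_synthesize_command(
    text: str,
    speaker_id: str,
    restore_step: str,
    dataset: str,
    ref_audio: Optional[str] = None,
    emotion_id: Optional[str] = None,
) -> List[str]:
    """Build the argv list for single-mode synthesis (hypothetical helper)."""
    # Exactly one emotion source is expected: a reference audio or an emotion id.
    if (ref_audio is None) == (emotion_id is None):
        raise ValueError("pass exactly one of ref_audio or emotion_id")

    cmd = ["python3", "synthesize.py", "--text", text, "--speaker_id", str(speaker_id)]
    if ref_audio is not None:
        cmd += ["--ref_audio", ref_audio]        # soft emotion tokens
    else:
        cmd += ["--emotion_id", str(emotion_id)]  # hard emotion tokens
    cmd += ["--restore_step", str(restore_step), "--mode", "single", "--dataset", dataset]
    return cmd


if __name__ == "__main__":
    # Placeholder values; replace them with real ones before running the command.
    cmd = build_synthesize_command(
        "YOUR_DESIRED_TEXT", "SPEAKER_ID", "RESTORE_STEP", "DATASET", emotion_id="EMOTION_ID"
    )
    print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root.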
The dictionary of learned speakers can be found at preprocessed_data/DATASET/speakers.json, and the generated utterances will be put in output/result/.
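To look up a valid SPEAKER_ID, the speaker dictionary can be inspected from Python. The sketch below assumes that speakers.json holds a single JSON object mapping speaker names to ids; that layout is an assumption, not something this README states, and `load_speakers` is a hypothetical helper.

```python
import json
from pathlib import Path
from typing import Any, Dict


def load_speakers(dataset: str, root: str = "preprocessed_data") -> Dict[str, Any]:
    """Load preprocessed_data/DATASET/speakers.json (hypothetical helper)."""
    path = Path(root) / dataset / "speakers.json"
    with path.open(encoding="utf-8") as f:
        speakers = json.load(f)
    # Assumption: the file is a JSON object, e.g. {"speaker_name": id, ...}.
    if not isinstance(speakers, dict):
        raise ValueError(f"expected a JSON object in {path}")
    return speakers
```

For example, `sorted(load_speakers("DATASET"))` would list the speaker names once preprocessing has produced the file.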
Batch inference is also supported. Try
python3 synthesize.py --source preprocessed_data/DATASET/val.txt --restore_step RESTORE_STEP --mode batch --dataset DATASET
to synthesize all utterances in preprocessed_data/DATASET/val.txt. Please note that only the hard emotion tokens from a given emotion id are supported in this mode.
The supported datasets are
To adapt your own language and dataset, follow the instructions here.
./deepspeaker/pretrained_models/.
Run
python3 prepare_align.py --dataset DATASET
for some preparations.
For the forced alignment, Montreal Forced Aligner (MFA) is used to obtain the alignments between the utterances and the phoneme sequences.
Pre-extracted alignments for the datasets are provided here.
You have to unzip the files in preprocessed_data/DATASET/TextGrid/. Alternatively, you can run the aligner yourself.
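The unzip step can also be scripted. The following is a minimal sketch that extracts a downloaded alignment archive into preprocessed_data/DATASET/TextGrid/ and counts the extracted .TextGrid files as a sanity check. The archive path is whatever you downloaded, `extract_alignments` is a hypothetical helper, and the sketch assumes the archive's internal layout already matches what the preprocessing script expects.

```python
import zipfile
from pathlib import Path


def extract_alignments(archive: str, dataset: str, root: str = "preprocessed_data") -> int:
    """Unzip pre-extracted alignments and return the number of .TextGrid files found."""
    target = Path(root) / dataset / "TextGrid"
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        # Assumption: the archive layout is already what preprocess.py expects.
        zf.extractall(target)
    # Count recursively, since alignments may sit in per-speaker subfolders.
    return sum(1 for _ in target.rglob("*.TextGrid"))
```

A return value of 0 suggests that the wrong archive was used or that the extension differs from .TextGrid.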
After that, run the preprocessing script with
python3 preprocess.py --dataset DATASET
Train your model with
python3 train.py --dataset DATASET
Useful options:
Append the --use_amp argument to the above command.
To restrict training to specific GPUs, set CUDA_VISIBLE_DEVICES=<GPU_IDs> at the beginning of the above command.
Use
tensorboard --logdir output/log
to serve TensorBoard on your localhost. The loss curves, synthesized mel-spectrograms, and audios are shown.
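The training options mentioned above (the --use_amp flag and the CUDA_VISIBLE_DEVICES prefix) can also be applied when launching train.py from Python. This is a minimal sketch under the assumption that train.py accepts only the flags shown in this README; `build_train_invocation` is a hypothetical helper, not part of the repository.

```python
import os
from typing import Dict, List, Optional, Sequence, Tuple


def build_train_invocation(
    dataset: str,
    use_amp: bool = False,
    gpu_ids: Optional[Sequence[int]] = None,
) -> Tuple[List[str], Dict[str, str]]:
    """Return (argv, environment) for a training run (hypothetical helper)."""
    cmd = ["python3", "train.py", "--dataset", dataset]
    if use_amp:
        cmd.append("--use_amp")
    # Copy the current environment so the parent process is left untouched.
    env = dict(os.environ)
    if gpu_ids is not None:
        # Equivalent to prefixing the shell command with CUDA_VISIBLE_DEVICES=<GPU_IDs>.
        env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return cmd, env
```

The pair can then be passed to `subprocess.run(cmd, env=env, check=True)` from the repository root.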
'none' and 'DeepSpeaker').
Please cite this repository using the "Cite this repository" option in the About section (top right of the main page).