A non-autoregressive end-to-end text-to-speech model (generating a waveform directly from text), supporting a family of SOTA unsupervised duration modeling methods. This project grows with the research community, aiming for the ultimate E2E-TTS. Any suggestions toward the best end-to-end TTS are welcome :)
`DATASET` refers to a dataset name such as `LJSpeech` or `VCTK` in the following documents.
You can install the Python dependencies with

```
pip3 install -r requirements.txt
```
A `Dockerfile` is also provided for Docker users.
You have to download the [pretrained models]() (will be shared soon) and put them in `output/ckpt/DATASET/`.
For a single-speaker TTS, run

```
python3 synthesize.py --text "YOUR_DESIRED_TEXT" --restore_step RESTORE_STEP --mode single --dataset DATASET
```
For a multi-speaker TTS, run

```
python3 synthesize.py --text "YOUR_DESIRED_TEXT" --speaker_id SPEAKER_ID --restore_step RESTORE_STEP --mode single --dataset DATASET
```

The dictionary of learned speakers can be found at `preprocessed_data/DATASET/speakers.json`, and the generated utterances will be put in `output/result/`.
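To see which speaker IDs are available for `--speaker_id`, you can inspect `speakers.json`. Below is a minimal sketch, assuming the file is a flat JSON object mapping speaker names to integer IDs (verify the exact schema against your own preprocessed data); the demo uses a small stand-in file in place of `preprocessed_data/DATASET/speakers.json`:

```python
import json
import os
import tempfile

def load_speakers(path):
    """Load the speaker-name -> integer-ID mapping from a speakers.json file.

    Assumption: the file is a flat JSON object mapping names to IDs;
    verify against your own preprocessed_data/DATASET/speakers.json.
    """
    with open(path) as f:
        return json.load(f)

# Demo with a small stand-in file (replace the path with
# preprocessed_data/DATASET/speakers.json in practice).
demo = {"p225": 0, "p226": 1}
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(demo, f)
    path = f.name

speakers = load_speakers(path)
for name, speaker_id in sorted(speakers.items(), key=lambda kv: kv[1]):
    print(f"{speaker_id}: {name}")
os.remove(path)
```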
Batch inference is also supported; try

```
python3 synthesize.py --source preprocessed_data/DATASET/val.txt --restore_step RESTORE_STEP --mode batch --dataset DATASET
```

to synthesize all utterances in `preprocessed_data/DATASET/val.txt`.
The pitch/volume/speaking rate of the synthesized utterances can be controlled by specifying the desired pitch/energy/duration ratios. For example, one can increase the speaking rate by 20% and decrease the volume by 20% with

```
python3 synthesize.py --text "YOUR_DESIRED_TEXT" --restore_step RESTORE_STEP --mode single --dataset DATASET --duration_control 0.8 --energy_control 0.8
```
Add `--speaker_id SPEAKER_ID` for a multi-speaker TTS.
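In FastSpeech2-style models, such control ratios are typically applied multiplicatively to the model's predictions before waveform generation. The sketch below illustrates that convention for the duration ratio only; it is a hypothetical stand-in, not this repository's actual length regulator:

```python
def apply_duration_control(durations, ratio):
    """Scale predicted per-phoneme durations (in frames) by a control ratio.

    A ratio below 1.0 shortens durations (faster speech); above 1.0
    lengthens them (slower speech). Each duration is kept at a minimum
    of one frame. Illustrative only -- the real logic lives inside the
    model's length regulator.
    """
    return [max(1, round(d * ratio)) for d in durations]

pred = [4, 8, 6, 2]  # hypothetical predicted frame counts per phoneme
faster = apply_duration_control(pred, 0.8)
print(faster)
print(sum(faster), "frames vs", sum(pred), "frames")
```

Energy and pitch controls follow the same multiplicative idea, scaling the predicted energy and pitch contours instead of durations.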
The supported datasets are LJSpeech (single-speaker) and VCTK (multi-speaker). Any other single-speaker TTS dataset (e.g., Blizzard Challenge 2013) or multi-speaker TTS dataset (e.g., LibriTTS) can be added by following LJSpeech or VCTK, respectively. Moreover, your own language and dataset can be adapted following here.
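When adapting your own dataset, the simplest route is an LJSpeech-style layout: a `metadata.csv` of pipe-separated `id|raw text|normalized text` rows alongside a `wavs/` folder. A small hypothetical validator for that layout (the three-field convention is LJSpeech's; adjust if your corpus differs):

```python
import csv
import io

def parse_ljspeech_metadata(text):
    """Parse LJSpeech-style metadata: one 'id|raw text|normalized text'
    row per line, pipe-separated, no header. Returns a list of dicts
    and raises ValueError on malformed rows."""
    rows = []
    reader = csv.reader(io.StringIO(text), delimiter="|", quoting=csv.QUOTE_NONE)
    for line_no, fields in enumerate(reader, start=1):
        if len(fields) != 3:
            raise ValueError(f"line {line_no}: expected 3 fields, got {len(fields)}")
        rows.append({"id": fields[0], "text": fields[1], "normalized": fields[2]})
    return rows

sample = "LJ001-0001|Printing, in the only sense|printing, in the only sense\n"
print(parse_ljspeech_metadata(sample))
```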
Put the pretrained speaker embedder (DeepSpeaker) in `./deepspeaker/pretrained_models/`. Then run the preprocessing script:

```
python3 preprocess.py --dataset DATASET
```
Train your model with

```
python3 train.py --dataset DATASET
```
Useful options:

- To select the GPUs to use, prepend `CUDA_VISIBLE_DEVICES=<GPU_IDs>` at the beginning of the above command.
- Use

  ```
  tensorboard --logdir output/log
  ```

  to serve TensorBoard on your localhost.
The speaker embedder for the multi-speaker setting is configurable (supported options: `'none'` and `'DeepSpeaker'`).
Please cite this repository via the "Cite this repository" link in the About section (top right of the main page).