keonlee9420 / Comprehensive-Transformer-TTS

MIT License

Comprehensive-Transformer-TTS - PyTorch Implementation

A non-autoregressive, Transformer-based TTS that supports a family of SOTA transformers with supervised and unsupervised duration modeling. This project grows with the research community, aiming to achieve the ultimate TTS. Any suggestions toward the best non-AR TTS are welcome :)


Prosody Modelings (WIP)

Supervised Duration Modelings

Unsupervised Duration Modelings

Transformer Performance Comparison on LJSpeech (1 TITAN RTX 24G / 16 batch size)

| Model | Memory Usage | Training Time (1K steps) |
| --- | --- | --- |
| Fastformer (lucidrains') | 10531MiB / 24220MiB | 4m 25s |
| Fastformer (wuch15's) | 10515MiB / 24220MiB | 4m 45s |
| Long-Short Transformer | 10633MiB / 24220MiB | 5m 26s |
| Conformer | 18903MiB / 24220MiB | 7m 4s |
| Reformer | 10293MiB / 24220MiB | 10m 16s |
| Transformer | 7909MiB / 24220MiB | 4m 51s |
| Transformer_fs2 | 11571MiB / 24220MiB | 4m 53s |
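
To compare the blocks at a glance, the figures above can be normalized against the plain Transformer row. The following is a minimal sketch that only re-expresses the numbers from the table; the names `RESULTS`, `to_seconds`, and `relative_to` are illustrative and not part of the repository.

```python
import re

# Memory (MiB) and time per 1K training steps, copied from the table above.
RESULTS = {
    "Fastformer (lucidrains')": (10531, "4m 25s"),
    "Fastformer (wuch15's)": (10515, "4m 45s"),
    "Long-Short Transformer": (10633, "5m 26s"),
    "Conformer": (18903, "7m 4s"),
    "Reformer": (10293, "10m 16s"),
    "Transformer": (7909, "4m 51s"),
    "Transformer_fs2": (11571, "4m 53s"),
}


def to_seconds(text):
    """Convert a string such as '4m 25s' into a number of seconds."""
    match = re.fullmatch(r"(\d+)m (\d+)s", text)
    if match is None:
        raise ValueError(f"unexpected time format: {text!r}")
    minutes, seconds = map(int, match.groups())
    return minutes * 60 + seconds


def relative_to(baseline="Transformer"):
    """Return {model: (memory_ratio, time_ratio)} relative to the baseline row."""
    base_mem, base_time = RESULTS[baseline]
    base_sec = to_seconds(base_time)
    return {
        name: (mem / base_mem, to_seconds(time) / base_sec)
        for name, (mem, time) in RESULTS.items()
    }


if __name__ == "__main__":
    for name, (mem_ratio, time_ratio) in relative_to().items():
        print(f"{name:26s} memory x{mem_ratio:.2f}  time x{time_ratio:.2f}")
```

These ratios describe only the single-GPU, batch-size-16 setup reported above and may not carry over to other hardware or batch sizes.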

Select the type of building block with

# In the model.yaml
block_type: "transformer_fs2" # ["transformer_fs2", "transformer", "fastformer", "lstransformer", "conformer", "reformer"]

Select the type of prosody modeling with

# In the model.yaml
  model_type: "none" # ["none", "du2021", "liu2021"]

Select the type of duration modeling with

# In the model.yaml
  learn_alignment: True # True for unsupervised modeling, and False for supervised modeling
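
Since each option accepts only a fixed set of values, a typo in model.yaml can be caught before training starts. The following is a minimal sketch, assuming the three options have already been read into a flat Python dict. The function `check_model_options` is hypothetical and not part of the repository, and the actual nesting of these keys in model.yaml may differ.

```python
# Allowed values, copied from the comments in the snippets above.
BLOCK_TYPES = {"transformer_fs2", "transformer", "fastformer",
               "lstransformer", "conformer", "reformer"}
PROSODY_TYPES = {"none", "du2021", "liu2021"}


def check_model_options(options):
    """Validate the three options and return a short description.

    `options` is a hypothetical flat dict; the real model.yaml may nest
    these keys under other sections.
    """
    block = options["block_type"]
    prosody = options["model_type"]
    learn_alignment = options["learn_alignment"]

    if block not in BLOCK_TYPES:
        raise ValueError(f"unknown block_type: {block!r}")
    if prosody not in PROSODY_TYPES:
        raise ValueError(f"unknown model_type: {prosody!r}")
    if not isinstance(learn_alignment, bool):
        raise ValueError("learn_alignment must be True or False")

    # True selects unsupervised duration modeling, False selects supervised.
    duration = "unsupervised" if learn_alignment else "supervised"
    return f"{block} blocks, prosody={prosody}, {duration} duration modeling"


if __name__ == "__main__":
    print(check_model_options(
        {"block_type": "transformer_fs2", "model_type": "none", "learn_alignment": True}
    ))
```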


In the following, DATASET refers to the name of a dataset such as LJSpeech or VCTK.


You can install the Python dependencies with

pip3 install -r requirements.txt

A Dockerfile is also provided for Docker users.


Download the pretrained models and put them in output/ckpt/DATASET/. The models were trained with unsupervised duration modeling and the "transformer_fs2" building block.

For a single-speaker TTS, run

python3 --text "YOUR_DESIRED_TEXT" --restore_step RESTORE_STEP --mode single --dataset DATASET

For a multi-speaker TTS, run

python3 --text "YOUR_DESIRED_TEXT" --speaker_id SPEAKER_ID --restore_step RESTORE_STEP --mode single --dataset DATASET

The dictionary of learned speakers can be found at preprocessed_data/DATASET/speakers.json, and the generated utterances will be put in output/result/.
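
To find the value to pass as SPEAKER_ID, the speaker dictionary can be inspected from Python. This is a minimal sketch, assuming speakers.json is a JSON object that maps speaker names to integer IDs. That layout and the helper `load_speaker_ids` are assumptions, so check the file produced for your dataset.

```python
import json
from pathlib import Path


def load_speaker_ids(dataset, root="preprocessed_data"):
    """Load <root>/<dataset>/speakers.json and return it as a dict.

    Assumes the file holds a JSON object mapping speaker name -> ID.
    """
    path = Path(root) / dataset / "speakers.json"
    with path.open(encoding="utf-8") as f:
        speakers = json.load(f)
    if not isinstance(speakers, dict):
        raise ValueError(f"{path} does not contain a JSON object")
    return speakers
```

For example, `load_speaker_ids("VCTK")` would return the mapping for a preprocessed VCTK dataset, if one exists under preprocessed_data/.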

Batch Inference

Batch inference is also supported. Try

python3 --source preprocessed_data/DATASET/val.txt --restore_step RESTORE_STEP --mode batch --dataset DATASET

to synthesize all utterances in preprocessed_data/DATASET/val.txt.


The pitch, volume, and speaking rate of the synthesized utterances can be controlled by specifying the desired pitch, energy, and duration ratios. For example, one can speed up the speech by scaling durations by 0.8 and lower the volume by scaling energy by 0.8 with

python3 --text "YOUR_DESIRED_TEXT" --restore_step RESTORE_STEP --mode single --dataset DATASET --duration_control 0.8 --energy_control 0.8

Add --speaker_id SPEAKER_ID for a multi-speaker TTS.
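
The effect of the duration ratio can be illustrated with simple arithmetic. This is a minimal sketch, assuming the ratio multiplies each predicted per-phoneme duration (in mel frames) before rounding, as is common in FastSpeech 2-style length regulators. The function `scale_durations` and the example durations are hypothetical and do not reproduce the repository's code.

```python
def scale_durations(durations, duration_control=1.0):
    """Scale per-phoneme durations (in frames) by a ratio and round to integers.

    Results are clamped at zero so that no duration becomes negative.
    """
    return [max(0, round(d * duration_control)) for d in durations]


if __name__ == "__main__":
    predicted = [5, 10, 3]  # made-up frame counts for three phonemes
    scaled = scale_durations(predicted, duration_control=0.8)
    print(scaled, sum(predicted), sum(scaled))
```

Under this assumption, a ratio of 0.8 makes the utterance about 20% shorter, which corresponds to a speaking rate of roughly 1 / 0.8 = 1.25 times the original.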



The supported datasets are

Any single-speaker TTS dataset (e.g., Blizzard Challenge 2013) or multi-speaker TTS dataset (e.g., LibriTTS) can be added by following the LJSpeech and VCTK setups, respectively. The project can also be adapted to your own language and dataset.



Train your model with

python3 --dataset DATASET

Useful options:



tensorboard --logdir output/log

This serves TensorBoard on your localhost. It shows the loss curves, synthesized mel-spectrograms, and audio samples.



Ablation Study

| ID | Model | Block Type | Pitch Conditioning |
| --- | --- | --- | --- |
| 1 | LJSpeech_transformer_fs2_cwt | transformer_fs2 | continuous wavelet transform |
| 2 | LJSpeech_transformer_cwt | transformer | continuous wavelet transform |
| 3 | LJSpeech_transformer_frame | transformer | frame-level f0 |
| 4 | LJSpeech_transformer_ph | transformer | phoneme-level f0 |

Observations from

  1. changing the building block (ID 1~2): "transformer_fs2" seems to be more optimized in terms of memory usage and model size, so the training time and mel losses decrease. However, the output quality does not improve dramatically, and the "transformer" block sometimes generates speech with an even more stable pitch contour than "transformer_fs2".
  2. changing the pitch conditioning (ID 2~4): there is a trade-off between audio quality (pitch stability) and expressiveness.
    • audio quality: "ph" >= "frame" > "cwt"
    • expressiveness: "cwt" > "frame" > "ph"


Updates Log


Please cite this repository using the "Cite this repository" option in the About section (top right of the main page).
