PortaSpeech - PyTorch Implementation

PyTorch Implementation of PortaSpeech: Portable and High-Quality Generative Text-to-Speech.

Audio Samples

Audio samples are available at /demo.

Model Size

| Module | Normal | Small | Normal (paper) | Small (paper) |
| --- | --- | --- | --- | --- |
| Total | 24M | 7.6M | 21.8M | 6.7M |
| LinguisticEncoder | 3.7M | 1.4M | - | - |
| VariationalGenerator | 11M | 2.8M | - | - |
| FlowPostNet | 9.3M | 3.4M | - | - |
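
Per-module counts like the ones above can be reproduced by summing tensor sizes grouped by top-level module name. The following is a minimal sketch, not code from this repository: the helper names and the toy shapes are hypothetical, and it assumes you have already collected a mapping from parameter names to shapes (for a PyTorch model, for instance, from its state dict, which also includes buffers).

```python
from collections import defaultdict
from math import prod


def count_by_top_level(shapes):
    """Sum tensor sizes per top-level module name.

    `shapes` maps dotted parameter names to tensor shapes, e.g. what
    {k: tuple(v.shape) for k, v in model.state_dict().items()} yields
    for a PyTorch model (note that a state dict also includes buffers).
    """
    totals = defaultdict(int)
    for name, shape in shapes.items():
        top = name.split(".", 1)[0]  # "encoder.linear.weight" -> "encoder"
        totals[top] += prod(shape)
    return dict(totals)


def human(n):
    """Format a count in millions, e.g. 3_700_000 -> '3.7M'."""
    return f"{n / 1e6:.1f}M"


# Hypothetical shapes for illustration only; not the real PortaSpeech layers.
toy_shapes = {
    "encoder.linear.weight": (256, 192),
    "encoder.linear.bias": (256,),
    "postnet.conv.weight": (80, 256, 3),
}
counts = count_by_top_level(toy_shapes)
for module, n in counts.items():
    print(module, n, human(n))
print("Total", sum(counts.values()))
```

Grouping on the first dotted component only works if the model registers its sub-networks as direct attributes, which is an assumption here.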

Quickstart

In the following sections, DATASET refers to the name of a dataset, such as LJSpeech.

Dependencies

You can install the Python dependencies with

pip3 install -r requirements.txt

A Dockerfile is also provided for Docker users.

Inference

Download the pretrained models and put them in output/ckpt/DATASET/.

For a single-speaker TTS, run

python3 synthesize.py --text "YOUR_DESIRED_TEXT" --restore_step RESTORE_STEP --mode single --dataset DATASET

The generated utterances will be put in output/result/.

Batch Inference

Batch inference is also supported. Try

python3 synthesize.py --source preprocessed_data/DATASET/val.txt --restore_step RESTORE_STEP --mode batch --dataset DATASET

to synthesize all utterances in preprocessed_data/DATASET/val.txt.

Controllability

The speaking rate of the synthesized utterances can be controlled by specifying the desired duration ratio. For example, a ratio of 0.8 shortens every predicted duration by 20% and so speeds up the speech:

python3 synthesize.py --text "YOUR_DESIRED_TEXT" --restore_step RESTORE_STEP --mode single --dataset DATASET --duration_control 0.8

Please note that this controllability originates from FastSpeech2 and is not a core focus of PortaSpeech.
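
To make the effect of the duration ratio concrete, here is a minimal sketch of a FastSpeech2-style length regulator in plain Python. It is an illustration under assumptions, not this repository's implementation: the function name, the rounding rule, and the example durations are all hypothetical.

```python
def regulate_length(phoneme_states, durations, duration_control=1.0):
    """Repeat each phoneme state for its scaled, rounded number of frames."""
    if len(phoneme_states) != len(durations):
        raise ValueError("need one duration per phoneme")
    frames = []
    for state, duration in zip(phoneme_states, durations):
        # Scale the predicted duration, then snap it to a whole frame count.
        n_frames = max(0, round(duration * duration_control))
        frames.extend([state] * n_frames)
    return frames


phonemes = ["HH", "AH", "L", "OW"]
predicted = [5, 10, 5, 20]  # hypothetical frame counts, 40 frames in total

normal = regulate_length(phonemes, predicted)
faster = regulate_length(phonemes, predicted, duration_control=0.8)
print(len(normal), len(faster))
```

With a ratio of 0.8 the expanded sequence has 80% as many frames, so the same text is spoken in less time.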

Training

Datasets

The supported datasets are

- LJSpeech: a single-speaker English dataset consisting of 13,100 short audio clips of a female speaker reading passages from 7 non-fiction books, approximately 24 hours in total.

Run
```
python3 prepare_align.py --dataset DATASET
```
to prepare the data for forced alignment.

Montreal Forced Aligner (MFA) is used to obtain the forced alignments between the utterances and the phoneme sequences. Pre-extracted alignments for the datasets are provided here. Unzip the files into preprocessed_data/DATASET/TextGrid/. Alternatively, you can run the aligner yourself.
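
An alignment gives each phoneme a start and end time, which preprocessing has to turn into a number of mel-spectrogram frames. The sketch below shows one common way to do that conversion. It is an assumption-laden illustration rather than this repository's code: the function name is hypothetical, and the sampling rate (22050 Hz) and hop length (256 samples) are assumed values, so check them against your own preprocessing configuration.

```python
def intervals_to_durations(intervals, sampling_rate=22050, hop_length=256):
    """Convert (phoneme, start_sec, end_sec) intervals to frame counts.

    Both boundaries are rounded to the frame grid before subtracting, so
    the durations sum to the rounded total length of the utterance.
    """
    phonemes, durations = [], []
    for phoneme, start, end in intervals:
        start_frame = round(start * sampling_rate / hop_length)
        end_frame = round(end * sampling_rate / hop_length)
        phonemes.append(phoneme)
        durations.append(end_frame - start_frame)
    return phonemes, durations


# Hypothetical alignment of one short word, with times in seconds.
alignment = [
    ("HH", 0.00, 0.08),
    ("AH", 0.08, 0.21),
    ("L", 0.21, 0.30),
    ("OW", 0.30, 0.52),
]
phonemes, durations = intervals_to_durations(alignment)
print(list(zip(phonemes, durations)))
```

Rounding the boundaries rather than the individual lengths avoids accumulating rounding error across a long utterance.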

After that, run the preprocessing script with
```
python3 preprocess.py --dataset DATASET
```

Training

Train your model with
```
python3 train.py --dataset DATASET
```
Useful options:
- To use Automatic Mixed Precision, append the --use_amp argument to the above command.
- The trainer assumes single-node multi-GPU training. To use specific GPUs, put CUDA_VISIBLE_DEVICES=<GPU_IDs> at the beginning of the above command.

TensorBoard

Use
```
tensorboard --logdir output/log
```
to serve TensorBoard on your localhost. It shows the loss curves, synthesized mel-spectrograms, and audio samples.

Normal Model

Small Model Loss

Notes
- For the vocoder, HiFi-GAN and MelGAN are supported.
- VariationalGenerator uses no ReLU activation or LayerNorm, to avoid mashed output.
- Convergence of the word-to-phoneme alignment in LinguisticEncoder is sped up by dividing long words into subwords and sorting the dataset by mel-spectrogram frame length.
- There are two kinds of helper loss for improving the word-to-phoneme alignment: "ctc" and "dga". You can toggle them as follows:
```
# In the train.yaml
aligner:
    helper_type: "dga" # ["dga", "ctc", "none"]
```
  - "dga": Diagonal Guided Attention (DGA) Loss
  - "ctc": Connectionist Temporal Classification (CTC) Loss with forward-sum algorithm
  - If you set "none", no helper loss will be applied during training.
  - Alignment comparison of the three methods ("dga", "ctc", and "none", from top to bottom):
  - The default setting is "dga". Although "ctc" produces the strongest alignment, its output quality and accuracy are worse than with "dga".
  - That said, there is still room to improve the output quality. Audio quality and alignment accuracy seem to be a trade-off.
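
As background for the "dga" option, the sketch below shows the guided attention idea as it is commonly formulated for TTS: attention mass far from the diagonal is penalized, so the loss is lowest for a monotonic, roughly diagonal alignment. This is a plain-Python illustration under assumptions, not the loss used in this repository. The function names and the width parameter g=0.2 are hypothetical choices.

```python
import math


def guided_attention_weights(n_rows, n_cols, g=0.2):
    """Penalty matrix that is 0 on the diagonal and grows away from it."""
    return [
        [1.0 - math.exp(-((r / n_rows - c / n_cols) ** 2) / (2 * g * g))
         for c in range(n_cols)]
        for r in range(n_rows)
    ]


def guided_attention_loss(attention, g=0.2):
    """Mean of the attention weights multiplied element-wise by the penalty."""
    n_rows, n_cols = len(attention), len(attention[0])
    weights = guided_attention_weights(n_rows, n_cols, g)
    total = sum(
        a * w
        for a_row, w_row in zip(attention, weights)
        for a, w in zip(a_row, w_row)
    )
    return total / (n_rows * n_cols)


# A perfectly diagonal alignment versus its mirror image.
diagonal = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]
flipped = [row[::-1] for row in diagonal]
print(guided_attention_loss(diagonal), guided_attention_loss(flipped))
```

The diagonal alignment incurs no penalty while the mirrored one does, which is the pressure that a helper loss of this kind adds during training.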
- Will be extended to a multi-speaker TTS.

Citation

Please cite this repository using the "[Cite this repository](https://github.blog/2021-08-19-enhanced-support-citations-github/)" option in the **About** section (top right of the main page).

References

- [jaywalnut310's VITS](https://github.com/jaywalnut310/vits)
- [jaywalnut310's Glow-TTS](https://github.com/jaywalnut310/glow-tts)
- [keonlee9420's VAENAR-TTS](https://github.com/keonlee9420/VAENAR-TTS)
- [keonlee9420's Comprehensive-Transformer-TTS](https://github.com/keonlee9420/Comprehensive-Transformer-TTS) (CTC Loss)
- [keonlee9420's Comprehensive-Tacotron2](https://github.com/keonlee9420/Comprehensive-Tacotron2) (DGA Loss)
