PyTorch Implementation of Google's Parallel Tacotron 2: A Non-Autoregressive Neural TTS Model with Differentiable Duration Modeling
2021.05.25: Only the soft-DTW remains the last hurdle!
Following the author's advice on the implementation, I tested each module one by one under a supervised duration signal with L1 loss (as in FastSpeech2). So far, I can confirm that all modules except soft-DTW work well, as shown below (from top to bottom: synthesized spectrogram, GT spectrogram, residual alignment, and W from LearnedUpsampling).
For the details, please check the latest commit log and the updated Implementation Issues section. Also, you can find the ongoing experiments at https://github.com/keonlee9420/FastSpeech2/commits/ptaco2.
2021.05.15: Implementation done. Sanity checks on training and inference passed, but the model still cannot converge.
I'm waiting for your contribution!
Please inform me if you find any mistakes in my implementation or have any valuable advice for training the model successfully. See the Implementation Issues section.
You can install the Python dependencies with
pip3 install -r requirements.txt
Install fairseq (official document, github) to utilize LConvBlock. Please check #5 to resolve any issues with installation.
The supported datasets:
After downloading the datasets, set the corpus_path in preprocess.yaml and run the preparation script:
python3 prepare_data.py config/LJSpeech/preprocess.yaml
Then, run the preprocessing script:
python3 preprocess.py config/LJSpeech/preprocess.yaml
Train your model with
python3 train.py -p config/LJSpeech/preprocess.yaml -m config/LJSpeech/model.yaml -t config/LJSpeech/train.yaml
The model cannot converge yet. I'm debugging, but progress would be boosted by your awesome contribution!
For a single inference, run
python3 synthesize.py --text "YOUR_DESIRED_TEXT" --restore_step 900000 --mode single -p config/LJSpeech/preprocess.yaml -m config/LJSpeech/model.yaml -t config/LJSpeech/train.yaml
The generated utterances will be saved in output/result/.
Batch inference is also supported; try
python3 synthesize.py --source preprocessed_data/LJSpeech/val.txt --restore_step 900000 --mode batch -p config/LJSpeech/preprocess.yaml -m config/LJSpeech/model.yaml -t config/LJSpeech/train.yaml
to synthesize all utterances in preprocessed_data/LJSpeech/val.txt.
Use
tensorboard --logdir output/log/LJSpeech
to serve TensorBoard on your localhost.
Overall, normalization or activation layers that are not suggested in the original paper are arranged as needed to prevent NaN values (gradients) in the forward and backward passes. (A NaN indicates that something is wrong in the network.)
- Use the `FFTBlock` of FastSpeech2 for the transformer block of the text encoder.
- Use dropout `0.2` for the `ConvBlock` of the text encoder.
- Use the `grapheme_to_phoneme` function for grapheme-to-phoneme conversion (see `./text/__init__`).
- Use an `80`-channel mel-spectrogram instead of `128` bins.
- Use `nn.SiLU()` for the swish activation.
- When obtaining `W` and `C`, the concatenation operation is applied among `S`, `E`, and `V` after frame-domain (`T` domain) broadcasting of `V`.
- Use `LConvBlock` and regular sinusoidal positional embedding.
- Apply `nn.Tanh()` to each `LConvBlock` output (following the activation pattern of the decoder part of FastSpeech2).
- Use `model/soft_dtw_cuda.py`, reflecting the recursion suggested in the original paper.
- In soft-DTW, only the intermediate result `E` is computed by default; when employed as a loss function, a Jacobian product is added to return the target derivative of `R` w.r.t. the input `X`.
- The maximum batch size is limited to `8` on a 24GiB GPU (TITAN RTX) due to the space complexity of the soft-DTW loss.
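As a sketch of the concatenation step described above: `V` holds per-token representations and is broadcast over the frame (`T`) domain before being concatenated with `S` and `E` along the channel axis. The shapes below are my assumption based on the paper's description of Learned Upsampling, not taken from this repo:

```python
import numpy as np

# Hypothetical shapes (an assumption, following the paper's description):
#   S, E: (B, T, N) per-frame/per-token boundary features
#   V:    (B, N, d) per-token representations from the text encoder
B, T, N, d = 2, 7, 3, 8
rng = np.random.default_rng(0)
S = rng.standard_normal((B, T, N))
E = rng.standard_normal((B, T, N))
V = rng.standard_normal((B, N, d))

# Broadcast V over the frame (T) domain...
V_b = np.broadcast_to(V[:, None, :, :], (B, T, N, d))

# ...then concatenate S, E, and V along the last (channel) axis; the
# result feeds the projections that produce W and C in Learned Upsampling.
feats = np.concatenate([S[..., None], E[..., None], V_b], axis=-1)
assert feats.shape == (B, T, N, 2 + d)  # (2, 7, 3, 10)
```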
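The soft-DTW recursion mentioned above can be sketched in plain Python. This is an illustrative O(mn) reference of the standard formulation (Cuturi & Blondel, 2017), not the repo's implementation; `model/soft_dtw_cuda.py` computes the same recursion, and its backward pass, with a CUDA kernel:

```python
import math

def soft_dtw(D, gamma=0.1):
    """Naive soft-DTW forward recursion.

    D is an m x n list-of-lists of pairwise distances between two
    sequences (e.g. predicted and target mel frames). Returns R[m][n],
    the soft-DTW discrepancy. The quadratic R table is what makes the
    memory cost grow with sequence length, limiting batch size.
    """
    m, n = len(D), len(D[0])
    INF = float("inf")
    R = [[INF] * (n + 1) for _ in range(m + 1)]
    R[0][0] = 0.0

    def softmin(*xs):
        # -gamma * log(sum(exp(-x / gamma))), computed stably
        lo = min(xs)
        if lo == INF:
            return INF
        s = sum(math.exp(-(x - lo) / gamma) for x in xs if x != INF)
        return lo - gamma * math.log(s)

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # soft minimum over the three predecessor cells
            R[i][j] = D[i - 1][j - 1] + softmin(
                R[i - 1][j], R[i][j - 1], R[i - 1][j - 1]
            )
    return R[m][n]
```

Note that as `gamma` approaches 0, `softmin` approaches a hard minimum and the value approaches the classic DTW cost; larger `gamma` gives a smoother, fully differentiable objective.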
@misc{lee2021parallel_tacotron2,
author = {Lee, Keon},
title = {Parallel-Tacotron2},
year = {2021},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/keonlee9420/Parallel-Tacotron2}}
}