keonlee9420 / DiffSinger

PyTorch implementation of DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism (focused on DiffSpeech)
MIT License
230 stars 30 forks source link
ddpm diffsinger diffusion diffusion-models english fastspeech neural-tts non-autoregressive pytorch singing-voice speech-synthesis text-to-speech tts

DiffSinger - PyTorch Implementation

PyTorch implementation of DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism (focused on DiffSpeech).

Repository Status

Quickstart

DATASET refers to the names of datasets such as LJSpeech in the following documents.

MODEL refers to the types of model (choose from 'naive', 'aux', 'shallow').

Dependencies

You can install the Python dependencies with

pip3 install -r requirements.txt

Inference

You have to download the pretrained models and put them in

For English single-speaker TTS, run

python3 synthesize.py --text "YOUR_DESIRED_TEXT" --model MODEL --restore_step RESTORE_STEP --mode single --dataset DATASET

The generated utterances will be put in output/result/.

Batch Inference

Batch inference is also supported, try

python3 synthesize.py --source preprocessed_data/LJSpeech/val.txt --model MODEL --restore_step RESTORE_STEP --mode batch --dataset DATASET

to synthesize all utterances in preprocessed_data/LJSpeech/val.txt.

Controllability

The pitch/volume/speaking rate of the synthesized utterances can be controlled by specifying the desired pitch/energy/duration ratios. For example, one can increase the speaking rate by 20 % and decrease the volume by 20 % by

python3 synthesize.py --text "YOUR_DESIRED_TEXT" --model MODEL --restore_step RESTORE_STEP --mode single --dataset DATASET --duration_control 0.8 --energy_control 0.8

Please note that the controllability is originated from FastSpeech2 and not a vital interest of DiffSpeech.

Training

Datasets

The supported datasets are

Preprocessing

First, run

python3 prepare_align.py --dataset DATASET

for some preparations.

For the forced alignment, Montreal Forced Aligner (MFA) is used to obtain the alignments between the utterances and the phoneme sequences. Pre-extracted alignments for the datasets are provided here. You have to unzip the files in preprocessed_data/DATASET/TextGrid/. Alternately, you can run the aligner by yourself.

After that, run the preprocessing script by

  python3 preprocess.py --dataset DATASET

Training

You can train three types of model: 'naive', 'aux', and 'shallow'.

TensorBoard

Use

tensorboard --logdir output/log/LJSpeech

to serve TensorBoard on your localhost. The loss curves, synthesized mel-spectrograms, and audios are shown.

Naive Diffusion

Shallow Diffusion

Loss Comparison

Notes

  1. (Naive version of DiffSpeech) The number of learnable parameters is 27.767M, which is similar to the original paper (27.722M).
  2. Unfortunately, the predicted boundary (of LJSpeech) for the shallow diffusion in the current implementation is 100, which is the full timesteps of the naive diffusion so that there is no advantage on diffusion steps.
  3. Use HiFi-GAN instead of Parallel WaveGAN (PWG) for vocoding.

Citation

@misc{lee2021diffsinger,
  author = {Lee, Keon},
  title = {DiffSinger},
  year = {2021},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/keonlee9420/DiffSinger}}
}

References