heatz123 / naturalspeech

A fully working PyTorch implementation of NaturalSpeech (Tan et al., 2022)

NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality

This is an implementation of Microsoft's NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality in PyTorch.

Contributions and pull requests are highly appreciated!

23.02.09: Demo samples (using the first 1800 epochs) are out. (link)

Overview

figure1

NaturalSpeech is a VAE-based model that employs several techniques to improve the prior and simplify the posterior. It differs from VITS in several ways.
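The overview above can be made concrete with the standard conditional-VAE training objective that this family of models (VITS, NaturalSpeech) maximizes; the notation here is generic and not tied to this repo's code:

```latex
\mathcal{L}_{\mathrm{ELBO}}
  = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
  - \mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p_\psi(z \mid y)\big)
```

where x is the speech (waveform/spectrogram), y is the input text, q_φ is the posterior encoder, and p_ψ(z | y) is the text-conditioned prior. Improving the prior and simplifying the posterior both shrink the KL term, which is the prior/posterior mismatch that NaturalSpeech targets.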

Notes

How to train

  0. install the requirements:

    # python >= 3.6
    pip install -r requirements.txt
  1. clone this repository

  2. download The LJ Speech Dataset: link

  3. create a symbolic link to the LJSpeech dataset:

    ln -s /path/to/LJSpeech-1.1/wavs/ DUMMY1
  4. text preprocessing (optional; only needed if you are using a custom dataset):

    1. apt-get install espeak
    2. python preprocess.py --text_index 1 --filelists filelists/ljs_audio_text_train_filelist.txt filelists/ljs_audio_text_val_filelist.txt filelists/ljs_audio_text_test_filelist.txt
  5. duration preprocessing (obtain duration labels using pretrained VITS):

    If you want to skip this step, extract durations/durations.tar.bz2 and overwrite the durations folder.

    1. git clone https://github.com/jaywalnut310/vits.git; cd vits
    2. create a symbolic link to the LJSpeech dataset
      ln -s /path/to/LJSpeech-1.1/wavs/ DUMMY1
    3. download the pretrained VITS model (pretrained_ljs.pth) from the pretrained-models link in the official VITS GitHub repository
    4. setup monotonic alignment search (for VITS inference):
      cd monotonic_align; mkdir monotonic_align; python setup.py build_ext --inplace; cd ..
    5. copy duration preprocessing script to VITS repo: cp /path/to/naturalspeech/preprocess_durations.py .
    6. python3 preprocess_durations.py --weights_path ./pretrained_ljs.pth --filelists filelists/ljs_audio_text_train_filelist.txt.cleaned filelists/ljs_audio_text_val_filelist.txt.cleaned filelists/ljs_audio_text_test_filelist.txt.cleaned
    7. once the duration labels are created, copy them to the naturalspeech repo: cp -r durations/ /path/to/naturalspeech
  6. train (warmup)

    python3 train.py -c configs/ljs.json -m [run_name] --warmup

    Note that ljs.json is a low-resource configuration: it runs for 1500 epochs and does not use the soft-DTW loss. If you want to reproduce the setup described in the paper, use ljs_reproduce.json, which runs for 15000 epochs and uses the soft-DTW loss.

  7. initialize and attach memory bank after warmup:

      python3 attach_memory_bank.py -c configs/ljs.json --weights_path logs/[run_name]/G_xxx.pth

    If you run out of memory, you can specify the --num_samples argument to use only a subset of the samples.

  8. train (resume)

      python3 train.py -c configs/ljs.json -m [run_name]
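The soft-DTW loss mentioned in step 6 replaces the hard min in the dynamic-time-warping recurrence with a smoothed soft-min, which makes the alignment cost differentiable (Cuturi & Blondel, 2017). As a standalone refresher — this is a minimal NumPy sketch, not the loss code used by this repo:

```python
import numpy as np

def soft_min(vals, gamma):
    """Smoothed minimum: -gamma * log(sum(exp(-v / gamma)))."""
    m = min(vals)  # shift by the min for numerical stability
    return m - gamma * np.log(sum(np.exp(-(v - m) / gamma) for v in vals))

def soft_dtw(x, y, gamma=0.1):
    """Soft-DTW distance between 1-D sequences x and y (squared-error cost)."""
    n, m = len(x), len(y)
    R = np.full((n + 1, m + 1), np.inf)  # accumulated soft costs
    R[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            # soft-min over the three DTW predecessors
            R[i, j] = cost + soft_min((R[i - 1, j], R[i, j - 1], R[i - 1, j - 1]), gamma)
    return R[n, m]
```

As gamma approaches 0, the soft-min approaches a hard min and the value approaches the ordinary DTW distance; larger gamma gives a smoother (and slightly lower) surrogate.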

You can use TensorBoard to monitor training:

tensorboard --logdir /path/to/naturalspeech/logs

During each evaluation phase, a selection of samples from the test set is evaluated and saved in the logs/[run_name]/eval directory.
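Step 7 above attaches a memory bank to the warmed-up model. In the NaturalSpeech paper, the memory mechanism re-expresses the latent through query-key-value attention over a bank of learned vectors; the following NumPy sketch only illustrates that shape of computation — the variable names, dimensions, and projections are illustrative, not this repo's API:

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def memory_attention(z, memory, w_q, w_k, w_v):
    """Re-express latent frames z (T, d) as attention over a memory bank (M, d)."""
    q = z @ w_q                      # (T, d) queries from the latent
    k = memory @ w_k                 # (M, d) keys from the memory bank
    v = memory @ w_v                 # (M, d) values from the memory bank
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (T, M) attention weights
    return attn @ v                  # (T, d) memory-conditioned latent

rng = np.random.default_rng(0)
d, T, M = 8, 5, 16                   # feature dim, frames, bank size (illustrative)
z = rng.normal(size=(T, d))          # latent frames
memory = rng.normal(size=(M, d))     # the learnable memory bank
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
out = memory_attention(z, memory, w_q, w_k, w_v)
```

Because every output frame is a convex combination of the bank's value vectors, the bank constrains what the prior has to model, which is why it is initialized from real posterior samples (step 7) rather than from scratch.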

References

- Tan, X., et al. (2022). NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality. arXiv:2205.04421.
- Kim, J., Kong, J., & Son, J. (2021). Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech. ICML 2021.