karamarieliu / gst_tacotron2_wavenet

Tacotron-2:

TensorFlow implementation of Google's Tacotron-2, a deep neural network architecture described in the paper: Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions

Repository Structure:

Tacotron-2
├── datasets
├── en_UK       (0)
│   └── by_book
│       └── female
├── en_US       (0)
│   └── by_book
│       ├── female
│       └── male
├── LJSpeech-1.1    (0)
│   └── wavs
├── logs-Tacotron   (2)
│   ├── eval_-dir
│   │   ├── plots
│   │   └── wavs
│   ├── mel-spectrograms
│   ├── plots
│   ├── pretrained
│   └── wavs
├── logs-Wavenet    (4)
│   ├── eval-dir
│   │   ├── plots
│   │   └── wavs
│   ├── plots
│   ├── pretrained
│   └── wavs
├── papers
├── tacotron
│   ├── models
│   └── utils
├── tacotron_output (3)
│   ├── eval
│   ├── gta
│   ├── logs-eval
│   │   ├── plots
│   │   └── wavs
│   └── natural
├── wavenet_output  (5)
│   ├── plots
│   └── wavs
├── training_data   (1)
│   ├── audio
│   ├── linear
│   └── mels
└── wavenet_vocoder
    └── models

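The tree above shows the current state of the repository (separate training, one step at a time). The numbers in parentheses indicate the pipeline step that obtains or produces each folder: (0) datasets, (1) preprocessed training data, (2) Tacotron training logs, (3) Tacotron synthesis output, (4) WaveNet training logs, (5) WaveNet synthesis output.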
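
As a rough aid for the step-by-step workflow, the sketch below reports which of the numbered folders from the tree already exist. The folder names come from the tree; the step descriptions and the helper names (`pipeline_status`, `last_completed_step`) are illustrative assumptions and not part of the repository. Note that a folder existing only suggests the step was started, not that it finished.

```python
from pathlib import Path

# Folder names are taken from the repository tree; the descriptions are an
# interpretation of the parenthesized step numbers (assumption).
PIPELINE_FOLDERS = [
    (1, "training_data", "preprocessed features"),
    (2, "logs-Tacotron", "Tacotron training logs"),
    (3, "tacotron_output", "Tacotron synthesis output"),
    (4, "logs-Wavenet", "WaveNet training logs"),
    (5, "wavenet_output", "WaveNet synthesis output"),
]


def pipeline_status(repo_root="."):
    """Return a list of (step, folder, exists) tuples for the repo at repo_root."""
    root = Path(repo_root)
    return [(step, name, (root / name).is_dir()) for step, name, _ in PIPELINE_FOLDERS]


def last_completed_step(repo_root="."):
    """Return the highest consecutive step whose folder exists, or 0 if none."""
    done = 0
    for step, _, exists in pipeline_status(repo_root):
        if not exists:
            break
        done = step
    return done


if __name__ == "__main__":
    for step, name, exists in pipeline_status():
        print(f"({step}) {name}: {'found' if exists else 'missing'}")
```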

Model Architecture:

The model described by the authors can be divided into two parts: the spectrogram (feature) prediction network and the WaveNet vocoder.

For an in-depth exploration of the model architecture, training procedure and preprocessing logic, refer to our wiki.

Current state:

For an overview of our progress on this project, please refer to this discussion.

Since the two parts of the global model are trained separately, we can start by training the feature prediction model and use its predictions later during WaveNet training.

How to start

First, you need Python 3 installed along with TensorFlow.

Next, install the requirements. If you are an Anaconda user, run the command as written; otherwise replace pip with pip3 and python with python3:

pip install -r requirements.txt
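
Before training, it can help to confirm that the prerequisites named above are actually available. The following is a minimal sketch using only the standard library; the function name `check_environment` is hypothetical, and it only checks that a module can be located, not that its version matches what requirements.txt expects.

```python
import importlib.util
import sys


def check_environment(required_modules=("tensorflow",)):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if sys.version_info.major < 3:
        problems.append("Python 3 is required, found %d.%d" % sys.version_info[:2])
    for name in required_modules:
        # find_spec locates a top-level module without importing it.
        if importlib.util.find_spec(name) is None:
            problems.append("module %r is not installed" % name)
    return problems


if __name__ == "__main__":
    issues = check_environment()
    print("environment OK" if not issues else "\n".join(issues))
```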

Dataset:

We tested the code on the LJSpeech dataset, which has almost 24 hours of labeled recordings from a single female speaker. (Further information on the dataset is available in the README file included in the download.)

We are also currently running tests on the new M-AILABS speech dataset, which contains more than 700 hours of speech (more than 80 GB of data) in more than 10 languages.

After downloading the dataset, extract the compressed file, and place the folder inside the cloned repository.
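
To sanity-check the extracted LJSpeech folder before preprocessing, a sketch like the one below can list clips whose audio is missing. It assumes the layout documented in the LJSpeech README (a pipe-separated `metadata.csv` next to the `wavs` folder, with clip ID, transcription and normalized transcription per line); the helper names are hypothetical and this is not the repository's own loader.

```python
from pathlib import Path


def read_ljspeech_metadata(dataset_dir):
    """Yield (wav_path, text) pairs from <dataset_dir>/metadata.csv."""
    dataset_dir = Path(dataset_dir)
    with open(dataset_dir / "metadata.csv", encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("|")
            if len(parts) < 2:
                continue  # skip malformed or empty lines
            clip_id = parts[0]
            # Prefer the normalized transcription (third field) when present.
            text = parts[2] if len(parts) > 2 and parts[2] else parts[1]
            yield dataset_dir / "wavs" / (clip_id + ".wav"), text


def missing_wavs(dataset_dir):
    """Return the wav paths listed in the metadata that are not on disk."""
    return [p for p, _ in read_ljspeech_metadata(dataset_dir) if not p.is_file()]
```

For example, `missing_wavs("LJSpeech-1.1")` run from the repository root should return an empty list when the dataset was extracted completely.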

Preprocessing

Before running the following steps, please make sure you are inside the Tacotron-2 folder:

cd Tacotron-2

Preprocessing can then be started using:

python preprocess.py

The dataset can be chosen using the --dataset argument; the default is LJSpeech. If you use the M-AILABS dataset, you also need to provide the language, voice, reader, merge_books and book arguments to match your needs.

Example M-AILABS:

python preprocess.py --dataset='M-AILABS' --language='en_US' --voice='female' --reader='mary_ann' --merge_books=False --book='northandsouth'

or if you want to use all books for a single speaker:

python preprocess.py --dataset='M-AILABS' --language='en_US' --voice='female' --reader='mary_ann' --merge_books=True

This should take no longer than a few minutes.
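
The flags in the examples above can be mirrored in a small argument parser, sketched here for illustration only. The flag names come from the commands above; the defaults, the `str2bool` helper and the validation rule (that `--book` is needed only when books are not merged, inferred from the two examples) are assumptions and may differ from what preprocess.py actually does. The sketch also shows why a value such as `--merge_books=False` needs explicit parsing: in Python, `bool("False")` is `True`.

```python
import argparse


def str2bool(value):
    """Parse 'True'/'False'-style strings, since bool('False') is True in Python."""
    lowered = value.lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError("expected a boolean, got %r" % value)


def build_parser():
    parser = argparse.ArgumentParser(description="Hypothetical preprocess-style CLI")
    parser.add_argument("--dataset", default="LJSpeech")  # placeholder default
    parser.add_argument("--language", default=None)
    parser.add_argument("--voice", default=None)
    parser.add_argument("--reader", default=None)
    parser.add_argument("--merge_books", type=str2bool, default=False)
    parser.add_argument("--book", default=None)
    return parser


def validate(args):
    """Return a list of problems with the parsed arguments (assumed rules)."""
    problems = []
    if args.dataset == "M-AILABS":
        for name in ("language", "voice", "reader"):
            if getattr(args, name) is None:
                problems.append("--%s is required for M-AILABS" % name)
        if not args.merge_books and args.book is None:
            problems.append("--book is required unless --merge_books=True")
    return problems
```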

Training:

To train both models sequentially (one after the other):

python train.py --model='Tacotron-2'

or:

python train.py --model='Both'

The feature prediction model can be trained separately using:

python train.py --model='Tacotron'

Checkpoints are saved every 250 steps and stored under the logs-Tacotron folder.

Similarly, the WaveNet can be trained separately using:

python train.py --model='WaveNet'

Logs will be stored inside the logs-Wavenet folder.

Synthesis

To synthesize audio end-to-end (text to audio, with both models at work):

python synthesize.py --model='Tacotron-2'

For the spectrogram prediction network (separately), there are three types of mel spectrogram synthesis: evaluation, natural synthesis (--GTA=False) and ground-truth-aligned synthesis (--GTA=True):

python synthesize.py --model='Tacotron' --mode='eval'

python synthesize.py --model='Tacotron' --GTA=False

python synthesize.py --model='Tacotron' --GTA=True
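
The difference between the two GTA settings can be illustrated with a toy autoregressive decoder. This sketch assumes that GTA means ground-truth-aligned (teacher-forced) synthesis, where each decoder step is fed the previous ground-truth frame, while the other mode feeds the decoder its own previous output. The `decode` and `toy_step` functions are purely illustrative and bear no relation to the repository's model code.

```python
def decode(step_fn, first_input, n_steps, targets=None):
    """Run a toy autoregressive decoder.

    If targets is given, each step is fed the previous ground-truth frame
    (teacher forcing, the GTA-style mode); otherwise each step is fed the
    decoder's own previous output (natural, free-running mode).
    """
    outputs = []
    prev = first_input
    for t in range(n_steps):
        out = step_fn(prev)
        outputs.append(out)
        prev = targets[t] if targets is not None else out
    return outputs


def toy_step(prev_frame):
    # A deliberately imperfect "model" of a sequence that grows by 1.0 per frame.
    return prev_frame + 0.5


targets = [1.0, 2.0, 3.0, 4.0]
natural = decode(toy_step, 0.0, len(targets))               # [0.5, 1.0, 1.5, 2.0]
gta = decode(toy_step, 0.0, len(targets), targets=targets)  # [0.5, 1.5, 2.5, 3.5]
```

In the free-running case the error against `targets` grows frame by frame, while with teacher forcing it stays at 0.5 per frame and the outputs remain aligned with the ground truth, which is presumably why such outputs are convenient as vocoder training inputs.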

Synthesizing waveforms conditioned on previously synthesized mel spectrograms (separately) can be done with:

python synthesize.py --model='WaveNet'
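
For the separate, step-by-step workflow, the commands in this README can be chained from a small driver script. The sketch below is an assumption about a sensible ordering (preprocess, train Tacotron, GTA synthesis, train WaveNet, WaveNet synthesis), not a script shipped with the repository; `run_pipeline` is a hypothetical helper and defaults to a dry run that only returns the commands.

```python
import subprocess
import sys

# Commands are copied from this README; the ordering is an interpretation
# of the step-by-step workflow, not a script from the repository.
PIPELINE = [
    ["preprocess.py"],
    ["train.py", "--model=Tacotron"],
    ["synthesize.py", "--model=Tacotron", "--GTA=True"],
    ["train.py", "--model=WaveNet"],
    ["synthesize.py", "--model=WaveNet"],
]


def run_pipeline(steps=PIPELINE, python=sys.executable, dry_run=True):
    """Build the command for each step and, unless dry_run, run them in order.

    With dry_run=True (the default) the commands are only returned, not run.
    """
    commands = [[python] + step for step in steps]
    if not dry_run:
        for cmd in commands:
            # check=True raises CalledProcessError on a non-zero exit code,
            # which stops the pipeline at the first failing step.
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    for cmd in run_pipeline():
        print(" ".join(cmd))
```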

Pretrained model and Samples:

Pre-trained models and audio samples will be added at a later date. You can, however, check some preliminary insights into the model's performance (at early stages of training) here.
