Unofficial TensorFlow Implementation of "Hierarchical Generative Modeling for Controllable Speech Synthesis"
Tacotron-2
├── datasets
├── LJSpeech-1.1 (0)
│ └── wavs
├── logs-Tacotron (2)
│ ├── mel-spectrograms
│ ├── plots
│ ├── pretrained
│ └── wavs
├── papers
├── tacotron
│ ├── models
│ └── utils
├── tacotron_output (3)
│ ├── eval
│ ├── gta
│ ├── logs-eval
│ │ ├── plots
│ │ └── wavs
│ └── natural
└── training_data (1)
├── audio
└── mels
The tree above shows the current state of the repository; the numbers mark the step that produces each folder: (0) the dataset, (1) preprocessing output, (2) training logs, and (3) synthesis output.
First, you need Python 3.5 installed along with TensorFlow v1.6.
Next, install the requirements:
pip install -r requirements.txt
or
pip3 install -r requirements.txt
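To verify the environment before going further, a quick check like the following can help (a minimal sketch; it only assumes the tensorflow package imports cleanly):

import sys
import tensorflow as tf

# Confirm the interpreter and framework versions this repo was tested with.
print(sys.version)       # expect 3.5.x
print(tf.__version__)    # expect 1.6.x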
This repo was tested on the LJSpeech dataset, which contains almost 24 hours of labeled speech recorded by a single female speaker.
Before running the following steps, please make sure you are inside the Tacotron-2 folder:
cd Tacotron-2
Preprocessing can then be started using:
python preprocess.py
or
python3 preprocess.py
The dataset can be chosen using the --dataset argument. The default is LJSpeech.
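For intuition, preprocessing essentially turns each wav into a log-mel spectrogram saved under training_data/mels (see the tree above). The sketch below shows the idea with librosa; the hyperparameters (n_fft, hop length, mel floor) are common Tacotron defaults and not necessarily this repo's exact settings:

import librosa
import numpy as np

# Load one LJSpeech utterance at its native 22.05 kHz sample rate.
wav, sr = librosa.load("LJSpeech-1.1/wavs/LJ001-0001.wav", sr=22050)

# 80-band mel spectrogram with typical Tacotron frame settings.
mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=1024,
                                     hop_length=256, n_mels=80)

# Log-compress with a small floor, then store as (frames, 80).
log_mel = np.log(np.maximum(mel, 1e-5))
np.save("training_data/mels/mel-LJ001-0001.npy", log_mel.T)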
The feature prediction model can be trained using:
python train.py --model='Tacotron'
or
python3 train.py --model='Tacotron'
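Training logs, checkpoints, and intermediate plots are written under logs-Tacotron (see the tree above). For context, the paper extends Tacotron with a hierarchical latent space for controllability; the following is a purely conceptual sketch of that two-level sampling (a discrete component choice, then a continuous latent drawn from it), not this repo's code:

import numpy as np

rng = np.random.default_rng(0)
K, D = 10, 16                         # mixture components and latent size (illustrative)
mu = rng.normal(size=(K, D))          # per-component means; learned by the model in practice
y = rng.integers(K)                   # level 1: pick a discrete attribute component
z = mu[y] + 0.1 * rng.normal(size=D)  # level 2: draw z ~ N(mu_y, sigma^2 I)
print(y, z[:4])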
There are three types of mel spectrogram synthesis for the spectrogram prediction network (Tacotron); their outputs correspond to the eval, gta, and natural folders under tacotron_output shown in the tree above. For example, synthesis in eval mode with a reference audio:
python synthesize.py --model='Tacotron' --mode='eval' --reference_audio='ref_1.wav'
or
python3 synthesize.py --model='Tacotron' --mode='eval' --reference_audio='ref_1.wav'
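To listen to a predicted mel spectrogram without a neural vocoder, it can be inverted approximately with Griffin-Lim. The sketch below assumes the synthesized mels are saved as .npy log-mel arrays under tacotron_output/eval; the file name and hyperparameters are assumptions (they must match whatever preprocessing used), not this repo's exact conventions:

import librosa
import numpy as np
import soundfile as sf

# Load a synthesized log-mel (assumed shape: frames x 80) and undo the log.
log_mel = np.load("tacotron_output/eval/mel-0.npy").T
mel = np.exp(log_mel)

# Approximate the linear spectrogram, then recover a waveform with Griffin-Lim.
linear = librosa.feature.inverse.mel_to_stft(mel, sr=22050, n_fft=1024)
wav = librosa.griffinlim(linear, hop_length=256)
sf.write("griffin_lim_sample.wav", wav, 22050)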
Note: a reference audio is required in eval mode. The Blizzard 2013 voice dataset can also be used, though the authors of the paper used 105 hrs of the Blizzard Challenge 2013 dataset.

TODO: WaveNet as well as WaveRNN vocoder support. Work in progress.