Unofficial implementation of "Learning Latent Representations for Style Control and Transfer in End-to-End Speech Synthesis"
Tacotron-2
├── datasets
├── LJSpeech-1.1 (0)
│ └── wavs
├── logs-Tacotron (2)
│ ├── mel-spectrograms
│ ├── plots
│ ├── pretrained
│ └── wavs
├── papers
├── tacotron
│ ├── models
│ └── utils
├── tacotron_output (3)
│ ├── eval
│ ├── gta
│ ├── logs-eval
│ │ ├── plots
│ │ └── wavs
│ └── natural
└── training_data (1)
├── audio
└── mels
The previous tree shows the current state of the repository.
First, you need to have Python 3.5 installed along with TensorFlow v1.6.
Next, you can install the requirements:
pip install -r requirements.txt
or:
pip3 install -r requirements.txt
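Since the repo expects TensorFlow v1.6, you can check which version is actually picked up with a quick one-liner:
python -c "import tensorflow as tf; print(tf.__version__)"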
This repo was tested on the LJSpeech dataset, which has almost 24 hours of labeled recordings of a single female speaker.
Before running the following steps, please make sure you are inside the Tacotron-2 folder:
cd Tacotron-2
Preprocessing can then be started using:
python preprocess.py
or
python3 preprocess.py
The dataset can be chosen using the --dataset argument. The default is LJSpeech.
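For example, to pass the dataset explicitly (the exact accepted value is an assumption here; check the argument parser in preprocess.py for the supported names):
python preprocess.py --dataset='LJSpeech-1.1'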
The feature prediction model can be trained using:
python train.py --model='Tacotron'
or
python3 train.py --model='Tacotron'
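Checkpoints, plots, and intermediate wavs are written under logs-Tacotron (see the tree above). Assuming the training script also writes TensorFlow event summaries to that folder, progress can be followed with:
tensorboard --logdir logs-Tacotron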
There are three types of mel spectrogram synthesis for the spectrogram prediction network (Tacotron), matching the tacotron_output subfolders in the tree above (eval, gta, and natural). For example, evaluation mode with a reference audio:
python synthesize.py --model='Tacotron' --mode='eval' --reference_audio='ref_1.wav'
or
python3 synthesize.py --model='Tacotron' --mode='eval' --reference_audio='ref_1.wav'
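For intuition: in the paper, the reference audio is consumed as a mel spectrogram by a VAE reference encoder, which produces a latent style embedding that conditions the Tacotron decoder. Below is a minimal, illustrative Python sketch of extracting such a mel spectrogram with librosa; this is not the repo's exact feature pipeline, and the parameter values (sample rate, n_fft, hop_length, n_mels) are assumptions.
import librosa
import numpy as np

# Load the reference waveform (22050 Hz is an assumed sample rate)
wav, sr = librosa.load('ref_1.wav', sr=22050)
# Compute an 80-band mel spectrogram (illustrative STFT parameters)
mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=1024,
                                     hop_length=256, n_mels=80)
# Convert power to log scale, as reference encoders typically expect
log_mel = librosa.power_to_db(mel, ref=np.max)
print(log_mel.shape)  # (80, num_frames)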
Note:
- The --reference_audio argument is only used in eval mode.
- The author of the paper used 105 hrs of the Blizzard Challenge 2013 voice dataset, while this repo was tested on LJSpeech.
- WaveNet as well as WaveRNN vocoder support: TODO.

Claimed samples from the research paper: http://home.ustc.edu.cn/~zyj008/ICASSP2019
Work in progress