VCTK multi-speaker tacotron for ICASSP 2020

multi-speaker-tacotron

This is an implementation of our paper from ICASSP 2020:
"Zero-Shot Multi-Speaker Text-To-Speech with State-of-the-art Neural Speaker Embeddings," by Erica Cooper, Cheng-I Lai, Yusuke Yasuda, Fuming Fang, Xin Wang, Nanxin Chen, and Junichi Yamagishi.
https://arxiv.org/abs/1910.10838
Please cite this paper if you use this code.

Audio samples can be found here: https://nii-yamagishilab.github.io/samples-multi-speaker-tacotron/

Dependencies:

We recommend setting up a Miniconda environment for Tacotron. Installers are available at https://repo.anaconda.com

conda create -n taco python=3.6.8
conda activate taco
conda install tensorflow-gpu scipy matplotlib docopt hypothesis pyspark unidecode
conda install -c conda-forge librosa
pip install inflect pysptk
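After installing, it can be useful to confirm that the packages are visible from the `taco` environment. The following is a minimal sketch, not part of this repository; it assumes that each package's import name is the same as its install name listed above.

```python
import importlib.util

# ASSUMPTION: the import names match the package names installed above.
REQUIRED_MODULES = [
    "tensorflow", "scipy", "matplotlib", "docopt", "hypothesis",
    "pyspark", "unidecode", "librosa", "inflect", "pysptk",
]


def missing_modules(names):
    """Return the top-level module names that cannot be located."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules were found.")
```

Run it with the `taco` environment activated; any module it reports as missing can be installed with the corresponding command above.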

Install this repository

git clone https://github.com/nii-yamagishilab/multi-speaker-tacotron.git external/multi_speaker_tacotron

Install Tacotron dependencies if you don't have them already:

mkdir -p external
git clone https://github.com/nii-yamagishilab/tacotron2.git external/tacotron2
git clone https://github.com/nii-yamagishilab/self-attention-tacotron.git external/self_attention_tacotron

Note that hyphens are replaced with underscores in the directory names; this is necessary because “-” is not a valid character in a Python module or package name.
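To illustrate why the renaming matters: a statement such as `import multi-speaker-tacotron` is a syntax error, because a name used in an import statement must be a valid Python identifier. The sketch below is illustrative only; `importable_name` is a hypothetical helper, not a function from this repository.

```python
def importable_name(repo_name):
    """Map a repository name to a directory name usable in an import statement."""
    candidate = repo_name.replace("-", "_")
    if not candidate.isidentifier():
        raise ValueError(f"{candidate!r} is still not a valid Python identifier")
    return candidate


for repo in ["multi-speaker-tacotron", "tacotron2", "self-attention-tacotron"]:
    # The hyphenated names fail str.isidentifier(); the mapped names pass.
    print(repo, "->", importable_name(repo))
```

The mapped names match the target directories used in the git clone commands above.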

Next, download the project data and models. They were originally hosted in a Dropbox folder (https://www.dropbox.com/sh/rq4lebus0n8tmso/AACldbmKDPRN9YiXrRROjtTSa?dl=0) but have been moved to Zenodo: https://zenodo.org/record/6349897#.YkKR-C8Rr0o

Training from scratch using only the VCTK data is possible with the script train_from_scratch.sh; this does not require the Nancy pre-trained model, which we are unable to share due to licensing restrictions.

To use our pre-trained WaveNet models, you will also need our WaveNet implementation, which can be found here: https://github.com/nii-yamagishilab/project-CURRENNT-scripts

To obtain embeddings for new samples, you will need the neural speaker embedding code, which can be found here: https://github.com/jefflai108/pytorch-kaldi-neural-speaker-embeddings

How to use

See the scripts warmup.sh (warm start training), train_from_scratch.sh (train on VCTK data only), and predictmel.sh (prediction). The scripts assume a SLURM-type computing environment. You will need to change the paths to match your environment and point to your data. Here are the parameters relevant to multi-speaker TTS:

The scripts are set up with embedding_file="vctk-x-vector.txt",speaker_embedding_dim='200', which selects the default x-vectors. To use the LDE embeddings from our best system, change this to embedding_file="vctk-lde-3.txt",speaker_embedding_dim='512'.
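Because embedding_file and speaker_embedding_dim must agree, a quick consistency check before training may save a failed job. The sketch below is hypothetical and not part of this repository. In particular, the file layout is an assumption: each non-empty line is taken to be a speaker ID followed by whitespace-separated floats, optionally wrapped in Kaldi-style "[" and "]" tokens. Adjust the parsing if your embedding files differ.

```python
def embedding_dims(lines):
    """Return the set of vector lengths found in the given embedding lines.

    ASSUMPTION: each non-empty line is "<speaker_id> v1 v2 ... vN",
    optionally with "[" and "]" tokens around the values.
    """
    dims = set()
    for line in lines:
        tokens = [t for t in line.split() if t not in ("[", "]")]
        if not tokens:
            continue  # skip blank lines
        values = [float(t) for t in tokens[1:]]  # tokens[0] is the speaker ID
        dims.add(len(values))
    return dims


def check_embedding_dim(lines, speaker_embedding_dim):
    """Raise ValueError unless every vector has the expected dimension."""
    expected = int(speaker_embedding_dim)  # the scripts pass this as a string
    dims = embedding_dims(lines)
    if dims != {expected}:
        raise ValueError(f"expected dimension {expected}, found {sorted(dims)}")
```

For example, `check_embedding_dim(open("vctk-x-vector.txt"), "200")` would raise an error if any line does not hold a 200-dimensional vector under the assumed layout.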

Acknowledgments

This work was partially supported by a JST CREST Grant (JPMJCR18A6, VoicePersonae project), Japan, and by MEXT KAKENHI Grants (16H06302, 17H04687, 18H04120, 18H04112, 18KT0051, 19K24372), Japan. The numerical calculations were carried out on the TSUBAME 3.0 supercomputer at the Tokyo Institute of Technology.

Licence

BSD 3-Clause License

Copyright (c) 2020, Yamagishi Laboratory, National Institute of Informatics
All rights reserved.

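Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.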

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.