SC-CNN: Effective Speaker Conditioning Method for Zero-Shot Multi-Speaker Text-to-Speech Systems
Our code is built on the StyleSpeech and VITS implementations; thanks to their authors.
- The VCTK dataset is used.
- The sampling rate is set to 22050 Hz.
- This is the implementation of SC-TransferTTS.
Materials
Prerequisites
- Clone this repository.
- Install the Python requirements. Please refer to requirements.txt.
- You may need to install espeak first:
apt-get install espeak
- Download datasets
- Download and extract the VCTK dataset, and downsample wav files to 22050 Hz. Then rename or create a link to the dataset folder:
ln -s /path/to/VCTK-Corpus/downsampled_wavs DUMMY3
- Build Monotonic Alignment Search and run preprocessing if you use your own datasets.
# Cython version of Monotonic Alignment Search
cd monotonic_align
python setup.py build_ext --inplace
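Before preprocessing or training, it can help to confirm that every file under the linked dataset folder really is at 22050 Hz. The following is a minimal sketch, not part of this repository: the helper name `find_wrong_rate` is hypothetical, and it relies on Python's standard `wave` module, which only reads uncompressed PCM WAV files.

```python
import wave
from pathlib import Path


def find_wrong_rate(root, expected_hz=22050):
    """Return (path, rate) pairs for WAV files whose sample rate differs from expected_hz.

    Hypothetical helper for illustration; not part of this repository.
    """
    mismatches = []
    for wav_path in sorted(Path(root).rglob("*.wav")):
        # wave.open raises wave.Error for files that are not PCM WAV.
        with wave.open(str(wav_path), "rb") as wav_file:
            rate = wav_file.getframerate()
        if rate != expected_hz:
            mismatches.append((str(wav_path), rate))
    return mismatches


if __name__ == "__main__":
    # DUMMY3 is the link created with `ln -s` in the prerequisites.
    for path, rate in find_wrong_rate("DUMMY3"):
        print(f"{path}: {rate} Hz")
```

An empty result means every readable WAV file already matches the expected rate; any listed file still needs downsampling.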
Training Example
python train.py -c configs/vctk_base.json -m vctk_base
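Since the dataset is expected at 22050 Hz, checking the config before launching training can catch a mismatch early. This is a sketch under an assumption: that configs/vctk_base.json follows the VITS layout, with the rate stored under `data.sampling_rate`. The helper name `config_sampling_rate` is hypothetical and not part of this repository.

```python
import json


def config_sampling_rate(config_path):
    """Read the sampling rate from a VITS-style JSON config.

    Assumes the value lives under data.sampling_rate (an assumption about
    the config layout); returns None if that key is absent.
    """
    with open(config_path, "r", encoding="utf-8") as f:
        config = json.load(f)
    return config.get("data", {}).get("sampling_rate")
```

If the assumption holds, `config_sampling_rate("configs/vctk_base.json")` should match the 22050 Hz rate of the downsampled dataset; a `None` result suggests the config uses a different layout.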
Inference Example
See inference.ipynb