hcy71o / SC-CNN

SC-CNN: Effective Speaker Conditioning Method for Zero-Shot Multi-Speaker Text-to-Speech Systems
MIT License
acoustic-model feature-extractor multi-speaker-tts speech-synthesis text-to-speech tts zero-shot

This code is built on the StyleSpeech and VITS implementations; thanks to their authors.

  1. The VCTK dataset is used.
  2. The sampling rate is set to 22050 Hz.
  3. This is the implementation of SC-TransferTTS.

Materials

Prerequisites

  1. Clone this repository.
  2. Install the Python requirements; see requirements.txt.
    1. You may need to install espeak first: apt-get install espeak
  3. Download datasets
    1. Download and extract the VCTK dataset, and downsample wav files to 22050 Hz. Then rename or create a link to the dataset folder: ln -s /path/to/VCTK-Corpus/downsampled_wavs DUMMY3
  4. Build Monotonic Alignment Search and run preprocessing if you use your own datasets.
    # Cython-version Monotonic Alignment Search
    cd monotonic_align
    python setup.py build_ext --inplace
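
Before training, it can be worth confirming that the downsampled files are actually at 22050 Hz. The following is a minimal sketch and not part of this repository: the function name `find_wrong_rate` is hypothetical, and it assumes the files are uncompressed PCM WAV, which is what Python's standard `wave` module can read.

```python
import wave
from pathlib import Path

TARGET_SR = 22050  # sampling rate stated in this README


def find_wrong_rate(root, target_sr=TARGET_SR):
    """Return (path, rate) pairs for .wav files under root whose rate differs from target_sr.

    Hypothetical helper for illustration; it only inspects WAV headers.
    """
    mismatched = []
    for path in sorted(Path(root).rglob("*.wav")):
        # wave.open reads uncompressed PCM WAV only; other encodings raise wave.Error
        with wave.open(str(path), "rb") as wav:
            rate = wav.getframerate()
        if rate != target_sr:
            mismatched.append((path, rate))
    return mismatched
```

For example, `find_wrong_rate("DUMMY3")` lists every file under the linked dataset folder that still needs resampling; an empty list means all WAV headers report 22050 Hz.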

Training Example

    python train.py -c configs/vctk_base.json -m vctk_base

Inference Example

See inference.ipynb