This is the official implementation of the paper One-shot Voice Conversion by Separating Speaker and Content Representations with Instance Normalization. By learning speaker and content representations separately, we can achieve one-shot VC with only one utterance from the source speaker and one utterance from the target speaker. You can find the demo webpage here, download the pretrained model from here, and download the corresponding normalization parameters for inference from here.
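To make the idea of separating speaker and content with instance normalization concrete, here is a minimal sketch in plain Python. It is not the code in this repository. It assumes features are stored as a list of channels, each a list of values over time, and it reuses the target utterance's per-channel statistics directly, whereas the model in the paper learns the scale and bias from a speaker encoder. The function names are hypothetical.

```python
import math

def instance_norm(features, eps=1e-5):
    """Normalize each channel over time.

    features: list of channels, each a list of floats over time.
    Returns the normalized features and the (mean, std) removed per channel.
    """
    normalized, stats = [], []
    for channel in features:
        mean = sum(channel) / len(channel)
        var = sum((x - mean) ** 2 for x in channel) / len(channel)
        std = math.sqrt(var + eps)
        normalized.append([(x - mean) / std for x in channel])
        stats.append((mean, std))
    return normalized, stats

def adaptive_instance_norm(content, target_stats):
    """Re-scale normalized content with another utterance's channel statistics.

    Assumption: target statistics stand in for the learned scale and bias.
    """
    return [
        [x * std + mean for x in channel]
        for channel, (mean, std) in zip(content, target_stats)
    ]

# Toy "utterances": 2 channels, 4 time steps each (made-up numbers).
source = [[1.0, 2.0, 3.0, 4.0], [10.0, 10.5, 11.0, 11.5]]
target = [[-2.0, 0.0, 2.0, 4.0], [0.0, 1.0, 0.0, 1.0]]

content, _ = instance_norm(source)        # global, time-invariant statistics removed
_, target_stats = instance_norm(target)   # statistics that carry the target's "style"
converted = adaptive_instance_norm(content, target_stats)
```

The converted features keep the time-varying shape of the source while taking on the per-channel mean and scale of the target, which is the intuition behind using one utterance from each speaker.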
The implementation differs slightly from the paper in ways that I found useful for stabilizing training or improving audio quality. However, because evaluating these changes requires human evaluation, we updated only the code and not the paper. The differences are listed below:
We provide preprocessing scripts for two datasets: VCTK and LibriTTS. The download links are below.
The experiments in the paper were done on VCTK.
The preprocessing code is in preprocess/.
The configuration for preprocessing is in preprocess/libri.config or preprocess/vctk.config, depending on which dataset you use.
where:
LibriTTS/ or VCTK-Corpus/.
Once you have edited the config file, you can run preprocess_vctk.sh or preprocess_libri.sh to preprocess the dataset.
You can also change the feature extraction config in preprocess/tacotron/hyperparams.py.
The default arguments can be found in train.sh. The usage of each argument is listed below.
config.yaml
.train
if the file is train.pkl). Default: train
train_samples_128.json
You can use inference.py to run inference.
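Inference relies on the normalization parameters that are downloaded alongside the pretrained model. The sketch below shows how such parameters are typically applied. It assumes they are a per-dimension mean and standard deviation, which is not stated in this README. The function names and all values are hypothetical and are not taken from inference.py.

```python
def normalize(frames, mean, std):
    """Standardize each feature dimension before feeding frames to the model."""
    return [[(x - m) / s for x, m, s in zip(frame, mean, std)] for frame in frames]

def denormalize(frames, mean, std):
    """Undo the standardization on the model output before synthesizing audio."""
    return [[x * s + m for x, m, s in zip(frame, mean, std)] for frame in frames]

# Made-up statistics for a 3-dimensional feature (assumption, for illustration only).
mean = [0.5, -1.0, 2.0]
std = [2.0, 0.5, 1.0]

# Two made-up feature frames.
frames = [[2.5, -0.5, 2.0], [0.5, -1.5, 4.0]]

model_input = normalize(frames, mean, std)
restored = denormalize(model_input, mean, std)
```

Using the same statistics at training and inference time matters: mismatched parameters shift the input distribution seen by the model.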
Please cite our paper if you find this repository useful.
@article{chou2019one,
title={One-shot Voice Conversion by Separating Speaker and Content Representations with Instance Normalization},
author={Chou, Ju-chieh and Yeh, Cheng-chieh and Lee, Hung-yi},
journal={arXiv preprint arXiv:1904.05742},
year={2019}
}
If you have any questions about the paper or the code, feel free to email me at jjery2243542@gmail.com.