Code for the paper "Improving the Speaker Identity of Non-Parallel Many-to-Many Voice Conversion with Adversarial Speaker Recognition"
Shaojin Ding, Guanlong Zhao, Ricardo Gutierrez-Osuna
Accepted by INTERSPEECH 2020
This is a TensorFlow + PyTorch implementation, adapted from the Real-Time-Voice-Cloning project at https://github.com/CorentinJ/Real-Time-Voice-Cloning.
To install the dependencies, run:

pip install -r requirements.txt
We use the pretrained speaker encoder model and vocoder model from here; we only train the voice conversion model (i.e., the synthesizer).
Before running, place the speaker encoder at encoder/saved_models/pretrained.pt and the vocoder at vocoder/saved_models/pretrained/pretrained.pt.
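For reference, a sketch of where the two pretrained models should end up (paths copied from above):

encoder/
    saved_models/
        pretrained.pt
vocoder/
    saved_models/
        pretrained/
            pretrained.pt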
Arrange the VCTK data so that training and test speakers are in separate folders, e.g.:

<datasets_root>/VCTK/train/p227
<datasets_root>/VCTK/test/p228
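A minimal sketch of the expected tree, with illustrative VCTK-style wav file names (the actual file names in your copy of VCTK may differ):

<datasets_root>/
    VCTK/
        train/
            p227/
                p227_001.wav
                p227_002.wav
                ...
        test/
            p228/
                p228_001.wav
                ...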
Then preprocess the audio and compute the speaker embeddings:

python synthesizer_preprocess_audio.py <datasets_root>
python synthesizer_preprocess_embeds.py <datasets_root>/SV2TTS/synthesizer_train
python synthesizer_preprocess_embeds.py <datasets_root>/SV2TTS/synthesizer_test
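For example, with the dataset under ~/datasets (a hypothetical path; substitute your own), the three preprocessing steps become:

python synthesizer_preprocess_audio.py ~/datasets
python synthesizer_preprocess_embeds.py ~/datasets/SV2TTS/synthesizer_train
python synthesizer_preprocess_embeds.py ~/datasets/SV2TTS/synthesizer_test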
To launch training:
python synthesizer_train.py vc_adversarial <datasets_root>/SV2TTS/synthesizer_train
To run inference, use synthesis_ppg_script.py. Change syn_dir in the script to the path of the trained model, e.g., synthesizer/saved_models/logs-train_adversarial_vctk/taco_pretrained.
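For example, the edit inside synthesis_ppg_script.py would look roughly like this (the variable name comes from the note above; the exact line in the script may differ):

# Point syn_dir at the directory containing the trained synthesizer checkpoint.
syn_dir = "synthesizer/saved_models/logs-train_adversarial_vctk/taco_pretrained"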
The code is adapted from CorentinJ/Real-Time-Voice-Cloning.
If you use this code, please cite:

@inproceedings{dingimproving,
  title={Improving the Speaker Identity of Non-Parallel Many-to-Many Voice Conversion with Adversarial Speaker Recognition},
  author={Ding, Shaojin and Zhao, Guanlong and Gutierrez-Osuna, Ricardo},
  booktitle={Proc. Interspeech 2020},
  year={2020}
}