Emo-StarGAN
This repository contains the source code for the paper Emo-StarGAN: A Semi-Supervised Any-to-Many Non-Parallel Emotion-Preserving Voice Conversion, accepted at Interspeech 2023. An overview of the method and the results can be found here.
Highlights:
- Emo-StarGAN: An emotion-preserving deep semi-supervised voice conversion-based speaker anonymisation method is proposed.
- Emotion supervision techniques are proposed: (a) direct, using an emotion classifier; (b) indirect, using losses that leverage acoustic features and deep features representing the emotional content of the source and converted samples.
- The indirect techniques can also be used in the absence of emotion labels.
- Experiments demonstrate its generalisability on benchmark datasets, across different accents, genders, emotions and cross-corpus conversions.
Samples
Samples can be found here.
Demo
The demo can be found at Demo/EmoStarGAN Demo.ipynb.
Pre-requisites:
- Python >= 3.9
- Install the Python dependencies listed in requirements.txt
Training:
Before Training
- Before starting the training, please specify the number of target speakers in num_speaker_domains and other details, such as the training and validation data, in the config file.
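As a rough sketch, the relevant part of the config might look like the following; apart from num_speaker_domains, the key names here are assumptions and should be checked against Configs/speaker_domain_config.yml:

```yaml
# Illustrative excerpt of Configs/speaker_domain_config.yml
# (only num_speaker_domains is confirmed by this README; the other
# key names are assumptions)
num_speaker_domains: 20          # number of target speakers
train_data: Data/train_list.txt  # training list
val_data: Data/val_list.txt      # validation list
```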
- Download the VCTK and ESD datasets. For VCTK, preprocessing is needed, which can be carried out using Preprocess/getdata.py. The dataset paths need to be adjusted in the training list train_list.txt and the validation list val_list.txt, both present in Data/.
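Following the StarGANv2-VC convention this work builds on, each list line is presumably a wav path and a 0-based speaker index separated by a pipe; this format is an assumption, so verify it against the lists shipped in Data/:

```text
# Assumed line format: <path to wav>|<speaker index>
./Data/VCTK/p225/p225_001.wav|0
./Data/VCTK/p226/p226_001.wav|1
```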
- Download and copy the emotion embeddings weights to the folder Utils/emotion_encoder
- Download and copy the vocoder weights to the folder Utils/Vocoder
Train
python train.py --config_path ./Configs/speaker_domain_config.yml
Model Weights
The Emo-StarGAN model weights can be downloaded from here.
Common Errors
When a speaker index in train_list.txt or val_list.txt is greater than or equal to the number of speakers (the hyperparameter num_speaker_domains in speaker_domain_config.yml), the following error is encountered:
[train]: 0%| | 0/66 [00:00<?, ?it/s]../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [0,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
Also note that the speaker index starts with 0 (not with 1!) in the training and validation lists.
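A small sanity check can catch this before training starts. The helper below is hypothetical (not part of the repository) and assumes the StarGANv2-VC-style "path|speaker_index" line format:

```python
# Hypothetical helper, not part of the repository: validates that every
# speaker index in a train/val list is a valid 0-based index for
# num_speaker_domains, catching the CUDA index-out-of-bounds assertion early.
# Assumes lines of the form "<path>|<speaker index>".

def check_speaker_indices(lines, num_speaker_domains):
    """Return (line_no, index) pairs whose index is out of range."""
    bad = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        idx = int(line.rsplit("|", 1)[1])
        if not 0 <= idx < num_speaker_domains:
            bad.append((line_no, idx))
    return bad

# Example with in-memory lines; in practice, read Data/train_list.txt.
lines = ["a.wav|0", "b.wav|1", "c.wav|2"]
print(check_speaker_indices(lines, num_speaker_domains=2))  # [(3, 2)]
```

An empty result means all indices are in range and training should not hit the indexing assertion for this reason.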
References and Acknowledgements