Voice Conversion UTMOS Results

Thanks for the nice work 👍

I have added UTMOS results for voice conversion. The UTMOS results are very similar to the naturalness MOS (nMOS) on an English speech dataset.

I used the same 400 samples that I used in the HierVST paper. HierVST: [Paper] [Demo]

I used the official implementation of each model and trained it on LibriTTS-Train-Clean-100, 360 (1,151 speakers). For YourTTS, I used the official checkpoint instead.

Many-to-Many Voice Style Transfer

Training data: LibriTTS-Train-Clean-100, 360
Test data: LibriTTS-Train-Clean-100, 360

Model	nMOS	UTMOS
GT	4.55	3.96
HiFi-GAN	4.17	3.34
AutoVC	2.57	2.77
VoiceMixer	2.84	2.94
DiffVC	3.50	3.17
Speech Resynthesis	2.75	2.79
YourTTS	2.83	2.79
HierVST	4.06	4.05

Zero-shot Voice Style Transfer

Training data: LibriTTS-Train-Clean-100, 360
Test data: VCTK

Model	nMOS	UTMOS
GT	4.42	4.04
HiFi-GAN	4.15	3.54
AutoVC	2.47	3.04
VoiceMixer	2.79	3.19
DiffVC	3.51	3.49
Speech Resynthesis	2.27	3.26
YourTTS	2.69	3.09
HierVST	4.12	4.19
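
To put a rough number on how closely UTMOS tracks nMOS in the two tables above, here is a minimal sketch that computes the system-level Pearson and Spearman correlations. The values are copied from the tables; the helper names (`pearson`, `average_ranks`, `spearman`, `summarize`) are my own and not part of SpeechMOS. With only eight systems per table, treat the numbers as indicative rather than conclusive.

```python
from math import sqrt

# (nMOS, UTMOS) pairs copied from the two tables above.
MANY_TO_MANY = {
    "GT": (4.55, 3.96),
    "HiFi-GAN": (4.17, 3.34),
    "AutoVC": (2.57, 2.77),
    "VoiceMixer": (2.84, 2.94),
    "DiffVC": (3.50, 3.17),
    "Speech Resynthesis": (2.75, 2.79),
    "YourTTS": (2.83, 2.79),
    "HierVST": (4.06, 4.05),
}
ZERO_SHOT = {
    "GT": (4.42, 4.04),
    "HiFi-GAN": (4.15, 3.54),
    "AutoVC": (2.47, 3.04),
    "VoiceMixer": (2.79, 3.19),
    "DiffVC": (3.51, 3.49),
    "Speech Resynthesis": (2.27, 3.26),
    "YourTTS": (2.69, 3.09),
    "HierVST": (4.12, 4.19),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def average_ranks(values):
    """1-based ascending ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


def summarize(table):
    """Return (Pearson r, Spearman rho) between the nMOS and UTMOS columns."""
    nmos = [scores[0] for scores in table.values()]
    utmos = [scores[1] for scores in table.values()]
    return pearson(nmos, utmos), spearman(nmos, utmos)


for name, table in (("Many-to-many", MANY_TO_MANY), ("Zero-shot", ZERO_SHOT)):
    r, rho = summarize(table)
    print(f"{name}: Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```

One thing the correlations do not show: in both tables UTMOS scores HierVST slightly above GT, while nMOS puts GT first.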

I have one question, though. When I ran UTMOS on a high-quality dataset of real Korean speech, the average UTMOS score was 2.50. Have you seen a similar case?
