tarepan / SpeechMOS

Easy-to-Use Speech MOS predictors
MIT License

Voice Conversion UTMOS Results #1

Closed by sh-lee-prml 11 months ago

sh-lee-prml commented 11 months ago

Hi

Thanks for the nice work 👍

I have added UTMOS results for voice conversion. The UTMOS results are very similar to the naturalness MOS (nMOS) results on the English speech dataset.

I used the same 400 samples as in the HierVST paper. HierVST: [Paper] [Demo]

I trained each model with its official implementation on the LibriTTS train-clean-100 and train-clean-360 subsets (1,151 speakers). For YourTTS, I used the official checkpoint.

Many-to-Many Voice Style Transfer

| Model | nMOS | UTMOS |
|---|---|---|
| GT | 4.55 | 3.96 |
| HiFi-GAN | 4.17 | 3.34 |
| AutoVC | 2.57 | 2.77 |
| VoiceMixer | 2.84 | 2.94 |
| DiffVC | 3.50 | 3.17 |
| Speech Resynthesis | 2.75 | 2.79 |
| YourTTS | 2.83 | 2.79 |
| HierVST | 4.06 | 4.05 |

Zero-shot Voice Style Transfer

| Model | nMOS | UTMOS |
|---|---|---|
| GT | 4.42 | 4.04 |
| HiFi-GAN | 4.15 | 3.54 |
| AutoVC | 2.47 | 3.04 |
| VoiceMixer | 2.79 | 3.19 |
| DiffVC | 3.51 | 3.49 |
| Speech Resynthesis | 2.27 | 3.26 |
| YourTTS | 2.69 | 3.09 |
| HierVST | 4.12 | 4.19 |

However, I have a question. When I ran UTMOS on a high-quality dataset of real Korean speech, the average UTMOS score was 2.50. Have you seen a similar case?

tarepan commented 11 months ago

Thanks for sharing UTMOS results!
As the UTMOS paper argues, your UTMOS ranking (as opposed to the absolute MOS values) is very similar to the nMOS ranking.
These are valuable results!
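
A minimal sketch of how the ranking agreement could be checked, using the nMOS and UTMOS values from the two tables in this issue. Spearman's rank correlation is chosen here as an assumption; the thread does not say which measure, if any, was applied. The helper names `average_ranks` and `spearman` are illustrative and are not part of SpeechMOS.

```python
from math import sqrt

def average_ranks(values):
    """Return 1-based ascending ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(var_x * var_y)

# Row order: GT, HiFi-GAN, AutoVC, VoiceMixer, DiffVC,
#            Speech Resynthesis, YourTTS, HierVST
# Values copied from the tables in this issue.
m2m_nmos = [4.55, 4.17, 2.57, 2.84, 3.50, 2.75, 2.83, 4.06]
m2m_utmos = [3.96, 3.34, 2.77, 2.94, 3.17, 2.79, 2.79, 4.05]
zs_nmos = [4.42, 4.15, 2.47, 2.79, 3.51, 2.27, 2.69, 4.12]
zs_utmos = [4.04, 3.54, 3.04, 3.19, 3.49, 3.26, 3.09, 4.19]

rho_m2m = spearman(m2m_nmos, m2m_utmos)
rho_zs = spearman(zs_nmos, zs_utmos)
print(f"many-to-many rho = {rho_m2m:.3f}")
print(f"zero-shot    rho = {rho_zs:.3f}")
```

A value close to 1 means the two metrics order the systems almost identically, even when their absolute scores differ. With only eight systems per table, the coefficient is a rough indicator rather than a statistically strong result.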

> Korean real dataset, the average score of UTMOS is 2.50. Did you have a similar case?

Yes, I have a similar problem with Japanese.
When I score an ordinary female voice that is clear, high-pitched, and slightly whispery (つくよみちゃん / Tsukuyomi-chan), the score is around 2.5.
Another Japanese speaker's utterances score 3.8, so this is not just a matter of language, but of the combination of language and speaker.

Another researcher has reported a similar tendency in Japanese (a narrow MOS range, shifted lower).
This is the link to his tweet.