Thank you for your interest in our work! That's a good question.
Differences in generation quality between different vocoders (including recorded ground truth) are now small because BigVGAN/SAN can synthesize high-fidelity audio samples. This makes differences in recording quality among the LibriTTS recordings relatively more noticeable. In previous studies, raters likely had a strong tendency to give high scores to ground-truth samples and low scores to synthesized ones. In our test, raters may have struggled to find differences between the provided samples and therefore sometimes gave low scores to ground-truth samples.
According to the paper https://arxiv.org/abs/2305.10608, listeners tend to use the entire range from 1 to 5 even when they are provided with only high-quality samples in a listening test. Our observation matches the phenomenon explained in the paper.
This is what happened. Does it make sense?
Thank you for your fast reply! Yes, I think what you describe is one of the causes. In that case, though, I would expect the MOS for the test-other dataset to be generally lower and the MOS for the test-clean dataset to be generally higher. What did you observe regarding this point?
(I mistakenly replied with my private account yesterday. Sorry for the confusion caused.)
That's a good point. Unfortunately, I can't trace the numbers for the test-clean/test-other sets. Instead, I've confirmed that MOS scores are distributed differently for each speech sample. The MOS scores for one speech sample are

- Ground truth: 4.62 ± 0.95
- BigVGAN (our reproduction): 3.75 ± 1.30
- BigVSAN: 4.38 ± 1.36
- BigVSAN (w/ snakebeta): 4.38 ± 0.95

and those for another speech sample are

- Ground truth: 2.62 ± 1.94
- BigVGAN (our reproduction): 2.50 ± 2.40
- BigVSAN: 2.50 ± 1.96
- BigVSAN (w/ snakebeta): 2.75 ± 2.14

The former should be a clean speech sample, and the latter should be a noisy one.
Thank you for your insightful comment.
Thanks for sharing your results :) I see that when clean and noisy speech samples are used in the same MOS experiment, it may not be possible to properly evaluate how close the synthesized speech is to natural speech. This finding will be helpful in designing future experiments. Thank you very much for the useful information and constructive discussion!
Thanks for the nice work :)
I also have the same question about the CI of the MOS.
In my opinion, the authors may have computed the confidence interval for each listener. In general, however, the confidence interval is computed over all the samples the listeners rated, so I suspect this is why such a high CI is reported in this paper.
Thank you again for very nice work!
@sh-lee-prml Thank you very much for your comment! We asked 8 listeners to rate 10 samples for each model (we have 4 models to be evaluated, so each listener rated 40 samples in total). Then, we got 80 ratings per model. We calculated averages and CIs using the 80 scores.
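For reference, the computation is essentially the following (a minimal sketch with hypothetical ratings, not our actual analysis script):

```python
import numpy as np

# Hypothetical example: 8 listeners x 10 samples = 80 ratings for one model.
# Replace with the actual MOS ratings (integers from 1 to 5).
rng = np.random.default_rng(0)
ratings = rng.integers(1, 6, size=80).astype(float)

mos = ratings.mean()
# Half-width of the 95% confidence interval of the mean
# under a normal approximation: 1.96 * sample std / sqrt(n).
ci95 = 1.96 * ratings.std(ddof=1) / np.sqrt(len(ratings))

print(f"MOS: {mos:.2f} +/- {ci95:.2f} (95% CI, n={len(ratings)})")
```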
Thanks for the details.
Then, I think it's right :)
Also, there is an easy-to-use MOS prediction model (https://github.com/tarepan/SpeechMOS).
It would be better if you increased the number of samples for MOS evaluation by using a simple MOS prediction model! For example, see the sketch below.
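Something like the following could be used to score samples automatically (a rough sketch based on the usage shown in the SpeechMOS README; the exact torch.hub tag and entry-point name should be double-checked against the repo):

```python
import torch
import librosa

# Path is a placeholder; point this at one of your synthesized samples.
wave, sr = librosa.load("generated_sample.wav", sr=None, mono=True)

# UTMOS-strong predictor loaded via torch.hub, following the SpeechMOS README
# (tag/entry-point names may change; please verify against the repo).
predictor = torch.hub.load("tarepan/SpeechMOS:v1.2.0", "utmos22_strong", trust_repo=True)

with torch.no_grad():
    score = predictor(torch.from_numpy(wave).unsqueeze(0), sr)

print(f"Predicted MOS: {score.item():.2f}")
```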
Thank you 👍
@sh-lee-prml I appreciate your helpful comment. Thank you!
Hi! Thank you for your excellent work! I have a question about the MOS evaluation in the paper. The 95% confidence intervals in the MOS assessment are roughly 2, which I think is quite large; the corresponding confidence interval in the original BigVGAN paper is roughly 0.1. This paper does not mention this point, but if there is a reason, could you please let me know? Thanks.