google / visqol

Perceptual Quality Estimator for speech and audio
Apache License 2.0
683 stars 124 forks

Do not get the maximum of MOS value using two same audio under speech mode #89

Open Ximoo123 opened 1 year ago

Ximoo123 commented 1 year ago

Hi, thanks for the good work! When I run speech mode on two identical audio files sampled at 16 kHz, many of the resulting MOS values are around 4.4-4.6 and do not reach the maximum of 5.0, even though the NSIM score and the similarity of every audio segment are 1.0. Is this normal? I got these results using the SVR model you provided: "lattice_tcditugenmeetpackhref_ls2_nl60_lr12_bs2048_learn.005_ep2400_train1_7_raw.tflite"

mchinen commented 1 year ago

Hi, thanks for the question. Yes, this is expected. If you run a subjective test with the same audio, you will not see 5.0. It depends on the content and raters, but the typical score for ground truth clean wideband audio is 4.5 to 4.75. There is a flag called --use_unscaled_speech_mos_mapping that allowed scaling to 5.0 when set to false, but I think it has been deprecated with recent models (we should open a bug for that).

rsanchezpizani commented 1 year ago

I would say 4.6-4.7 is actually in agreement with the ITU standard. Getting a 5 as a single score from one person is possible. Getting a MOS of 5 is not normal. Am I right in thinking that the reported score is meant to be the MOS, i.e. equivalent to running the test with many people? If that is the case, a score of 4.6-4.8 is the maximum. If you interview 1000 people and get a MOS of 5, then it is likely that the comparison/experiment is wrong. So I think a value lower than 5 is correct.
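To make the averaging argument concrete, here is a minimal sketch that treats a MOS as the arithmetic mean of integer ratings on a 1-5 scale. The helper name `mean_opinion_score` and the panel numbers are hypothetical, invented for illustration; they are not taken from ViSQOL or from any real listening test.

```python
from statistics import mean


def mean_opinion_score(ratings):
    """Return the MOS: the arithmetic mean of integer ratings on a 1-5 scale."""
    if not ratings:
        raise ValueError("need at least one rating")
    if any(r not in (1, 2, 3, 4, 5) for r in ratings):
        raise ValueError("ratings must be integers from 1 to 5")
    return mean(ratings)


# Hypothetical panel of 1000 listeners rating a clean reference signal:
# 700 of them answer 5 and 300 answer 4 (made-up numbers).
panel = [5] * 700 + [4] * 300
print(mean_opinion_score(panel))

# The mean reaches 5 only if every single listener answers 5.
print(mean_opinion_score([5] * 1000))
```

Because the scale is capped at 5, a single rating below 5 pulls the mean under 5, which is why a large panel practically never produces a MOS of exactly 5.0.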

Rodrigo Sanchez-Pizani
