Open Ximoo123 opened 1 year ago
Hi, thanks for the question. Yes, this is expected. If you run a subjective test with the same audio, you will not see 5.0. It depends on the content and the raters, but the typical score for ground-truth clean wideband audio is 4.5 to 4.75. There is a flag called --use_unscaled_speech_mos_mapping, which allowed scaling to 5.0 when set to false, but I think it has been deprecated with recent models (we should open a bug for that).
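For anyone who wants identical inputs to report 5.0 in their own tooling, a rough sketch of the kind of rescaling described above follows. This is not ViSQOL's actual mapping: the function name `rescale_to_max_mos`, the multiplicative form, and the ceiling value of 4.6 are all assumptions chosen for illustration.

```python
def rescale_to_max_mos(raw_mos, raw_ceiling, max_mos=5.0, min_mos=1.0):
    """Stretch a raw score so that raw_ceiling maps to max_mos.

    Hypothetical helper, not part of ViSQOL. raw_ceiling is the score the
    model returns for a perfect match (identical reference and degraded
    audio), which you would measure yourself.
    """
    if raw_ceiling <= 0:
        raise ValueError("raw_ceiling must be positive")
    scaled = raw_mos * (max_mos / raw_ceiling)
    # Clamp to the valid MOS range so the result stays on the 1-5 scale.
    return min(max(scaled, min_mos), max_mos)


# Assumed ceiling of 4.6, in line with the 4.4-4.6 reported in this thread.
print(rescale_to_max_mos(4.6, raw_ceiling=4.6))
print(rescale_to_max_mos(2.3, raw_ceiling=4.6))
```

Note that a rescaled value is no longer comparable with subjective MOS results, where clean audio typically lands below 5.0.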
I would say 4.6-4.7 is actually in agreement with the ITU's standard. Getting a 5 as a single score from one person is possible. Getting a MOS of 5 is not normal. Am I right in thinking that the score is meant to be a MOS, i.e. equivalent to running the test with many people? If that is the case, a score of 4.6-4.8 is the maximum. If you interview 1000 people and you get a MOS of 5, then it is likely that the comparison/experiment is wrong. So I think a value lower than 5 is correct.
Rodrigo Sanchez-Pizani
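To make the averaging argument concrete, here is a minimal sketch assuming a MOS is the plain arithmetic mean of integer ratings on a 1-5 scale. The panel below is invented for illustration and is not data from any real listening test.

```python
def mean_opinion_score(ratings):
    """Return the arithmetic mean of integer ratings on the 1-5 scale."""
    if not ratings:
        raise ValueError("need at least one rating")
    if any(r not in (1, 2, 3, 4, 5) for r in ratings):
        raise ValueError("ratings must be integers from 1 to 5")
    return sum(ratings) / len(ratings)


# One listener can give a 5, so a single score of 5.0 is possible.
single_rater = mean_opinion_score([5])

# Hypothetical panel of 1000 listeners rating the same clean audio:
# most give 5, but some give 4 or 3, which pulls the mean below 5.
panel = [5] * 700 + [4] * 250 + [3] * 50
panel_mos = mean_opinion_score(panel)

print(single_rater, panel_mos)
```

Under this made-up distribution the panel mean is 4.65: only unanimous 5s would give exactly 5.0, which is why a score in the 4.5 to 4.75 range is the practical ceiling for clean audio.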
(In reply to the comment above, which Michael Chinen posted on Fri, 31 Mar 2023, 01:08.)
Hi, thanks for the good work! When I run in speech mode with two identical audio files sampled at 16 kHz, many of the resulting MOS values are around 4.4-4.6 and do not reach the maximum value of 5.0. However, the NSIM score and the similarity of all audio segments are 1.0. Is this normal? I got these results using the SVR model you provided: "lattice_tcditugenmeetpackhref_ls2_nl60_lr12_bs2048_learn.005_ep2400_train1_7_raw.tflite"