Open rolandsjlociks opened 2 years ago
Hi, thank you for the question, and apologies for not replying until now.
I believe the times listed are the original timestamps in each file, before alignment. ViSQOL might behave poorly if the global delay is greater than half a patch. In that case, some reference patches will probably not find good matches. See https://github.com/google/visqol/blob/master/src/alignment.cc#L65
However, given that your audio has silence to start, I'm not sure how much of an issue it is, since the VAD should not be active. I'm not sure if this helps, but let me know if I can provide more info.
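To make the half-patch rule of thumb concrete, here is a minimal sketch that compares the reference and degraded start times of a few rows copied from the speech-mode patch table in the question. It is not ViSQOL code: the helper name `patches_beyond_half_patch` is hypothetical, and treating the per-patch start-time difference as a stand-in for the global delay is an assumption made only for illustration.

```python
# Rows copied from the speech-mode patch table in the question:
# (patch index, ref start, ref end, deg start, deg end), all in seconds.
PATCHES = [
    (0, 0.180, 0.580, 1.440, 1.840),
    (1, 2.181, 2.580, 2.180, 2.579),
    (9, 6.580, 6.980, 6.560, 6.960),
    (12, 8.980, 9.380, 8.980, 9.380),
]


def patches_beyond_half_patch(patches):
    """Return (index, offset) for patches whose degraded start is more than
    half a patch length away from the reference start.

    Hypothetical helper for reading the verbose output; the half-patch
    threshold is the rule of thumb from this reply, not a ViSQOL constant.
    """
    flagged = []
    for idx, ref_start, ref_end, deg_start, _deg_end in patches:
        half_patch = (ref_end - ref_start) / 2.0
        offset = deg_start - ref_start  # positive: degraded patch starts later
        if abs(offset) > half_patch:
            flagged.append((idx, round(offset, 3)))
    return flagged


print(patches_beyond_half_patch(PATCHES))
```

With these rows, only patch 0 is flagged: its degraded start is about 1.26 s later than its reference start, while the patch is 0.4 s long.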
Hello!
I have some questions regarding the behaviour of the patch detection and alignment.
Firstly, I would like to clarify: in Visqol's --verbose output, are the patch start and end times displayed before or after the global alignment step? I.e., if there is an overall 2 second delay in the degraded signal, which to my understanding gets detected during global alignment, will the patch times in the output be post-alignment or pre-alignment times?
Secondly, do the sample guidelines mentioned in the README (8-10 seconds long, 0.5 seconds of silence at the beginning and end, not much silence in the middle of the sample) apply to Speech mode too, or should the alignment and voice activity detection of Speech mode handle audio with delay and with a lot of silence in the sample?
An example of a sample that I am currently trying to use with Visqol. View from Audacity (screenshot not included):
Visqol Speech mode output: MOS-LQO: 2.52259
| FVNSIM | Freq Band |
| --- | --- |
| 0.377030 | 50.000Hz |
| 0.476088 | 98.767Hz |
| 0.459481 | 156.063Hz |
| 0.763644 | 223.380Hz |
| 0.835369 | 302.471Hz |
| 0.923244 | 395.394Hz |
| 0.926039 | 504.570Hz |
| 0.903042 | 632.839Hz |
| 0.884796 | 783.543Hz |
| 0.844095 | 960.604Hz |
| 0.841896 | 1168.633Hz |
| 0.866645 | 1413.046Hz |
| 0.860731 | 1700.205Hz |
| 0.856387 | 2037.587Hz |
| 0.825984 | 2433.977Hz |
| 0.823645 | 2899.694Hz |
| 0.745470 | 3446.863Hz |
| 0.706072 | 4089.731Hz |
| 0.700011 | 4845.034Hz |
| 0.635031 | 5732.437Hz |
| 0.547395 | 6775.044Hz |
| Patch Idx | Similarity | Ref Patch: Start - End | Deg Patch: Start - End |
| --- | --- | --- | --- |
| 0 | 1.000000 | 0.180 - 0.580 | 1.440 - 1.840 |
| 1 | 0.764560 | 2.181 - 2.580 | 2.180 - 2.579 |
| 2 | 0.772817 | 2.580 - 2.980 | 2.580 - 2.980 |
| 3 | 0.843457 | 3.780 - 4.180 | 3.780 - 4.180 |
| 4 | 0.814809 | 4.180 - 4.580 | 4.180 - 4.580 |
| 5 | 0.780449 | 4.580 - 4.980 | 4.580 - 4.980 |
| 6 | 0.699916 | 5.380 - 5.780 | 5.380 - 5.780 |
| 7 | 0.773998 | 5.781 - 6.180 | 5.780 - 6.179 |
| 8 | 0.693399 | 6.181 - 6.580 | 6.180 - 6.579 |
| 9 | 0.529567 | 6.580 - 6.980 | 6.560 - 6.960 |
| 10 | 0.728254 | 8.180 - 8.580 | 8.180 - 8.580 |
| 11 | 0.673384 | 8.580 - 8.980 | 8.580 - 8.980 |
| 12 | 0.707640 | 8.980 - 9.380 | 8.980 - 9.380 |
For reference, Visqol Audio mode output: MOS-LQO: 3.41303
| FVNSIM | Freq Band |
| --- | --- |
| 0.533289 | 50.000Hz |
| 0.544615 | 91.748Hz |
| 0.645831 | 139.746Hz |
| 0.804246 | 194.931Hz |
| 0.902527 | 258.379Hz |
| 0.936618 | 331.326Hz |
| 0.961259 | 415.195Hz |
| 0.957324 | 511.621Hz |
| 0.950879 | 622.484Hz |
| 0.941872 | 749.946Hz |
| 0.922163 | 896.492Hz |
| 0.927609 | 1064.979Hz |
| 0.931144 | 1258.694Hz |
| 0.944402 | 1481.411Hz |
| 0.929926 | 1737.475Hz |
| 0.933558 | 2031.877Hz |
| 0.926355 | 2370.358Hz |
| 0.924536 | 2759.518Hz |
| 0.879479 | 3206.945Hz |
| 0.863854 | 3721.361Hz |
| 0.881097 | 4312.798Hz |
| 0.862002 | 4992.786Hz |
| 0.802238 | 5774.585Hz |
| 0.704898 | 6673.438Hz |
| 0.588221 | 7706.870Hz |
| 0.578189 | 8895.030Hz |
| 0.593581 | 10261.087Hz |
| 0.599670 | 11831.674Hz |
| 0.602659 | 13637.414Hz |
| 0.621666 | 15713.517Hz |
| 0.694320 | 18100.460Hz |
| 0.786449 | 20844.785Hz |
| Patch Idx | Similarity | Ref Patch: Start - End | Deg Patch: Start - End |
| --- | --- | --- | --- |
| 0 | 1.000000 | 0.280 - 0.880 | 1.200 - 1.800 |
| 1 | 1.000000 | 0.880 - 1.480 | 1.220 - 1.820 |
| 2 | 1.000000 | 1.480 - 2.079 | 1.241 - 1.840 |
| 3 | 0.681553 | 2.081 - 2.680 | 2.080 - 2.679 |
| 4 | 0.632587 | 2.680 - 3.280 | 2.680 - 3.280 |
| 5 | 0.891469 | 3.280 - 3.880 | 3.280 - 3.880 |
| 6 | 0.648068 | 3.880 - 4.480 | 3.880 - 4.480 |
| 7 | 0.681742 | 4.480 - 5.080 | 4.480 - 5.080 |
| 8 | 0.599973 | 5.080 - 5.680 | 5.080 - 5.680 |
| 9 | 0.611158 | 5.681 - 6.280 | 5.680 - 6.279 |
| 10 | 0.511901 | 6.280 - 6.880 | 6.280 - 6.880 |
| 11 | 0.938552 | 6.880 - 7.480 | 6.880 - 7.480 |
| 12 | 0.884550 | 7.480 - 8.075 | 7.505 - 8.100 |
| 13 | 0.611468 | 8.080 - 8.678 | 8.082 - 8.680 |
| 14 | 0.567279 | 8.680 - 9.280 | 8.680 - 9.280 |
| 15 | 0.990090 | 9.280 - 9.880 | 10.420 - 11.020 |
| 16 | 1.000000 | 9.880 - 10.480 | 10.440 - 11.040 |
| 17 | 1.000000 | 10.482 - 11.080 | 10.460 - 11.058 |
| 18 | 0.995017 | 11.080 - 11.649 | 10.591 - 11.160 |
The audio sample itself is a voice recording. Essentially, what I am trying to figure out is this: could feeding Visqol voice samples with delay and a lot of silence be the culprit behind the questionable scores we've been getting, or should we look for problems elsewhere?