google / visqol

Perceptual Quality Estimator for speech and audio
Apache License 2.0

Voice activity detection/patch alignment #54

Open rolandsjlociks opened 2 years ago

rolandsjlociks commented 2 years ago

Hello!

I have some questions regarding the behaviour of the patch detection and alignment.

Firstly, I would like to clarify: in ViSQOL's --verbose output, are the patch start and end times displayed before or after ViSQOL's global alignment step? That is, if there is an overall 2 second delay in the degraded signal, which to my understanding gets detected during global alignment, will the patch times in the output be pre-alignment or post-alignment times?

Secondly, do the sample guidelines mentioned in the Readme (8-10 seconds long, 0.5 seconds of silence at the beginning and end, not much silence in the middle of the sample) apply to Speech mode too, or should the alignment and voice activity detection of Speech mode handle audio with delay and with a lot of silence in the sample?

An example of the samples that I am currently trying to use with ViSQOL (view from Audacity): [image]

ViSQOL Speech mode output:

MOS-LQO: 2.52259

| FVNSIM | Freq Band |
|---|---|
| 0.377030 | 50.000Hz |
| 0.476088 | 98.767Hz |
| 0.459481 | 156.063Hz |
| 0.763644 | 223.380Hz |
| 0.835369 | 302.471Hz |
| 0.923244 | 395.394Hz |
| 0.926039 | 504.570Hz |
| 0.903042 | 632.839Hz |
| 0.884796 | 783.543Hz |
| 0.844095 | 960.604Hz |
| 0.841896 | 1168.633Hz |
| 0.866645 | 1413.046Hz |
| 0.860731 | 1700.205Hz |
| 0.856387 | 2037.587Hz |
| 0.825984 | 2433.977Hz |
| 0.823645 | 2899.694Hz |
| 0.745470 | 3446.863Hz |
| 0.706072 | 4089.731Hz |
| 0.700011 | 4845.034Hz |
| 0.635031 | 5732.437Hz |
| 0.547395 | 6775.044Hz |

| Patch Idx | Similarity | Ref Patch: Start - End | Deg Patch: Start - End |
|---|---|---|---|
| 0 | 1.000000 | 0.180 - 0.580 | 1.440 - 1.840 |
| 1 | 0.764560 | 2.181 - 2.580 | 2.180 - 2.579 |
| 2 | 0.772817 | 2.580 - 2.980 | 2.580 - 2.980 |
| 3 | 0.843457 | 3.780 - 4.180 | 3.780 - 4.180 |
| 4 | 0.814809 | 4.180 - 4.580 | 4.180 - 4.580 |
| 5 | 0.780449 | 4.580 - 4.980 | 4.580 - 4.980 |
| 6 | 0.699916 | 5.380 - 5.780 | 5.380 - 5.780 |
| 7 | 0.773998 | 5.781 - 6.180 | 5.780 - 6.179 |
| 8 | 0.693399 | 6.181 - 6.580 | 6.180 - 6.579 |
| 9 | 0.529567 | 6.580 - 6.980 | 6.560 - 6.960 |
| 10 | 0.728254 | 8.180 - 8.580 | 8.180 - 8.580 |
| 11 | 0.673384 | 8.580 - 8.980 | 8.580 - 8.980 |
| 12 | 0.707640 | 8.980 - 9.380 | 8.980 - 9.380 |

For reference, ViSQOL Audio mode output:

MOS-LQO: 3.41303

| FVNSIM | Freq Band |
|---|---|
| 0.533289 | 50.000Hz |
| 0.544615 | 91.748Hz |
| 0.645831 | 139.746Hz |
| 0.804246 | 194.931Hz |
| 0.902527 | 258.379Hz |
| 0.936618 | 331.326Hz |
| 0.961259 | 415.195Hz |
| 0.957324 | 511.621Hz |
| 0.950879 | 622.484Hz |
| 0.941872 | 749.946Hz |
| 0.922163 | 896.492Hz |
| 0.927609 | 1064.979Hz |
| 0.931144 | 1258.694Hz |
| 0.944402 | 1481.411Hz |
| 0.929926 | 1737.475Hz |
| 0.933558 | 2031.877Hz |
| 0.926355 | 2370.358Hz |
| 0.924536 | 2759.518Hz |
| 0.879479 | 3206.945Hz |
| 0.863854 | 3721.361Hz |
| 0.881097 | 4312.798Hz |
| 0.862002 | 4992.786Hz |
| 0.802238 | 5774.585Hz |
| 0.704898 | 6673.438Hz |
| 0.588221 | 7706.870Hz |
| 0.578189 | 8895.030Hz |
| 0.593581 | 10261.087Hz |
| 0.599670 | 11831.674Hz |
| 0.602659 | 13637.414Hz |
| 0.621666 | 15713.517Hz |
| 0.694320 | 18100.460Hz |
| 0.786449 | 20844.785Hz |

| Patch Idx | Similarity | Ref Patch: Start - End | Deg Patch: Start - End |
|---|---|---|---|
| 0 | 1.000000 | 0.280 - 0.880 | 1.200 - 1.800 |
| 1 | 1.000000 | 0.880 - 1.480 | 1.220 - 1.820 |
| 2 | 1.000000 | 1.480 - 2.079 | 1.241 - 1.840 |
| 3 | 0.681553 | 2.081 - 2.680 | 2.080 - 2.679 |
| 4 | 0.632587 | 2.680 - 3.280 | 2.680 - 3.280 |
| 5 | 0.891469 | 3.280 - 3.880 | 3.280 - 3.880 |
| 6 | 0.648068 | 3.880 - 4.480 | 3.880 - 4.480 |
| 7 | 0.681742 | 4.480 - 5.080 | 4.480 - 5.080 |
| 8 | 0.599973 | 5.080 - 5.680 | 5.080 - 5.680 |
| 9 | 0.611158 | 5.681 - 6.280 | 5.680 - 6.279 |
| 10 | 0.511901 | 6.280 - 6.880 | 6.280 - 6.880 |
| 11 | 0.938552 | 6.880 - 7.480 | 6.880 - 7.480 |
| 12 | 0.884550 | 7.480 - 8.075 | 7.505 - 8.100 |
| 13 | 0.611468 | 8.080 - 8.678 | 8.082 - 8.680 |
| 14 | 0.567279 | 8.680 - 9.280 | 8.680 - 9.280 |
| 15 | 0.990090 | 9.280 - 9.880 | 10.420 - 11.020 |
| 16 | 1.000000 | 9.880 - 10.480 | 10.440 - 11.040 |
| 17 | 1.000000 | 10.482 - 11.080 | 10.460 - 11.058 |
| 18 | 0.995017 | 11.080 - 11.649 | 10.591 - 11.160 |

The audio sample itself is a voice recording. Essentially, what I am trying to figure out is this: could feeding ViSQOL voice samples with delay and a lot of silence be the culprit behind the questionable scores we've been getting, or should we look for problems elsewhere?
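As a rough way to check a sample against the Readme guideline quoted above (0.5 seconds of silence at the beginning and end), the following minimal sketch measures leading and trailing silence with a simple RMS gate. The function names, the 0.01 threshold, and the 20 ms frame length are illustrative assumptions; this is not ViSQOL's VAD and will not necessarily agree with it.

```python
import math


def frame_rms(samples, start, length):
    """RMS of one frame of float samples (hypothetical helper)."""
    frame = samples[start:start + length]
    if not frame:
        return 0.0
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def edge_silence_seconds(samples, sample_rate, threshold=0.01, frame_ms=20):
    """Return (leading, trailing) silence in seconds.

    threshold and frame_ms are assumed values for illustration only.
    """
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    starts = range(0, len(samples), frame_len)
    active = [frame_rms(samples, s, frame_len) > threshold for s in starts]
    if not any(active):
        total = len(samples) / sample_rate
        return total, total
    first = active.index(True)
    last = len(active) - 1 - active[::-1].index(True)
    leading = first * frame_len / sample_rate
    trailing = max(0.0, (len(samples) - (last + 1) * frame_len) / sample_rate)
    return leading, trailing


# Synthetic signal: 0.5 s silence, 1.0 s of a 100 Hz tone, 0.3 s silence.
SAMPLE_RATE = 1000
tone = [0.5 * math.sin(2 * math.pi * 100 * n / SAMPLE_RATE) for n in range(1000)]
signal = [0.0] * 500 + tone + [0.0] * 300

leading, trailing = edge_silence_seconds(signal, SAMPLE_RATE)
meets_guideline = leading >= 0.5 and trailing >= 0.5
print(leading, trailing, meets_guideline)
```

For real files, the samples would first need to be loaded and scaled to floats; that step is left out here to keep the sketch dependency-free.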

mchinen commented 2 years ago

Hi, thank you for the question, and apologies for not replying until now.

I believe that the times listed are the original timestamps of the file, before they are aligned. ViSQOL might behave poorly if the global delay is greater than half a patch. This will probably leave some reference patches without good matches. https://github.com/google/visqol/blob/master/src/alignment.cc#L65

However, given that your audio has silence to start, I'm not sure how much of an issue it is, since the VAD should not be active. I'm not sure if this helps, but let me know if I can provide more info.
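One rough way to look for affected patches in the --verbose output is to compare the reference and degraded start times row by row. The sketch below assumes the printed times are the original (pre-alignment) timestamps, as I believe they are, and applies the half-patch threshold per patch purely as an illustration; it is not how ViSQOL itself decides anything. The function names are hypothetical, and the sample rows are taken from the Speech mode table above.

```python
def parse_patch_rows(text):
    """Parse rows like '| 0 | 1.000000 | 0.180 - 0.580 | 1.440 - 1.840 |'."""
    rows = []
    for line in text.strip().splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Skip header, separator, and malformed rows.
        if len(cells) != 4 or not cells[0].isdigit():
            continue
        ref_start, ref_end = (float(x) for x in cells[2].split(" - "))
        deg_start, deg_end = (float(x) for x in cells[3].split(" - "))
        rows.append((int(cells[0]), float(cells[1]),
                     ref_start, ref_end, deg_start, deg_end))
    return rows


def flag_large_offsets(rows):
    """Return rows whose deg/ref start offset exceeds half the patch length."""
    flagged = []
    for idx, sim, ref_start, ref_end, deg_start, deg_end in rows:
        offset = deg_start - ref_start
        half_patch = (ref_end - ref_start) / 2
        if abs(offset) > half_patch:
            flagged.append((idx, sim, offset))
    return flagged


SAMPLE = """
| Patch Idx | Similarity | Ref Patch: Start - End | Deg Patch: Start - End |
| 0 | 1.000000 | 0.180 - 0.580 | 1.440 - 1.840 |
| 1 | 0.764560 | 2.181 - 2.580 | 2.180 - 2.579 |
| 9 | 0.529567 | 6.580 - 6.980 | 6.560 - 6.960 |
"""

rows = parse_patch_rows(SAMPLE)
flagged = flag_large_offsets(rows)
for idx, sim, offset in flagged:
    print(f"patch {idx}: similarity {sim:.3f}, offset {offset:+.3f} s")
```

On these rows only patch 0 is flagged, which is the patch that starts in the leading silence of the reference.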