Voice activity detection/patch alignment

Hello!

I have some questions regarding the behaviour of the patch detection and alignment.

First, I would like to clarify: in Visqol's --verbose output, are the patch start and end times displayed before or after Visqol's global alignment step? I.e., if there is an overall 2-second delay in the degraded signal, which to my understanding gets detected during global alignment, will the patch times in the output be pre- or post-alignment?

Second, do the sample guidelines mentioned in the README (8-10 seconds long, 0.5 seconds of silence at the beginning and end, not much silence in the middle of the sample) apply to Speech mode too? Or should the alignment and voice activity detection in Speech mode handle audio with delay and with a lot of silence in the sample?
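For context, here is a minimal sketch of how I check a sample against those README guidelines before running Visqol. It is not part of Visqol: the function names and the amplitude threshold of 0.01 are my own assumptions, and it expects mono samples already normalised to [-1, 1].

```python
def silence_profile(samples, sample_rate, threshold=0.01):
    """Return (duration_s, leading_silence_s, trailing_silence_s).

    `threshold` is an assumed amplitude below which a sample counts as
    silence; it is not a value taken from Visqol.
    """
    n = len(samples)
    duration = n / sample_rate
    # Indices of every sample loud enough to count as activity.
    active = [i for i, s in enumerate(samples) if abs(s) > threshold]
    if not active:
        # Entirely silent: the whole file is leading and trailing silence.
        return duration, duration, duration
    leading = active[0] / sample_rate
    trailing = (n - 1 - active[-1]) / sample_rate
    return duration, leading, trailing


def meets_readme_guidelines(samples, sample_rate, threshold=0.01):
    """Check the 8-10 s length and 0.5 s edge-silence guidelines."""
    duration, leading, trailing = silence_profile(samples, sample_rate, threshold)
    return 8.0 <= duration <= 10.0 and leading >= 0.5 and trailing >= 0.5
```

This only covers the length and edge-silence parts of the guidelines; it does not measure silence in the middle of the sample.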

An example of a sample that I am currently trying to use with Visqol:

[Image: view of the sample in Audacity]

Visqol Speech mode output:

MOS-LQO: 2.52259

| FVNSIM | Freq Band |
| --- | --- |
| 0.377030 | 50.000Hz |
| 0.476088 | 98.767Hz |
| 0.459481 | 156.063Hz |
| 0.763644 | 223.380Hz |
| 0.835369 | 302.471Hz |
| 0.923244 | 395.394Hz |
| 0.926039 | 504.570Hz |
| 0.903042 | 632.839Hz |
| 0.884796 | 783.543Hz |
| 0.844095 | 960.604Hz |
| 0.841896 | 1168.633Hz |
| 0.866645 | 1413.046Hz |
| 0.860731 | 1700.205Hz |
| 0.856387 | 2037.587Hz |
| 0.825984 | 2433.977Hz |
| 0.823645 | 2899.694Hz |
| 0.745470 | 3446.863Hz |
| 0.706072 | 4089.731Hz |
| 0.700011 | 4845.034Hz |
| 0.635031 | 5732.437Hz |
| 0.547395 | 6775.044Hz |

| Patch Idx | Similarity | Ref Patch: Start - End | Deg Patch: Start - End |
| --- | --- | --- | --- |
| 0 | 1.000000 | 0.180 - 0.580 | 1.440 - 1.840 |
| 1 | 0.764560 | 2.181 - 2.580 | 2.180 - 2.579 |
| 2 | 0.772817 | 2.580 - 2.980 | 2.580 - 2.980 |
| 3 | 0.843457 | 3.780 - 4.180 | 3.780 - 4.180 |
| 4 | 0.814809 | 4.180 - 4.580 | 4.180 - 4.580 |
| 5 | 0.780449 | 4.580 - 4.980 | 4.580 - 4.980 |
| 6 | 0.699916 | 5.380 - 5.780 | 5.380 - 5.780 |
| 7 | 0.773998 | 5.781 - 6.180 | 5.780 - 6.179 |
| 8 | 0.693399 | 6.181 - 6.580 | 6.180 - 6.579 |
| 9 | 0.529567 | 6.580 - 6.980 | 6.560 - 6.960 |
| 10 | 0.728254 | 8.180 - 8.580 | 8.180 - 8.580 |
| 11 | 0.673384 | 8.580 - 8.980 | 8.580 - 8.980 |
| 12 | 0.707640 | 8.980 - 9.380 | 8.980 - 9.380 |
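To make the per-patch delay easier to see, this is a rough sketch of how I have been reading the patch rows. It assumes each row is `| index | similarity | reference start - end | degraded start - end |`, with times in seconds; that column order is my understanding of the output, not something I have confirmed. The function name `patch_offsets` is mine.

```python
import re

# One patch row: | idx | similarity | ref start - end | deg start - end |
# Assumption: the third column is the reference patch and the fourth is
# the degraded patch.
ROW = re.compile(
    r"\|\s*(\d+)\s*\|\s*([\d.]+)\s*"
    r"\|\s*([\d.]+)\s*-\s*([\d.]+)\s*"
    r"\|\s*([\d.]+)\s*-\s*([\d.]+)\s*\|"
)


def patch_offsets(text):
    """Return (patch index, similarity, deg start - ref start) per row."""
    results = []
    for match in ROW.finditer(text):
        index = int(match.group(1))
        similarity = float(match.group(2))
        ref_start = float(match.group(3))
        deg_start = float(match.group(5))
        # Positive offset: the degraded patch starts later than the reference.
        results.append((index, similarity, round(deg_start - ref_start, 3)))
    return results
```

For patch 0 above this gives an offset of 1.440 - 0.180 = 1.26 seconds, while the remaining patches are within about 0.02 seconds of the reference.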

For reference, Visqol Audio mode output:

MOS-LQO: 3.41303

| FVNSIM | Freq Band |
| --- | --- |
| 0.533289 | 50.000Hz |
| 0.544615 | 91.748Hz |
| 0.645831 | 139.746Hz |
| 0.804246 | 194.931Hz |
| 0.902527 | 258.379Hz |
| 0.936618 | 331.326Hz |
| 0.961259 | 415.195Hz |
| 0.957324 | 511.621Hz |
| 0.950879 | 622.484Hz |
| 0.941872 | 749.946Hz |
| 0.922163 | 896.492Hz |
| 0.927609 | 1064.979Hz |
| 0.931144 | 1258.694Hz |
| 0.944402 | 1481.411Hz |
| 0.929926 | 1737.475Hz |
| 0.933558 | 2031.877Hz |
| 0.926355 | 2370.358Hz |
| 0.924536 | 2759.518Hz |
| 0.879479 | 3206.945Hz |
| 0.863854 | 3721.361Hz |
| 0.881097 | 4312.798Hz |
| 0.862002 | 4992.786Hz |
| 0.802238 | 5774.585Hz |
| 0.704898 | 6673.438Hz |
| 0.588221 | 7706.870Hz |
| 0.578189 | 8895.030Hz |
| 0.593581 | 10261.087Hz |
| 0.599670 | 11831.674Hz |
| 0.602659 | 13637.414Hz |
| 0.621666 | 15713.517Hz |
| 0.694320 | 18100.460Hz |
| 0.786449 | 20844.785Hz |

| Patch Idx | Similarity | Ref Patch: Start - End | Deg Patch: Start - End |
| --- | --- | --- | --- |
| 0 | 1.000000 | 0.280 - 0.880 | 1.200 - 1.800 |
| 1 | 1.000000 | 0.880 - 1.480 | 1.220 - 1.820 |
| 2 | 1.000000 | 1.480 - 2.079 | 1.241 - 1.840 |
| 3 | 0.681553 | 2.081 - 2.680 | 2.080 - 2.679 |
| 4 | 0.632587 | 2.680 - 3.280 | 2.680 - 3.280 |
| 5 | 0.891469 | 3.280 - 3.880 | 3.280 - 3.880 |
| 6 | 0.648068 | 3.880 - 4.480 | 3.880 - 4.480 |
| 7 | 0.681742 | 4.480 - 5.080 | 4.480 - 5.080 |
| 8 | 0.599973 | 5.080 - 5.680 | 5.080 - 5.680 |
| 9 | 0.611158 | 5.681 - 6.280 | 5.680 - 6.279 |
| 10 | 0.511901 | 6.280 - 6.880 | 6.280 - 6.880 |
| 11 | 0.938552 | 6.880 - 7.480 | 6.880 - 7.480 |
| 12 | 0.884550 | 7.480 - 8.075 | 7.505 - 8.100 |
| 13 | 0.611468 | 8.080 - 8.678 | 8.082 - 8.680 |
| 14 | 0.567279 | 8.680 - 9.280 | 8.680 - 9.280 |
| 15 | 0.990090 | 9.280 - 9.880 | 10.420 - 11.020 |
| 16 | 1.000000 | 9.880 - 10.480 | 10.440 - 11.040 |
| 17 | 1.000000 | 10.482 - 11.080 | 10.460 - 11.058 |
| 18 | 0.995017 | 11.080 - 11.649 | 10.591 - 11.160 |

The audio sample itself is a voice recording. Essentially, what I am trying to figure out is whether feeding Visqol voice samples with delay and a lot of silence could be the culprit behind the questionable scores we have been getting, or whether we should look for problems elsewhere.
