linto-ai / whisper-timestamped

Multilingual Automatic Speech Recognition with word-level timestamps and confidence
GNU Affero General Public License v3.0

Incorrect timestamps when using VAD with large model only #192

Open freddyertl opened 7 months ago

freddyertl commented 7 months ago

I ran into a problem when using VAD (silero and auditok) with the large model. My application splits the transcription into parts based on pauses. In the sample below, VAD combined with the large model (but not with the smaller ones!) produces incorrect timestamps for the word "A":

--model large --language en --accurate --vad auditok

[01:37.040 --> 01:37.240] Could
[01:37.240 --> 01:37.360] you
[01:37.360 --> 01:37.540] please
[01:37.540 --> 01:37.800] hold
[01:37.800 --> 01:37.920] up
[01:37.920 --> 01:38.060] your
[01:38.060 --> 01:38.280] ID
[01:38.280 --> 01:38.500] to
[01:38.500 --> 01:38.660] the
[01:38.660 --> 01:38.860] webcam?
*** [01:39.120 --> 01:39.700] A >>>>>>>>>>>>>> Pause between "A" and "little" is wrong
[01:45.050 --> 01:45.250] little
[01:45.250 --> 01:45.430] bit
[01:45.430 --> 01:45.690] closer,
[01:45.770 --> 01:46.070] please.

--model large --language en --accurate --vad silero:v3.1

[01:37.030 --> 01:37.230] Could
[01:37.230 --> 01:37.350] you
[01:37.350 --> 01:37.550] please
[01:37.550 --> 01:37.830] hold
[01:37.830 --> 01:37.930] up
[01:37.930 --> 01:38.070] your
[01:38.070 --> 01:38.290] ID
[01:38.290 --> 01:38.510] to
[01:38.510 --> 01:38.650] the
[01:38.650 --> 01:38.810] webcam?
*** [01:39.130 --> 01:39.790] A >>>>>>>>>>>>>> Pause between "A" and "little" is wrong
[01:45.050 --> 01:45.230] little
[01:45.230 --> 01:45.430] bit
[01:45.430 --> 01:45.690] closer,
[01:45.770 --> 01:46.050] please.

--model large --language en --accurate --vad False

[01:36.860 --> 01:37.180] Could
[01:37.180 --> 01:37.340] you
[01:37.340 --> 01:37.600] please
[01:37.600 --> 01:37.820] hold
[01:37.820 --> 01:37.940] up
[01:37.940 --> 01:38.060] your
[01:38.060 --> 01:38.280] ID
[01:38.280 --> 01:38.520] to
[01:38.520 --> 01:38.660] the
[01:38.660 --> 01:38.920] webcam?
*** [01:44.240 --> 01:45.020] A >>>>>>>>>>>>>> This is okay
[01:45.020 --> 01:45.260] little
[01:45.260 --> 01:45.420] bit
[01:45.420 --> 01:45.680] closer,
[01:45.820 --> 01:46.020] please.

--model medium --language en --accurate --vad auditok

[01:37.180 --> 01:37.360] Could
[01:37.360 --> 01:37.520] you
[01:37.520 --> 01:37.780] please
[01:37.780 --> 01:37.940] hold
[01:37.940 --> 01:38.080] up
[01:38.080 --> 01:38.260] your
[01:38.260 --> 01:38.500] ID
[01:38.500 --> 01:38.680] to
[01:38.680 --> 01:38.800] the
[01:38.800 --> 01:39.260] webcam?
*** [01:44.890 --> 01:45.270] A >>>>>>>>>>>>>> This is okay
[01:45.270 --> 01:45.410] little
[01:45.410 --> 01:45.610] bit
[01:45.610 --> 01:45.850] closer,
[01:46.070 --> 01:46.650] please.
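To make the difference between the runs easier to see, the gap between consecutive words can be computed directly from the printed timestamps. The following is only a minimal sketch: the helper names `parse_words` and `pauses` and the 1-second `min_gap` threshold are assumptions for illustration, not part of whisper-timestamped. The two input strings are excerpts from the auditok run and the run without VAD (large model), with the `***` and `>>>` annotations left out.

```python
import re

# Matches one "[MM:SS.mmm --> MM:SS.mmm] word" entry as printed in the samples above.
WORD_RE = re.compile(r"\[(\d+):(\d+\.\d+) --> (\d+):(\d+\.\d+)\]\s+(\S+)")

def parse_words(text):
    """Return a list of (start_seconds, end_seconds, word) tuples."""
    words = []
    for m1, s1, m2, s2, word in WORD_RE.findall(text):
        start = int(m1) * 60 + float(s1)
        end = int(m2) * 60 + float(s2)
        words.append((start, end, word))
    return words

def pauses(words, min_gap=1.0):
    """Return (previous_word, next_word, gap_seconds) for gaps of at least min_gap seconds."""
    found = []
    for (_, prev_end, prev_word), (next_start, _, next_word) in zip(words, words[1:]):
        gap = round(next_start - prev_end, 3)
        if gap >= min_gap:
            found.append((prev_word, next_word, gap))
    return found

# Excerpts copied from the large-model runs above (hypothetical variable names).
with_vad = "[01:38.660 --> 01:38.860] webcam? [01:39.120 --> 01:39.700] A [01:45.050 --> 01:45.250] little"
without_vad = "[01:38.660 --> 01:38.920] webcam? [01:44.240 --> 01:45.020] A [01:45.020 --> 01:45.260] little"

print(pauses(parse_words(with_vad)))     # [('A', 'little', 5.35)]
print(pauses(parse_words(without_vad)))  # [('webcam?', 'A', 5.32)]
```

With VAD, the long pause lands between "A" and "little"; without VAD, it lands between "webcam?" and "A", which is where it belongs.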

A sample audio file is attached as a zip archive to reproduce this.

Thanks in advance,
Freddy

sample.zip

LaurinmyReha commented 2 months ago

Maybe this variant will solve your problems.

https://github.com/nyrahealth/CrisperWhisper

Timestamps around pauses are notoriously bad for the Whisper model when using DTW, due to the tokenizer. More details can be found in the accompanying paper: https://arxiv.org/abs/2408.16589
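One way to see why pauses are hard for a DTW-based alignment: a standard DTW path assigns every audio frame to some token, so frames of silence have to be absorbed by a neighboring word. The sketch below is a generic textbook DTW in plain Python, not the implementation used by Whisper or whisper-timestamped, and the cost values are invented for illustration (low cost means the frame matches the token well).

```python
def dtw_path(cost):
    """Standard DTW over cost[token][frame]; returns the path as (token, frame) pairs."""
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    # acc[i][j] is the cheapest accumulated cost of aligning the first i tokens to the first j frames.
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1])
            acc[i][j] = cost[i - 1][j - 1] + best_prev
    # Backtrack from the last cell to the first, always taking the cheapest predecessor.
    i, j = n, m
    path = []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (acc[i - 1][j - 1], i - 1, j - 1),
            (acc[i][j - 1], i, j - 1),
            (acc[i - 1][j], i - 1, j),
        ]
        _, i, j = min(steps)
    return path[::-1]

# Two tokens ("A", "little") and six frames. Frame 0 clearly matches "A",
# frame 5 clearly matches "little", and frames 1-4 are silence that matches
# neither token better than the other (made-up costs).
cost = [
    [0.1, 0.5, 0.5, 0.5, 0.5, 0.9],  # token "A"
    [0.9, 0.5, 0.5, 0.5, 0.5, 0.1],  # token "little"
]
path = dtw_path(cost)
for token, frame in path:
    print(f"frame {frame} -> token {token}")
```

Every silent frame ends up attached to one of the two tokens, and because the costs are tied, the position of the word boundary inside the pause is decided by tie-breaking rather than by the audio.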