linto-ai / whisper-timestamped

Multilingual Automatic Speech Recognition with word-level timestamps and confidence
GNU Affero General Public License v3.0

Words are correct but regular subtitles appear too early and linger? #85

Closed 2600box closed 1 year ago

2600box commented 1 year ago
whisper_timestamped --version
1.12.17
whisper_timestamped --debug --model large-v2 --accurate --vad True --detect_disfluencies True --output_dir .\ --language fr Audio.wav
100%|████████████████████████████| 160404/160404 [09:29<00:00, 281.77frames/s]
DEBUG:whisper_timestamped:Removing word 2881/3787 " ici ?" with empty duration at the end of segment 431/558
DEBUG:whisper_timestamped:Removing word 2880/3787 " fait-il" with empty duration at the end of segment 431/558
DEBUG:whisper_timestamped:Removing word 2879/3787 " que" with empty duration at the end of segment 431/558
DEBUG:whisper_timestamped:Removing word 97/3787 " retard." with empty duration at the end of segment 19/558
DEBUG:whisper_timestamped:Removing word 96/3787 " en" with empty duration at the end of segment 19/558
DEBUG:whisper_timestamped:Removing word 95/3787 " est" with empty duration at the end of segment 19/558
DEBUG:whisper_timestamped:Removing word 94/3787 " on" with empty duration at the end of segment 19/558
DEBUG:whisper_timestamped:Removing word 93/3787 " quand" with empty duration at the end of segment 19/558

This produces a word-level .vtt that matches the audio, but in the regular (segment-level) .vtt the subtitles appear too early and linger. With --vad False --detect_disfluencies False this does not happen, but the timing is still off.

Also, with --vad False --detect_disfluencies False the output starts with the line Sous-titrage Société Radio-Canada, which is not in the audio and confuses me.

In my view, the correct timing for the first dialogue after the intro music, with a separate line for each speaker, should be:

00:01:42,400 --> 00:01:43,359
Tu as ce que je t'ai demandé ?

00:01:44,080 --> 00:01:46,879
Oui, voilà ce que vous devez trouver.

The audio track is DTS 5.1, but I have uploaded an ffmpeg downmix to stereo AAC at 64 kbps: Stereo.m4a.zip

Any advice on getting an accurate result here? And where does this Sous-titrage Société Radio-Canada come from?!

Vad False Disfluencies False

WEBVTT

00:00.000 --> 00:02.280
Sous-titrage Société Radio-Canada

00:29.500 --> 00:40.160
...

00:59.500 --> 01:00.780
...

01:29.500 --> 01:33.280
...

01:41.500 --> 01:43.500
Tu as ce que je t'ai demandé ?

01:43.500 --> 01:45.840
Oui, voilà ce que vous devez trouver.

Vad True Disfluencies True - Words:

WEBVTT

00:00.490 --> 00:06.290
[*]

01:28.910 --> 01:29.220
tu

01:34.450 --> 01:35.730
[*]

01:42.300 --> 01:42.440
as

01:42.440 --> 01:42.580
ce

01:42.580 --> 01:42.680
que

39:44.390 --> 39:44.710
Quant

39:44.710 --> 39:44.890
à

39:44.890 --> 39:45.090
moi,

39:45.210 --> 39:45.370
c'est

39:45.370 --> 39:45.810
Dupont

39:45.810 --> 39:46.050
mais

39:46.050 --> 39:46.270
avec

39:46.270 --> 39:46.450
un

39:46.450 --> 39:46.750
D

39:48.390 --> 39:48.530
Vous

39:48.530 --> 39:48.730
vous

39:48.730 --> 39:49.350
rapprochez

39:49.350 --> 39:49.510
pour

39:49.510 --> 39:49.650
la

39:49.650 --> 39:50.050
photo

39:50.390 --> 39:50.810
Et

39:50.810 --> 39:51.130
sans

39:51.130 --> 39:51.710
oublier

39:51.710 --> 39:52.350
Milou

Vad True Disfluencies True - Regular subtitles:

WEBVTT

00:00.490 --> 01:48.520
tu as ce que je t'ai demandé oui voilà ce que vous devez trouver parfait le nom du

39:44.390 --> 39:46.750
Quant à moi, c'est Dupont mais avec un D

39:48.390 --> 39:50.050
Vous vous rapprochez pour la photo

39:50.390 --> 39:52.350
Et sans oublier Milou

Jeronymous commented 1 year ago

The "Sous-titrage Société Radio-Canada" line is just a training bias of the Whisper models on silence. They were trained mostly on subtitled video, so empty audio at the end of a video was regularly labeled "Sous-titrage ...", "Subtitles ...", "Thanks for having watched this video", and so on.
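As a workaround sketch, such captions can be dropped in post-processing. The snippet below assumes segments are dicts with "start", "end" and "text" keys (the usual shape of Whisper-style output, but an assumption here), and the blocklist and function names are hypothetical. Only the Radio-Canada phrase comes from this thread. This is not something whisper-timestamped does itself.

```python
import unicodedata


def normalize(text: str) -> str:
    """Casefold, NFC-normalize, and strip surrounding whitespace and punctuation."""
    text = unicodedata.normalize("NFC", text).casefold().strip()
    return text.strip(" .!?")


# Hypothetical blocklist: extend it with whatever phrases show up in your own runs.
KNOWN_HALLUCINATIONS = {
    normalize(phrase)
    for phrase in ("Sous-titrage Société Radio-Canada",)
}


def drop_hallucinated(segments):
    """Return only the segments whose text is not a known hallucinated caption."""
    return [s for s in segments if normalize(s["text"]) not in KNOWN_HALLUCINATIONS]


# Assumed segment shape, with times in seconds taken from the VTT in this issue.
segments = [
    {"start": 0.0, "end": 2.28, "text": " Sous-titrage Société Radio-Canada"},
    {"start": 101.5, "end": 103.5, "text": " Tu as ce que je t'ai demandé ?"},
]
kept = drop_hallucinated(segments)
print([s["text"].strip() for s in kept])
```

An exact-match blocklist is deliberately conservative: it will not remove real dialogue that merely contains one of these words.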

It seems that your complaint is about the regular VTT file, but I don't see anything wrong with the one you posted in the issue description. @2600box Can you point out what is wrong, or say what you would expect?

2600box commented 1 year ago

Sorry for the delay in following up. Interesting about the Radio-Canada thing, though there is no silence. It is music that plays through the opening sequence.

The problem is that the regular subtitles start too early when VAD and disfluency detection are enabled. The first cue starts at 00:00.490 but should not appear until 00:01:42,400.
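The two timestamps above are written in different notations (WebVTT uses a dot and may omit the hours, SRT uses a comma), so a small helper makes the offset explicit. This is only an illustrative sketch; `to_seconds` is a hypothetical helper, not part of whisper-timestamped.

```python
def to_seconds(stamp: str) -> float:
    """Parse 'MM:SS.mmm' or 'HH:MM:SS.mmm' (WebVTT) or 'HH:MM:SS,mmm' (SRT) into seconds."""
    parts = stamp.strip().replace(",", ".").split(":")
    seconds = 0.0
    for part in parts:
        # Each field to the left is worth 60 times the one to its right.
        seconds = seconds * 60 + float(part)
    return seconds


observed = to_seconds("00:00.490")     # start of the first regular cue
expected = to_seconds("00:01:42,400")  # where the dialogue actually begins
print(f"first cue starts {expected - observed:.2f} s too early")
```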

You can see below that the regular subtitle ignores the music, which should be indicated by [*] for the first section, if I understand correctly how this should work.

Here is the regular subtitle that starts too early:

WEBVTT

00:00.490 --> 01:43.800
Tu as ce que je t'ai demandé ?

01:43.860 --> 01:44.160
Oui.

01:44.160 --> 01:45.840
Voilà ce que vous devez trouver.

01:46.820 --> 01:47.340
Parfait.

Here is the Words subtitle, which is timed correctly. So something goes wrong in the conversion to regular subtitles.

WEBVTT

00:00.490 --> 00:06.290
[*]

01:29.110 --> 01:29.450
Tu

01:34.450 --> 01:35.760
[*]

01:42.300 --> 01:42.460
as

01:42.460 --> 01:42.560
ce

01:42.560 --> 01:42.700
que

01:42.700 --> 01:42.780
je

01:42.780 --> 01:42.940
t'ai

01:42.940 --> 01:43.800
demandé ?

01:43.860 --> 01:44.160
Oui.

01:44.160 --> 01:45.000
Voilà

01:45.000 --> 01:45.160
ce

01:45.160 --> 01:45.260
que

01:45.260 --> 01:45.380
vous

01:45.380 --> 01:45.580
devez

01:45.580 --> 01:45.840
trouver.

01:46.820 --> 01:47.340
Parfait.
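For illustration, regular cues can be rebuilt from the word-level cues above so that a cue never inherits the start time of a [*] marker. This is a minimal sketch under stated assumptions, not how whisper-timestamped builds its segments: the `words_to_cues` function, the 1-second `max_gap` threshold and the rule of splitting on sentence-final punctuation are all hypothetical choices.

```python
DISFLUENCY = "[*]"


def words_to_cues(words, max_gap=1.0):
    """Group (start, end, text) words into cues.

    Disfluency markers are skipped. A new cue starts after a pause longer
    than max_gap seconds or after sentence-final punctuation.
    """
    cues = []
    current = None
    for start, end, text in words:
        if text == DISFLUENCY:
            continue
        if (
            current is None
            or start - current["end"] > max_gap
            or current["text"].endswith((".", "?", "!"))
        ):
            current = {"start": start, "end": end, "text": text}
            cues.append(current)
        else:
            current["end"] = end
            current["text"] += " " + text
    return cues


# Word cues copied from the Words subtitle above, converted to seconds.
words = [
    (0.49, 6.29, "[*]"), (89.11, 89.45, "Tu"), (94.45, 95.76, "[*]"),
    (102.30, 102.46, "as"), (102.46, 102.56, "ce"), (102.56, 102.70, "que"),
    (102.70, 102.78, "je"), (102.78, 102.94, "t'ai"), (102.94, 103.80, "demandé ?"),
    (103.86, 104.16, "Oui."), (104.16, 105.00, "Voilà"), (105.00, 105.16, "ce"),
    (105.16, 105.26, "que"), (105.26, 105.38, "vous"), (105.38, 105.58, "devez"),
    (105.58, 105.84, "trouver."), (106.82, 107.34, "Parfait."),
]
cues = words_to_cues(words)
for cue in cues:
    print(f'{cue["start"]:7.2f} --> {cue["end"]:7.2f}  {cue["text"]}')
```

With this data, the word "Tu" ends up as its own short cue at 01:29, because its word-level timestamp is itself early. A gap-based regrouping cannot repair that; it only stops the first cue from stretching back to 00:00.490.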

2600box commented 1 year ago

Hello, did you work on this issue? It is resolved in the latest version, 1.12.20, which is fantastic! Thanks for this project.

WEBVTT

01:42.100 --> 01:43.800
Tu as ce que je t'ai demandé ?

01:43.860 --> 01:44.160
Oui.

01:44.160 --> 01:45.840
Voilà ce que vous devez trouver.

01:46.940 --> 01:47.340
Parfait.