Closed (2600box closed this issue 1 year ago)
Concerning the "Sous-titrage Société Radio-Canada": it's just a training bias of Whisper models on silence. Because they were trained mostly on subtitled video, empty audio at the end of videos was regularly labeled "Sous-titrage ...", "Subtitles ...", "Thanks for having watched this video", ...
It seems that you are complaining about the regular VTT file, but I don't see anything wrong with the one you posted in the issue description. @2600box Can you maybe spot what's wrong / tell what you'd expect?
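One coarse workaround for this kind of hallucinated caption is to post-filter the cues against a list of known boilerplate phrases. The sketch below is only an illustration under that assumption: the phrase list and the helper names `looks_hallucinated` and `drop_hallucinations` are hypothetical, not part of Whisper or of this project, and a prefix match like this can also drop legitimate lines.

```python
# Hypothetical post-filter: the phrase list is an assumption taken from the
# examples quoted above, not an exhaustive or official list.
KNOWN_HALLUCINATIONS = (
    "sous-titrage",
    "subtitles",
    "thanks for having watched this video",
)

def looks_hallucinated(text, phrases=KNOWN_HALLUCINATIONS):
    """Return True if the cue text starts with a known boilerplate phrase."""
    normalized = text.strip().lower()
    return any(normalized.startswith(phrase) for phrase in phrases)

def drop_hallucinations(cue_texts):
    """Keep only the cue texts that do not look like training-data boilerplate."""
    return [text for text in cue_texts if not looks_hallucinated(text)]

cues = ["Sous-titrage Société Radio-Canada", "Tu as ce que je t'ai demandé ?", "Oui."]
kept = drop_hallucinations(cues)
print(kept)
```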
Sorry for the delay in following up. Interesting about the Radio-Canada thing, though there is no silence. It is music that plays through the opening sequence.
The problem is that the subtitles start too early when VAD and Disfluencies are set to true. They start at 00:00.490 but should not come in until 00:01:42,400.
You can see below that the regular subtitle ignores the music in the first section, which should be indicated by [*], if I correctly understand how this should work.
Here is the regular subtitle that starts too early:
WEBVTT
00:00.490 --> 01:43.800
Tu as ce que je t'ai demandé ?
01:43.860 --> 01:44.160
Oui.
01:44.160 --> 01:45.840
Voilà ce que vous devez trouver.
01:46.820 --> 01:47.340
Parfait.
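To put a number on how early that first cue starts, the two timestamps quoted above (one in VTT style, one in SRT style with a comma) can be converted to milliseconds and subtracted. This is a minimal sketch; `to_millis` is a helper written for this example, not a function of the tool.

```python
import re

def to_millis(stamp):
    """Convert 'MM:SS.mmm' or 'HH:MM:SS,mmm' (dot or comma) to integer milliseconds."""
    match = re.fullmatch(r"(?:(\d+):)?(\d{1,2}):(\d{2})[.,](\d{3})", stamp.strip())
    if match is None:
        raise ValueError(f"unrecognized timestamp: {stamp!r}")
    hours, minutes, seconds, millis = match.groups()
    total_seconds = (int(hours or 0) * 60 + int(minutes)) * 60 + int(seconds)
    return total_seconds * 1000 + int(millis)

actual_start = to_millis("00:00.490")        # start of the first regular cue
expected_start = to_millis("00:01:42,400")   # where the dialogue should begin
early_by = expected_start - actual_start

print(f"first cue starts {early_by / 1000:.2f} s early")
```

Working in integer milliseconds avoids floating-point rounding when comparing cue boundaries.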
Here is the Words subtitle, which is timed correctly. So there is something wrong with the conversion to regular subtitles.
WEBVTT
00:00.490 --> 00:06.290
[*]
01:29.110 --> 01:29.450
Tu
01:34.450 --> 01:35.760
[*]
01:42.300 --> 01:42.460
as
01:42.460 --> 01:42.560
ce
01:42.560 --> 01:42.700
que
01:42.700 --> 01:42.780
je
01:42.780 --> 01:42.940
t'ai
01:42.940 --> 01:43.800
demandé ?
01:43.860 --> 01:44.160
Oui.
01:44.160 --> 01:45.000
Voilà
01:45.000 --> 01:45.160
ce
01:45.160 --> 01:45.260
que
01:45.260 --> 01:45.380
vous
01:45.380 --> 01:45.580
devez
01:45.580 --> 01:45.840
trouver.
01:46.820 --> 01:47.340
Parfait.
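To illustrate what a conversion from these word cues to regular cues could look like without the early start, here is a minimal sketch. It assumes three rules that are not taken from the project's code: [*] markers are dropped, a new cue starts after a gap longer than `max_gap` seconds, and a cue ends at sentence-final punctuation. The names `Cue` and `group_words` are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Cue:
    start: float  # seconds
    end: float    # seconds
    text: str

def group_words(words, marker="[*]", max_gap=2.0):
    """Merge word cues into subtitle cues, skipping markers and splitting on long gaps."""
    cues = []
    current = None
    for word in words:
        if word.text == marker:
            continue  # disfluency/music marker: never part of a subtitle line
        if current is None or word.start - current.end > max_gap:
            current = Cue(word.start, word.end, word.text)
            cues.append(current)
        else:
            current.end = word.end
            current.text += " " + word.text
        if word.text.endswith((".", "?", "!")):
            current = None  # sentence finished: the next word opens a new cue
    return cues

# First sentence of the word-level file above, with times converted to seconds.
words = [
    Cue(0.49, 6.29, "[*]"), Cue(89.11, 89.45, "Tu"), Cue(94.45, 95.76, "[*]"),
    Cue(102.30, 102.46, "as"), Cue(102.46, 102.56, "ce"), Cue(102.56, 102.70, "que"),
    Cue(102.70, 102.78, "je"), Cue(102.78, 102.94, "t'ai"), Cue(102.94, 103.80, "demandé ?"),
    Cue(103.86, 104.16, "Oui."),
]
for cue in group_words(words):
    print(f"{cue.start:.2f} --> {cue.end:.2f}  {cue.text}")
```

With these rules the misplaced "Tu" at 01:29.110 ends up in a short cue of its own instead of stretching the first line back to 00:00.490; whether that word should be shifted, merged, or dropped is a separate decision that this sketch does not make.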
Hello, did you work on this issue? It is resolved with the latest version, 1.12.20, which is fantastic! Thanks for this project.
WEBVTT
01:42.100 --> 01:43.800
Tu as ce que je t'ai demandé ?
01:43.860 --> 01:44.160
Oui.
01:44.160 --> 01:45.840
Voilà ce que vous devez trouver.
01:46.940 --> 01:47.340
Parfait.
This produces a word.vtt that matches the audio, but the non-words .vtt makes the subtitles appear early and linger. I have tried with
--vad False --detect_disfluencies False
and this does not happen, but the timing is off. Also, with
--vad False --detect_disfluencies False
it starts with the line
Sous-titrage Société Radio-Canada
which is not audible and makes me very confused.
The correct timing for the first dialogue, with a separate line for each speaker, after the intro music should be, according to me:
The audio track is DTS 5.1, but I have uploaded an ffmpeg downmix to stereo AAC at 64 kbps: Stereo.m4a.zip
Any advice on getting an accurate result here and where does this
Sous-titrage Société Radio-Canada
come from?!
Vad False Disfluencies False
Vad True Disfluencies True - Words:
Vad True Disfluencies True - Regular subtitles: