linto-ai / whisper-timestamped

Multilingual Automatic Speech Recognition with word-level timestamps and confidence
GNU Affero General Public License v3.0

Add a max_line_length parameter to subtitle files #31

Closed ubanning closed 1 year ago

ubanning commented 1 year ago

Hello, first of all thanks for your work, I'm here to give a suggestion.

With word-level timestamps I think it would be possible to add a character limit per line/time in SRT and VTT subtitle files without using a simple line break.

Depending on the audio, a single line can exceed 200 characters, and I believe this implementation would fix that.

If it's not possible to add this parameter, when you have time, could you provide me with some code that would make this idea work? (I'm not from the programming area and I have a little difficulty)

Here's a discussion on the subject on Whisper so you can understand a little better: Improve default line lengths in subtitle files

Thanks.

Jeronymous commented 1 year ago

Thank you @ubanning for the suggestion.

Just to make sure I got your idea: you are suggesting to use the word timestamps to cut long segments into several smaller ones (based on the number of characters per segment), right? For instance, instead of having

36
00:00:10,220 --> 00:00:18,260
Giving away to Grealish, whose camp is clear of Hoibier, and goes over Romero, who is going to walk.

we could have

36
00:00:10,220 --> 00:00:14,840
Giving away to Grealish, whose camp is clear of Hoibier,

37
00:00:14,840 --> 00:00:18,260
and goes over Romero, who is going to walk.

I'm thinking that the best way to achieve this would be to do it in a separate script that would take as input the .words.json files generated by whisper_timestamped, and produce SRT / VTT files (using a max_line_length option). Would you feel comfortable with that option?
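The splitting idea described above can be sketched in a few lines of Python. This is only a minimal illustration, not the project's code: the function name `split_segment` is hypothetical, and it assumes each word is a dict with `text`, `start` and `end` keys. The demo timestamps are made up.

```python
def split_segment(words, max_length):
    """Greedily group words into sub-segments of at most max_length characters.

    `words` is assumed to be a list of dicts with "text", "start" and "end"
    keys (an assumption about the word-level JSON layout).
    """
    chunks = []
    current = []
    for word in words:
        candidate = " ".join(w["text"] for w in current + [word])
        if current and len(candidate) > max_length:
            # Adding this word would overflow the limit: close the chunk.
            chunks.append(current)
            current = [word]
        else:
            current.append(word)
    if current:
        chunks.append(current)
    # Each sub-segment starts with its first word and ends with its last one.
    return [
        {
            "text": " ".join(w["text"] for w in chunk),
            "start": chunk[0]["start"],
            "end": chunk[-1]["end"],
        }
        for chunk in chunks
    ]


# Made-up word timings, for illustration only.
demo_words = [
    {"text": "one", "start": 0.0, "end": 0.5},
    {"text": "two", "start": 0.5, "end": 1.0},
    {"text": "three", "start": 1.0, "end": 1.5},
    {"text": "four", "start": 1.5, "end": 2.0},
]
demo_chunks = split_segment(demo_words, max_length=7)
print([c["text"] for c in demo_chunks])  # ['one two', 'three', 'four']
```

A word longer than `max_length` still gets a chunk of its own, since a single word cannot be cut using word-level timestamps alone.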

ubanning commented 1 year ago

Exactly, this is what I would like. Your example is perfect.

It can be the way you feel most comfortable and think it's better 😊 I think maybe something that could complicate it is something related to the punctuation, so that it doesn't get cut in half. (which is not the case in your example, but which may happen in the future)

Thanks for the help and the answer.

Jeronymous commented 1 year ago

OK, I made a script in whisper_timestamped/make_subtitles.py (which is available as the whisper_timestamped_make_subtitles command after "setup install"). It takes the words.json files produced by "whisper_timestamped" and produces SRT and/or VTT files with a maximum character length for all the segments, and a preference to cut after punctuation marks (as you suggest):

# whisper_timestamped_make_subtitles -h
usage: whisper_timestamped_make_subtitles [-h] [--max_length MAX_LENGTH] [--format {srt,vtt,all}] input output

Convert .word.json transcription files (output of whisper_timestamped) to srt or vtt, being able to cut long segments

positional arguments:
  input                 Input json file, or input folder
  output                Output srt or vtt file, or output folder

optional arguments:
  -h, --help            show this help message and exit
  --max_length MAX_LENGTH
                        Maximum length of a segment in characters (default: 200)
  --format {srt,vtt,all}
                        Output format (if the output is a folder, i.e. not a file with an explicit extension) (default: all)

Feel free to produce any feedback.

Thanks @ubanning
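For readers curious how a "prefer to cut after punctuation" rule might work, here is a rough sketch. It is not the code of make_subtitles.py: the names `split_prefer_punctuation`, `srt_timestamp` and `to_srt` are hypothetical, the word dicts are assumed to carry `text`, `start` and `end` keys, and the demo timings are invented.

```python
PUNCTUATION = tuple(",.;:!?")


def split_prefer_punctuation(words, max_length):
    """Split words into chunks of at most max_length characters,
    cutting after a punctuation mark whenever one is available."""
    chunks = []
    start = 0
    while start < len(words):
        length = 0
        last_fit = start    # the first word is always taken, even if too long
        last_punct = None   # last fitting word that ends with punctuation
        for i in range(start, len(words)):
            length += len(words[i]["text"]) + (1 if i > start else 0)
            if i > start and length > max_length:
                break
            last_fit = i
            if words[i]["text"].endswith(PUNCTUATION):
                last_punct = i
        if last_fit == len(words) - 1:
            end = last_fit  # everything that remains fits: no cut needed
        else:
            end = last_punct if last_punct is not None else last_fit
        chunks.append(words[start:end + 1])
        start = end + 1
    return chunks


def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def to_srt(chunks, first_index=1):
    """Render chunks of words as numbered SRT blocks."""
    blocks = []
    for n, chunk in enumerate(chunks, start=first_index):
        text = " ".join(w["text"] for w in chunk)
        times = f"{srt_timestamp(chunk[0]['start'])} --> {srt_timestamp(chunk[-1]['end'])}"
        blocks.append(f"{n}\n{times}\n{text}\n")
    return "\n".join(blocks)


# Invented word timings, for illustration only.
demo_words = [
    {"text": t, "start": i * 0.5, "end": (i + 1) * 0.5}
    for i, t in enumerate(["Hello,", "world", "this", "is", "fine."])
]
demo_chunks = split_prefer_punctuation(demo_words, max_length=12)
print(to_srt(demo_chunks))
```

With this rule, "Hello," becomes a subtitle of its own even though "Hello, world" would fit in 12 characters. That is the trade-off of always favouring a punctuation cut; a real implementation may well weigh the two criteria differently.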

dobaret commented 1 year ago

Hello @Jeronymous,

I'm trying to create shorter subtitles with your instructions but I'm currently failing.

Here's my input: whisper_timestamped_make_subtitles --max_length 43 example.words.json ./vtt

And here's the result:

Traceback (most recent call last):
  File "C:\Users\dorian.baret\AppData\Local\Programs\Python\Python39\Scripts\whisper_timestamped_make_subtitles-script.py", line 33, in <module>
    sys.exit(load_entry_point('whisper-timestamped==1.9.1', 'console_scripts', 'whisper_timestamped_make_subtitles')())
  File "c:\users\dorian.baret\appdata\local\programs\python\python39\lib\site-packages\whisper_timestamped\make_subtitles.py", line 121, in cli
    segments = split_long_segments(segments, args.max_length, use_space=use_space)
  File "c:\users\dorian.baret\appdata\local\programs\python\python39\lib\site-packages\whisper_timestamped\make_subtitles.py", line 22, in split_long_segments
    assert len(words) == len(meta_words)
AssertionError

Is there something wrong with my input, or is this a bug perhaps?

Jeronymous commented 1 year ago

Thanks for reporting with all the details.

Your command line seems to be correct, so it seems to be a bug (a corner case that is not handled well).

Is there a chance that you can provide the example.words.json? (you can enclose it here in a zip)

dobaret commented 1 year ago

Bonjour Jérôme,

Thank you for your answer, unfortunately I can't share the files as it's non-public corporate content.

However, I think I've identified the issue: the script fails every time I feed it a video in French. The videos from my company are in French, and I also tried using some videos from YouTube in French (three different ones, from different channels), and they all fail.

Spanish works fine, so it doesn't seem to be all videos that are not in English.

Jeronymous commented 1 year ago

I am French and tested the script quite thoroughly in French (with things like "Dis-moi, est-ce que l'avion vole?"), so that does not explain it... I would be surprised if it failed for you on every video in French...

Anyway, I've just pushed a fix, which should also print a "WARNING: xxx != yyy" for these corner cases that I don't understand. If you can share one of these corner cases (anonymizing some parts if necessary), I would appreciate it, so that I can understand what is going on.

> I also tried using some videos from YouTube in French (three different ones, from different channels)

Or maybe you can share the "words.json" files for these videos, which are presumably public.

dobaret commented 1 year ago

I've just updated and it does seem to work now, except that there now seems to be an issue with the encoding of accents.

Here's a corner case (with wonky accents):

WARNING: Je peux cliquer dessus pour les masquer, ou je peux directement à partir de là , sélectionner tout ce qui est usinable ou pas usinable. != Je peux cliquer dessus pour les masquer, ou je peux directement à partir de là, sélectionner tout ce qui est usinable ou pas usinable.

And here are two of the JSON files that failed this morning: words_whisper_timestamped_french.zip

Jeronymous commented 1 year ago

Oh I see, your default encoding is not UTF-8, and I was not explicitly setting the encoding of the files when reading/writing in make_subtitles.py.

It should be fixed now.

(and it was not failing for me on the json you sent, so I guess it was also an encoding issue)

Many thanks!
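For anyone hitting similar accent problems, the general remedy is to pass `encoding="utf-8"` explicitly to `open()` instead of relying on the platform default (often cp1252 on Windows). The sketch below illustrates that idea only; it is not the actual patch, and the helper names `load_json` and `write_text` are hypothetical.

```python
import json
import os
import tempfile


def load_json(path):
    """Read a JSON file as UTF-8, whatever the platform default encoding is."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def write_text(path, content):
    """Write text as UTF-8, whatever the platform default encoding is."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(content)


# Round trip through a temporary file: the accents survive.
sentence = "à partir de là, sélectionner"
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.words.json")
    write_text(path, json.dumps({"text": sentence}, ensure_ascii=False))
    restored = load_json(path)["text"]

# What goes wrong otherwise: UTF-8 bytes decoded as cp1252 turn "à" into
# "Ã" followed by a non-breaking space.
garbled = "à".encode("utf-8").decode("cp1252")
```

Here `restored` equals the original sentence, while `garbled` shows the kind of mojibake that makes a text comparison between the segment and its words fail.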

dobaret commented 1 year ago

Works like a charm now, thank you!

g-vidal commented 1 year ago

Works perfectly, many thanks 👍