jianfch / stable-ts

Transcription, forced alignment, and audio indexing with OpenAI's Whisper
MIT License
1.59k stars, 176 forks

Undesired splitting with regroup #311

Closed: darioai closed this issue 7 months ago

darioai commented 9 months ago

@jianfch, first of all, thanks so much for creating stable-ts! This thing is amazing! Now, to my issue: I'm using regroup to keep segments at max_chars = 65, which, after multiple tries (I don't have any formal training in Python), is working as it should 👍 However, regroup is also splitting subtitles that are already under the set max_chars, as in the example below:

From:

00:00:00.096 --> 00:00:02.440
En un edificio como este es un juego

To:

00:00:00.096 --> 00:00:00.196
En

00:00:00.768 --> 00:00:02.440
un edificio como este es un juego

Any idea on how to fix this? Thanks!

jianfch commented 9 months ago

It seems to have split due to a gap from 00:00:00.196 to 00:00:00.768. Which methods did you use?
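The gap mentioned above can be checked with a few lines of arithmetic. This is only a sketch: `to_seconds` and `gap_between` are hypothetical helpers, and the 0.5 second threshold is an assumption taken from the `sg=.5` entry in the default regrouping string quoted later in this thread.

```python
def to_seconds(timestamp: str) -> float:
    """Convert an HH:MM:SS.mmm subtitle timestamp to seconds."""
    hours, minutes, seconds = timestamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def gap_between(end_of_previous: str, start_of_next: str) -> float:
    """Return the silence, in seconds, between two consecutive cues."""
    return to_seconds(start_of_next) - to_seconds(end_of_previous)


# Assumed threshold, based on sg=.5 in the default regrouping string.
ASSUMED_MAX_GAP = 0.5

# End of "En" and start of "un edificio como este es un juego".
gap = gap_between("00:00:00.196", "00:00:00.768")
exceeds_threshold = gap > ASSUMED_MAX_GAP  # roughly 0.572 s, so a split is plausible
```

If the threshold really is 0.5 seconds, a gap of about 0.57 seconds would be enough to explain why "En" ended up in its own segment.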

darioai commented 9 months ago

This is the command I'm using:

command = ["stable-ts", audio_path, "-o", transcription_path, "--model", "large-v3", "--vad", "true", "--segment_level", "true", "--word_level", "false", "--max_chars", str(max_chars_value)]

Is there a way to turn services off from the command?

darioai commented 9 months ago

I just added "--regroup", "sl" to the command and it took care of the problem. Thanks for the quick reply!

jianfch commented 9 months ago

When --max_chars or --max_words is specified, it calls split_by_length() after it regroups. Since --regroup true is the default, it will use the default regrouping algorithm which includes split_by_gap(). So disabling the default regrouping with --regroup false should achieve the same results with less compute. Note that --regroup "sl" will call split_by_length() without specifying max_chars or max_words which means it will do nothing.
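Applied to the command posted earlier in this thread, that suggestion might look like the sketch below. `build_command` is a hypothetical helper, not part of stable-ts; every flag in it is copied from this thread, and the only change is the explicit `--regroup false`.

```python
def build_command(audio_path, transcription_path, max_chars, regroup="false"):
    """Build the stable-ts command from this thread with regrouping set explicitly.

    With regroup="false" the default regrouping (which includes split_by_gap)
    is skipped, and --max_chars alone triggers split_by_length().
    """
    return [
        "stable-ts", audio_path,
        "-o", transcription_path,
        "--model", "large-v3",
        "--vad", "true",
        "--segment_level", "true",
        "--word_level", "false",
        "--regroup", regroup,
        "--max_chars", str(max_chars),
    ]


# Example usage (not executed here):
#   import subprocess
#   subprocess.run(build_command("audio.mp3", "audio.srt", 65), check=True)
```

Keeping `regroup` as a parameter makes it easy to pass a custom regrouping string later instead of disabling regrouping entirely.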

darioai commented 9 months ago

Is there a way to disable split_by_gap w/o disabling regroup?

jianfch commented 9 months ago

> Is there a way to disable split_by_gap w/o disabling regroup?

The default regrouping can be represented as "cm_sp=.* /。/?/?/,* /,_sg=.5_mg=.3+3_sp=.* /。/?/?", as True, or as the chain of regrouping methods shown in the section about regrouping. So simply remove sg=.5 to disable split_by_gap(.5). The first .* /。/?/? is also redundant, so remove that as well. You end up with --regroup "cm_sp=,* /,_mg=.3+3_sp=.* /。/?/?".
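Removing a method from a regrouping string by hand is easy to get wrong, so here is a small sketch of doing it programmatically. It assumes, based only on the strings quoted in this thread, that methods are separated by `_` and that each method's name is the part before `=`. `drop_methods` is a hypothetical helper, not a stable-ts function.

```python
# Default regrouping string as quoted in this thread.
DEFAULT_REGROUP = "cm_sp=.* /。/?/?/,* /,_sg=.5_mg=.3+3_sp=.* /。/?/?"


def drop_methods(regroup: str, *names: str) -> str:
    """Return the regrouping string without the methods whose names are given.

    Assumes "_" separates methods and "=" separates a method name from its
    arguments, as in the strings shown in this thread.
    """
    kept = [
        method
        for method in regroup.split("_")
        if method.split("=", 1)[0] not in names
    ]
    return "_".join(kept)


# Drop split_by_gap ("sg") and keep everything else in the same order.
without_gap_split = drop_methods(DEFAULT_REGROUP, "sg")
```

This only drops `sg=.5`. It does not remove the redundant first `.* /。/?/?` part mentioned above, so the result is slightly longer than the hand-edited string but should regroup the same way.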

darioai commented 9 months ago

Thanks!!!!

darioai commented 9 months ago

Since different caption systems have different requirements for the max chars/line, I've been using --max_chars = (user input max chars/line)*2+1. Then I use textwrap to rewrap subtitles that exceed the max characters/line into two lines. The rewrapping breaks the lines according to the following priorities:

a- At punctuation
b- Before conjunctions and prepositions
c- At the last space within max characters/line

There's still room for improvement, but so far the results are pretty good.
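For readers who want to try something similar, the following is a minimal sketch of the approach described above, not the poster's actual script. `max_chars_for`, `wrap_two_lines`, and the `BREAK_BEFORE` word list are hypothetical, and `textwrap` is used only as a fallback when no two-line split fits.

```python
import textwrap

# Hypothetical word list; the thread does not say which words are used.
BREAK_BEFORE = {"and", "but", "or", "in", "on", "of", "to", "with", "for", "over"}
PUNCTUATION = ",.;:!?"


def max_chars_for(per_line: int) -> int:
    """Segment limit for two lines plus the space that joins them."""
    return per_line * 2 + 1


def wrap_two_lines(text: str, per_line: int) -> list[str]:
    """Split text into two lines, preferring the break points listed above."""
    text = " ".join(text.split())
    if len(text) <= per_line:
        return [text]
    words = text.split(" ")
    best_key, best_lines = None, None
    for i in range(1, len(words)):
        first, second = " ".join(words[:i]), " ".join(words[i:])
        if len(first) > per_line or len(second) > per_line:
            continue  # this break would leave a line that is too long
        if first[-1] in PUNCTUATION:
            priority = 0  # a- break after punctuation
        elif words[i].lower() in BREAK_BEFORE:
            priority = 1  # b- break before a conjunction or preposition
        else:
            priority = 2  # c- plain space
        key = (priority, -i)  # within a priority, prefer the latest break
        if best_key is None or key < best_key:
            best_key, best_lines = key, [first, second]
    if best_lines is not None:
        return best_lines
    # No two-line split fits; fall back to plain whitespace wrapping.
    return textwrap.wrap(text, width=per_line)
```

With a limit of 32 characters per line, `max_chars_for(32)` gives 65, which matches the `max_chars` value used at the start of this thread.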