jianfch / stable-ts

Transcription, forced alignment, and audio indexing with OpenAI's Whisper
MIT License

Any options to prevent "flickering" #355

Open nns2009 opened 6 months ago

nns2009 commented 6 months ago

Let's say one segment ends at 4.7s and the next one starts at 4.8s. This causes the first subtitle to disappear slightly before the next one appears, so no subtitle is shown for a brief moment - "flickering". I would prefer (when the subsequent segments are "close enough") to hold the previous segment a bit longer and/or start the following segment a bit earlier, so one subtitle is swapped directly for the next. Is that possible?

Related question: is it possible to begin all segments (which do have some space before/after them) a bit earlier and to end them a bit later, thus making the timings less exact, but allowing the viewer more reading time?

P.S. Thanks for the library! It works great.

jianfch commented 6 months ago

There are no options that directly extend timestamps, but you can prevent "flickering" with regrouping methods such as merge_by_gap() (for the CLI: --regroup da_mg).
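
For example, a call along these lines should merge segments that are separated by only a small gap (the parameter name and value here are illustrative; check the merge_by_gap() docs for the exact signature):

import stable_whisper

model = stable_whisper.load_model('base')
result = model.transcribe('audio.mp3')
# merge neighboring segments whose gap is small (illustrative value)
result.merge_by_gap(min_gap=0.1)
result.to_srt_vtt('audio.srt')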

However, if you want to preserve the segments as they are and only change the timestamps, you'll need something like this:

import stable_whisper

model = stable_whisper.load_model('base')
result = model.transcribe('audio.mp3')

# extend each segment's end up to the next segment's start
# whenever the gap between them is 100 ms or less
for i, segment in enumerate(result):
    if i + 1 == len(result):
        break  # the last segment has no following segment
    next_start = result[i + 1].start
    if next_start - segment.end <= 0.100:
        segment.end = next_start

> Related question: is it possible to begin all segments (which do have some space before/after them) a bit earlier and to end them a bit later, thus making the timings less exact, but allowing the viewer more reading time?

Likewise, there are no options for this because all the options in stable-ts are centered around tightening the timestamps to their words. So something like the script above should do.
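
A rough sketch of that padding (not part of stable-ts; the `pad` value and loop below are only an illustration, assuming `result` is the WhisperResult from the snippet above and `result.segments` is its list of segments):

# illustrative only: pad every segment's start/end by up to `pad` seconds,
# clamped so it never overlaps the neighboring segments
pad = 0.3
segments = result.segments
for i, segment in enumerate(segments):
    prev_end = segments[i - 1].end if i > 0 else 0.0
    next_start = segments[i + 1].start if i + 1 < len(segments) else segment.end + pad
    segment.start = max(prev_end, segment.start - pad)
    segment.end = min(next_start, segment.end + pad)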

nns2009 commented 6 months ago

@jianfch Thanks a lot for the answer and the example! Based on it and the documentation, I wrote a script that serves the purpose: good_subs.txt (.txt extension because GitHub won't allow the upload otherwise). It does three things:

That said, I think it would be a real benefit to have such functionality in stable-ts itself, as it's something end users need.

This library and my script helped me a lot to get a good subtitle draft (much better than YouTube's auto-subs). Unfortunately, I still had to manually adjust many timings. Can you suggest the best settings (refine settings, I guess) in terms of quality when one doesn't care about execution time? I don't mind running it for 10 hours on a 10-minute video if it saves me my own time.

jianfch commented 6 months ago

> Can you suggest the best settings (refine settings, I guess) in terms of quality when one doesn't care about execution time?

refine() is still rather experimental, so there are no specific settings that will produce higher-quality timings than others; the result can vary from case to case. Generally, the key is a balance between the settings that control how much to deviate from the initial confidence scores (e.g. rel_prob_decrease) and the number of steps: when the former is high, use fewer steps, and vice versa. The default settings play it safe by using low values and a low number of steps to avoid any drastic changes in the timestamps.
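
As a rough starting point for experimenting (the value below is illustrative, not a recommendation, and assumes refine() accepts the keyword arguments described in the docs):

import stable_whisper

model = stable_whisper.load_model('base')
result = model.transcribe('audio.mp3')
# illustrative value only: raising rel_prob_decrease lets refine() deviate
# more from the initial confidence scores, so pair it with fewer steps,
# and vice versa
model.refine('audio.mp3', result, rel_prob_decrease=0.5)
result.to_srt_vtt('audio.srt')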

Note that refine() was not working properly, which might be why you did not see all the changes after using it. That was fixed in 864b76c1d0b8946638dfda6fb6ed577958c5c578.