jianfch / stable-ts

Transcription, forced alignment, and audio indexing with OpenAI's Whisper
MIT License

wrong subtitle timing #192

Closed electro199 closed 1 year ago

electro199 commented 1 year ago

Sometimes the transcription misses some segments and stretches the segment after the gap back to where the missed segments started.

The timing error is not consistent; it usually appears after the 40-second mark, and it does not happen on every run.

Here is a trimmed version of the clip: https://github.com/jianfch/stable-ts/assets/109358640/6257d3a4-bac5-4b48-84cb-d492373d64e9

The ASS file shows the same thing:

Dialogue: 185,0:03:7.72,0:03:8.64,Default,,0,0,0,,I'm still wearing my
Dialogue: 186,0:03:8.64,0:03:9.50,Default,,0,0,0,,radioactive suit.
Dialogue: 187,0:03:10.10,0:03:11.08,Default,,0,0,0,,Did I mention I had that
Dialogue: 188,0:03:11.08,0:03:11.28,Default,,0,0,0,,on?
Dialogue: 189,0:03:11.82,0:03:12.06,Default,,0,0,0,,No?
Dialogue: 190,0:03:12.56,0:03:12.90,Default,,0,0,0,,I did.
Dialogue: 191,0:03:13.54,0:03:14.58,Default,,0,0,0,,The car is now dead as
Dialogue: 192,0:03:14.58,0:03:14.74,Default,,0,0,0,,Doc,
Dialogue: 193,0:03:15.26,0:03:16.34,Default,,0,0,0,,so I push it behind the
Dialogue: 194,0:03:16.34,0:03:17.20,Default,,0,0,0,,billboard for now.
Dialogue: 195,0:03:17.76,0:03:18.62,Default,,0,0,0,,I'm completely lost.
Dialogue: 196,0:03:19.22,0:03:20.52,Default,,0,0,0,,No idea what drugs the
Dialogue: 197,0:03:20.52,0:03:21.60,Default,,0,0,0,,old man slipped me this
Dialogue: 198,0:03:21.60,0:03:21.94,Default,,0,0,0,,time.
Dialogue: 199,0:03:22.58,0:03:23.60,Default,,0,0,0,,Maybe the same as last
Dialogue: 200,0:03:23.60,0:03:23.94,Default,,0,0,0,,night,
Dialogue: 201,0:03:24.24,0:03:25.28,Default,,0,0,0,,but this is the craziest
Dialogue: 202,0:03:25.28,0:03:25.36,Default,,0,0,0,,thing I've ever seen.
Dialogue: 203,0:03:25.38, 0:03:42.36,Default,,0,0,0,,I'm still in the  <---------------------------------------------------
Dialogue: 204,0:03:42.36,0:03:43.22,Default,,0,0,0,,radioactive suit,
Dialogue: 205,0:03:43.58,0:03:44.60,Default,,0,0,0,,and his fire is around
Dialogue: 206,0:03:44.60,0:03:45.30,Default,,0,0,0,,from his shotgun.
Dialogue: 207,0:03:45.54,0:03:46.36,Default,,0,0,0,,I run back into the
Dialogue: 208,0:03:46.36,0:03:46.64,Default,,0,0,0,,barn,

The script I am using:

import stable_whisper

# options, settings, and sub_styling come from the rest of the application (not shown)
model = stable_whisper.load_model("base")

sub_path = "postaudio.ass"
result = model.transcribe("postaudio.mp3", **options)  # type: ignore
result: stable_whisper.WhisperResult = result.split_by_length(settings.config["subtitle"]["characters_at_time"])  # type: ignore
result.to_ass(
    sub_path,
    tag=(r"{\1c&H34ebde&}", r"{\r}"),  # not working either
    highlight_color=False,  # type: ignore
    font_size=0,
    word_level=False,
    **sub_styling
)

I ran the program again to test. After rerunning the model, there were no missing segments (I have a script that checks for empty space between segments in the JSON file).

stable-ts assets\temp\9tonul\audio.mp3 -o out.json
stable-ts out.json -o out.ass --max_chars 25
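A gap check of the kind mentioned above could look roughly like this. It is only a minimal sketch, not the script actually used in this report: it assumes the JSON written by the CLI has a top-level `segments` list whose items carry `start`, `end`, and `text`, and the function names and thresholds are hypothetical.

```python
import json
from typing import Dict, List


def find_timing_anomalies(segments: List[Dict], max_duration: float = 10.0,
                          max_gap: float = 5.0) -> List[str]:
    """Report segments that last suspiciously long or follow a long silence."""
    problems = []
    prev_end = None
    for i, seg in enumerate(segments):
        start, end = float(seg["start"]), float(seg["end"])
        # A short phrase stretched over many seconds is the symptom in this issue.
        if end - start > max_duration:
            text = seg.get("text", "").strip()
            problems.append(f"segment {i}: lasts {end - start:.2f}s ({text!r})")
        # A long empty stretch before a segment may mean speech was skipped.
        if prev_end is not None and start - prev_end > max_gap:
            problems.append(f"segment {i}: starts {start - prev_end:.2f}s after the previous one ends")
        prev_end = end
    return problems


def check_file(json_path: str, **thresholds: float) -> List[str]:
    """Load a result file (assumed layout: {"segments": [...]}) and check it."""
    with open(json_path, encoding="utf-8") as f:
        data = json.load(f)
    return find_timing_anomalies(data["segments"], **thresholds)
```

Calling `check_file("out.json")` returns a list of human-readable warnings; an empty list means nothing exceeded the thresholds, which should be tuned to the material.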

I reran the script and now other subtitle segments are missing.

Dialogue: 227,0:03:49.02,0:03:49.82,Default,,0,0,0,,I blast through that
Dialogue: 228,0:03:49.82,0:03:50.78,Default,,0,0,0,,wooden barn door like
Dialogue: 229,0:03:50.78,0:03:51.34,Default,,0,0,0,,it's plywood,
Dialogue: 230,0:03:51.86,0:03:53.22,Default,,0,0,0,,and fly past the old guy
Dialogue: 231,0:03:53.22,0:03:53.94,Default,,0,0,0,,and his family.
Dialogue: 232,0:03:55.54,0:04:22.32,Default,,0,0,0,,I walk into a diner.  <-------- this time with delay 
Dialogue: 233,0:04:22.56,0:04:23.74,Default,,0,0,0,,I either meet him first,
Dialogue: 234,0:04:24.12,0:04:25.18,Default,,0,0,0,,a day later, or my young
Dialogue: 235,0:04:25.18,0:04:26.04,Default,,0,0,0,,man. My younger did next.
Dialogue: 236,0:04:26.42,0:04:27.46,Default,,0,0,0,,Dad's getting bullied by
Dialogue: 237,0:04:27.46,0:04:28.44,Default,,0,0,0,,a younger biff or about
Dialogue: 238,0:04:28.44,0:04:28.78,Default,,0,0,0,,to be,

The audio I ran the transcription on

https://github.com/jianfch/stable-ts/assets/109358640/7fe7b427-49ed-454f-ac28-a963187d940f

electro199 commented 1 year ago

Also, in the latest versions it is impossible to use

tag=(r"{\1c&H34ebde&}", r"{\r}"),

because recent changes prevent the function responsible for applying tags from ever being used.

I have a fix for that

def result_to_ass(result: (dict, list),
                  filepath: str = None,
                  segment_level=True,
                  word_level=True,
                  min_dur: float = 0.02,
                  tag: Tuple[str, str] = None,
                  font: str = None,
                  use_tag = True, # this change here
                  font_size: int = 24,
                  strip=True,
                  highlight_color: str = None,
                  karaoke=False,
                  reverse_text: Union[bool, tuple] = False,
                  **kwargs):
    """

    Generate Advanced SubStation Alpha (ASS) file from result to display segment-level and/or word-level timestamp.

    Note: ass file is used in the same way as srt, vtt, etc.

    string of content if no [filepath] is provided, else None

    """
    if highlight_color is None and (karaoke or (word_level and segment_level)):
        highlight_color = '00ff00'

    ...
    ...

    return result_to_any(
        result=result,
        filepath=filepath,
        filetype='ass',
        segments2blocks=segments2blocks,
        segment_level=segment_level,
        word_level=word_level,
        min_dur=min_dur,
        tag=tag,
        default_tag=(r'{\1c' + f'{highlight_color}&' + '}', r'{\r}'),
        strip=strip,
        reverse_text=reverse_text,
        to_word_level_string_callback= None if use_tag else   ( # and here
            lambda s, t: to_ass_word_level_segments(s, t, karaoke=karaoke)
            if karaoke or (word_level and segment_level)
            else None
        )
    )

jianfch commented 1 year ago

Thanks for pointing out the issue with result_to_ass(). tag should work as intended in the latest commit. Note that tag is ignored if word_level=False. I couldn't replicate the timing issue with the audio clip you provided using the default options. Do you know which arguments you passed into transcribe() with **options?

electro199 commented 1 year ago

Yes, I was using word_level=False to turn off the effects; otherwise it uses the default green highlight instead of the tag I passed. The error happens randomly: the same audio sometimes gives the right result and sometimes does not. I also noticed that running from the CLI with JSON output is less likely to produce the error.

The only option in **options is the language, set to "en".

More info that may help in recreating the error:

No GPU was used, and the base model was used.

electro199 commented 1 year ago

If you transcribe multiple times, you may be able to recreate the error. Transcribing from a Python program causes errors most of the time for me.

jianfch commented 1 year ago

This nondeterministic behavior might be due to the temperature fallback. Use temperature=0 in Python, or --temperature_increment_on_fallback None for the CLI, to make it deterministic.
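In a Python program, the suggestion above amounts to passing `temperature=0` to `transcribe()`. A minimal sketch follows; the wrapper name `transcribe_without_fallback` is hypothetical and only exists to make sure the setting cannot be overridden by whatever is already in `**options`.

```python
from typing import Any


def transcribe_without_fallback(model: Any, audio_path: str, **options: Any) -> Any:
    """Call model.transcribe() with the temperature pinned to 0."""
    options = dict(options)      # copy so the caller's dict is not modified
    options["temperature"] = 0   # a single value leaves no higher temperatures to fall back to
    return model.transcribe(audio_path, **options)


# Usage (needs stable-ts installed and downloads the model, so it is not run here):
#   import stable_whisper
#   model = stable_whisper.load_model("base")
#   result = transcribe_without_fallback(model, "postaudio.mp3", language="en")
```

The wrapper accepts any object with a `transcribe()` method, so the same call can be pointed at a vanilla Whisper model when comparing the two.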

electro199 commented 1 year ago

Sure, I will try this.

electro199 commented 1 year ago

After testing multiple times, I am still having the issue (although much less often).

electro199 commented 1 year ago

This time the transcription had the same line repeated for about 70% of the script (this run used an initial prompt; I was not using an initial prompt before).

electro199 commented 1 year ago

Also, transcription of short audio works fine.

electro199 commented 1 year ago

Also, some results show the old behavior of leaving an empty delay and then stretching the next segment.

electro199 commented 1 year ago

Transcriptions of audio shorter than 60 seconds have the issue about 1 time out of 10. In the one run with the error, all of the segments in the last 10 seconds were missing; the audio was around 58 seconds long.

jianfch commented 1 year ago

"After testing multiple times, I am still having the issue (although much less often)."

If it is still not deterministic after setting temperature to 0, this issue is likely caused by factors outside of Stable-ts. It is like multiplying the same numbers but getting different results each time. You can try it with just Whisper to see if you get similar behavior.

electro199 commented 1 year ago

After testing with Whisper, I am not getting any issues.

jianfch commented 1 year ago

The transcripts were exactly the same across the hundreds of runs I tested on my end with the audio you provided. I also ran a small test in Colab and the results were consistent as well: https://colab.research.google.com/drive/1eqFZqXAIR_NNvgfI-SB-1Fqkd0GMOm2r?usp=sharing

electro199 commented 1 year ago

Is there a logger I can use to get more information on why this is happening in my application?

jianfch commented 1 year ago

Can you share the results as JSON files from a run with the issue and one without it? If you're using version 2.9+, see if you can reproduce the same issue with version 2.8.1. If the issue doesn't occur with 2.8.1, then it is likely a bug in 2.9.
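Before sharing, the two JSON files can be compared locally to see where a good run and a bad run first diverge. This is a rough sketch under assumptions: each file is taken to hold a top-level `segments` list with `start`, `end`, and `text`, and the helper names `load_segments` and `diff_segments` are made up for illustration.

```python
import json
from typing import Dict, List, Tuple


def load_segments(json_path: str) -> List[Dict]:
    """Read the segment list from a result file (assumed layout: {"segments": [...]})."""
    with open(json_path, encoding="utf-8") as f:
        return json.load(f)["segments"]


def diff_segments(good: List[Dict], bad: List[Dict],
                  tolerance: float = 0.05) -> List[Tuple[int, str]]:
    """Compare two runs position by position and list the segments that differ."""
    diffs = []
    for i, (a, b) in enumerate(zip(good, bad)):
        if a["text"].strip() != b["text"].strip():
            diffs.append((i, "text differs"))
        elif (abs(a["start"] - b["start"]) > tolerance
              or abs(a["end"] - b["end"]) > tolerance):
            diffs.append((i, "timing differs"))
    if len(good) != len(bad):
        # Report the first index that exists in only one of the two runs.
        diffs.append((min(len(good), len(bad)), "segment count differs"))
    return diffs
```

Because the comparison is by position, everything after a skipped segment will also be reported; the first entry in the returned list is usually the one worth looking at.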

electro199 commented 1 year ago

Yes, I can send you raw JSON files.

I have tested mostly on 2.6.4; I also tested newer versions and got the same issue.

I am now using vanilla Whisper and converting the Python dict to ASS with stable-ts, and it is working fine for me.

jianfch commented 1 year ago

The JSON could help narrow down the cause of the issue. Have you tested the latest version 2.9.0? There were changes to how it chooses where it begins transcribing every audio chunk.