ggerganov / whisper.cpp

Port of OpenAI's Whisper model in C/C++
MIT License

Large model starts to repeat itself / gets stuck on a phrase #924

Open timmermansjoy opened 1 year ago

timmermansjoy commented 1 year ago

Hi there. Using the large model, it sometimes goes into a loop and just repeats the same sentence for the rest of the transcript. I'm running the following command:

./main -m models/ggml-large.bin "$output_file" -t 11 -l nl --output-txt --print-colors --best-of 3

The audio file is 20 minutes long, but I have seen it with other files as well.

[Screenshot: 2023-05-15 at 09:26:04]

mrfragger commented 1 year ago

I have it happen too, sort of. Using the medium.en model, transcription works fine until the 13h 57m mark and then outputs nothing but ". . . . ." until the 25-hour mark, at which point I killed it. It was a 36h 2m audio segment. I'm going to keep audio segments around 30 hours to hopefully avoid this issue.

dhx commented 1 year ago

Possibly related to: https://github.com/openai/whisper/pull/1253

jingyibo123 commented 9 months ago

This can be easily reproduced with the sample:

./main -m ./models/ggml-large-v3-q5_0.bin -f samples/gb1.wav
whisper_init_from_file_with_params_no_state: loading model from './models/ggml-large-v3-q5_0.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51866
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head  = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1280
whisper_model_load: n_text_head   = 20
whisper_model_load: n_text_layer  = 32
whisper_model_load: n_mels        = 128
whisper_model_load: ftype         = 8
whisper_model_load: qntvr         = 2
whisper_model_load: type          = 5 (large v3)
whisper_model_load: adding 1609 extra tokens
whisper_model_load: n_langs       = 100
whisper_model_load:      CPU buffer size =  1080.97 MB
whisper_model_load: model size    = 1080.47 MB
whisper_init_state: kv self size  =  220.20 MB
whisper_init_state: kv cross size =  245.76 MB
whisper_init_state: compute buffer (conv)   =   32.42 MB
whisper_init_state: compute buffer (encode) =  212.42 MB
whisper_init_state: compute buffer (cross)  =    9.38 MB
whisper_init_state: compute buffer (decode) =   99.24 MB

system_info: n_threads = 1 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 0 | COREML = 0 | OPENVINO = 0 | 

main: processing 'samples/gb1.wav' (3179927 samples, 198.7 sec), 1 threads, 1 processors, 5 beams + best of 5, lang = en, task = transcribe, timestamps = 1 ...

[00:00:00.980 --> 00:00:08.720]   My fellow Americans, this day has brought terrible news and great sadness to our country.
[00:00:08.720 --> 00:00:17.280]   At 9:00 this morning, Mission Control in Houston lost contact with our space shuttle Columbia.
[00:00:17.280 --> 00:00:24.640]   A short time later, debris was seen falling from the skies above Texas.
[00:00:24.640 --> 00:00:27.200]   The Columbia is lost.
[00:00:27.200 --> 00:00:29.860]   There are no survivors.
[00:00:29.860 --> 00:00:32.920]   On board was a crew of seven.
[00:00:32.920 --> 00:00:39.760]   Colonel Rick Husband, Lieutenant Colonel Michael Anderson, Commander Laurel Clark, Captain
[00:00:39.760 --> 00:00:50.120]   David Brown, Commander William McCool, Dr. Kulpna Shavla, and Ilan Ramon, a colonel in
[00:00:50.120 --> 00:00:52.780]   the Israeli Air Force.
[00:00:52.780 --> 00:00:59.720]   These men and women assumed great risk in the service to all humanity in an age when
[00:00:59.720 --> 00:01:03.100]   flight has come to seem almost routine.
[00:01:03.100 --> 00:01:08.720]   It is easy to overlook the dangers of travel by rocket and the difficulties of navigating
[00:01:08.720 --> 00:01:12.580]   the fierce outer atmosphere of the Earth.
[00:01:12.580 --> 00:01:19.220]   These astronauts knew the dangers, and they faced them willingly, knowing they had a high
[00:01:19.220 --> 00:01:22.940]   and noble purpose in life.
[00:01:22.940 --> 00:01:29.580]   Because of their courage and daring and idealism, we will miss them all the more.
[00:01:29.580 --> 00:01:36.360]   All Americans today are thinking as well of the families of these men and women who
[00:01:36.360 --> 00:01:40.440]   have been given this sudden shock and grief.
[00:01:40.440 --> 00:01:42.340]   You're not alone.
[00:01:42.340 --> 00:01:45.420]   Our entire nation grieves with you.
[00:01:45.420 --> 00:01:52.340]   And those you loved will always have the respect and gratitude of this country.
[00:01:52.340 --> 00:01:57.060]   The cause in which they died will continue.
[00:01:57.060 --> 00:01:59.440]   Mankind is led into the darkness.
[00:01:59.440 --> 00:02:02.200]   But we will not be left behind.
[00:02:02.200 --> 00:02:04.200]   We will be led into the darkness.
[00:02:04.200 --> 00:02:06.200]   We will be led into the darkness.
[00:02:06.200 --> 00:02:08.200]   We will be led into the darkness.
[00:02:08.200 --> 00:02:10.200]   We will be led into the darkness.
[00:02:10.200 --> 00:02:12.200]   We will be led into the darkness.
[00:02:12.200 --> 00:02:14.200]   We will be led into the darkness.
[00:02:14.200 --> 00:02:16.200]   We will be led into the darkness.
[00:02:16.200 --> 00:02:18.200]   We will be led into the darkness.
[00:02:18.200 --> 00:02:20.200]   We will be led into the darkness.
[00:02:20.200 --> 00:02:22.200]   We will be led into the darkness.
[00:02:22.200 --> 00:02:24.200]   We will be led into the darkness.
[00:02:24.200 --> 00:02:26.200]   We will be led into the darkness.
[00:02:26.200 --> 00:02:28.200]   We will be led into the darkness.
[00:02:28.200 --> 00:02:29.300]   We will be led into the darkness.

mtrazzi commented 9 months ago

Any updates on this? I had the same problem using the large v3 model.

Lavrikov commented 9 months ago

Try the -mc 0 flag. It prevents the previous text from being fed as a prompt for the next segment.
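
For example, with the reproduction command from earlier in the thread it looks like this:

./main -m ./models/ggml-large-v3-q5_0.bin -f samples/gb1.wav -mc 0

As far as I can tell, -mc is the short form of --max-context, i.e. how many text tokens from already-decoded segments are kept as a prompt for the decoder; 0 disables that carry-over entirely.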

brbrainerd commented 3 months ago

> Try the -mc 0 flag. It prevents the previous text from being fed as a prompt for the next segment.

Great solution. I thought I'd post a Python function that removes sequentially repeated lines in case you would like to keep your token history. It's worked successfully on a media library with ~14,000 videos:

import logging
import re

def check_repeated_lines(vtt_file):
    """Check for and remove sequentially repeated subtitle lines in the VTT file."""
    logging.debug(f"Checking for repeated lines in {vtt_file}")
    with open(vtt_file, 'r') as file:
        content = file.readlines()

    cleaned_content = []
    previous_text = None

    i = 0
    while i < len(content):
        line = content[i]
        # A cue timing line, e.g. "00:01:42.340 --> 00:01:45.420", followed by one text line
        if re.match(r'^[0-9]{2}:[0-9]{2}:[0-9]{2}\.[0-9]{3} --> [0-9]{2}:[0-9]{2}:[0-9]{2}\.[0-9]{3}$', line) and i + 1 < len(content):
            text = content[i + 1].strip()
            if text != previous_text:
                # Keep the cue only if its text differs from the previous cue's text
                cleaned_content.append(line)
                cleaned_content.append(content[i + 1])
            previous_text = text
            i += 2  # Move past the timestamp line and its text line
        else:
            cleaned_content.append(line)
            i += 1

    # Remove any remaining blank lines
    cleaned_content = [line for line in cleaned_content if line.strip()]

    with open(vtt_file, 'w') as file:
        file.writelines(cleaned_content)

    logging.debug(f"Finished cleaning repeated lines in {vtt_file}")
    return False
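
For reference, a minimal way to call it (the file path is just a placeholder, and it assumes logging has been configured elsewhere in your script):

import logging

logging.basicConfig(level=logging.DEBUG)

# Placeholder path to a VTT file produced by whisper.cpp with --output-vtt
check_repeated_lines("transcripts/example.vtt")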