ggerganov / whisper.cpp

Port of OpenAI's Whisper model in C/C++
MIT License

Duplicate words generated #896

Open leohuang2013 opened 1 year ago

leohuang2013 commented 1 year ago

I used the latest commit (bf2449d) with the model ggml-small.bin on macOS, executing: $> bin/main -m ../models/ggml-small.bin ~/tmp/wrongResultWithWhisper.wav

The output has many duplicated words, as below:

[00:00:33.000 --> 00:00:44.000] To this index, Earth has a rating of 0.829, but Kepler 442B has a rating of 0.836.
[00:00:44.000 --> 00:00:50.000] This is not certain because Kepler 442B's atmosphere and surface are unknown,
[00:00:50.000 --> 00:00:53.000] but this would be possible.
[00:00:54.000 --> 00:00:59.000] Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
[00:00:59.000 --> 00:01:04.000] but Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
[00:01:04.000 --> 00:01:09.000] so Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
[00:01:09.000 --> 00:01:14.000] but Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
[00:01:14.000 --> 00:01:19.000] so Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
[00:01:19.000 --> 00:01:24.000] but Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
[00:01:24.000 --> 00:01:29.000] so Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
[00:01:29.000 --> 00:01:34.000] but Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
[00:01:34.000 --> 00:01:39.000] so Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
[00:01:39.000 --> 00:01:43.000] but Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
[00:01:43.000 --> 00:01:49.000] so Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
[00:01:49.000 --> 00:01:54.000] but Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
[00:01:54.000 --> 00:01:59.000] so Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
[00:01:59.000 --> 00:02:04.000] so Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
[00:02:04.000 --> 00:02:09.000] so Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
[00:02:09.000 --> 00:02:14.000] so Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
[00:02:14.000 --> 00:02:19.000] so Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
[00:02:19.000 --> 00:02:24.000] so Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
[00:02:24.000 --> 00:02:29.000] so Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
[00:02:29.000 --> 00:02:33.000] so Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
[00:02:33.000 --> 00:02:38.000] so Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
[00:02:38.000 --> 00:02:43.000] so Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
[00:02:43.000 --> 00:02:48.000] so Kepler 442B's air-conditioning is a bit too much for the air-conditioning,

I attached a sample wav file: wrongResultWithWhisper.wav.zip

abelbabel commented 1 year ago

Can confirm that recent commits that claimed to resolve the word duplication issues did not resolve them.

leohuang2013 commented 1 year ago

I just tried commit https://github.com/ggerganov/whisper.cpp/commit/f19e23fbd108ec3ac458c7a19b31c930719e7a94, which was mentioned in https://github.com/ggerganov/whisper.cpp/issues/612

I got the same result:

[00:00:44.000 --> 00:00:50.000] This is not certain because Kepler 442B's atmosphere and surface are unknown,
[00:00:50.000 --> 00:00:53.000] but this would be possible.
[00:00:54.000 --> 00:00:59.000] Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
[00:00:59.000 --> 00:01:04.000] but Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
[00:01:04.000 --> 00:01:09.000] so Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
[00:01:09.000 --> 00:01:14.000] but Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
[00:01:14.000 --> 00:01:19.000] so Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
[00:01:19.000 --> 00:01:24.000] but Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
[00:01:24.000 --> 00:01:29.000] so Kepler 442B's air-conditioning is a bit too much for the air-conditioning,

hoonlight commented 1 year ago

Same problem on an M1 Pro 14" MacBook.

chenqianhe commented 1 year ago

This is partly a problem with the model itself.

WhisperHallu deserves attention. I can confirm that removing the voiceless parts (using silero-vad) is very effective for me.
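silero-vad is a model-based Python VAD, but as a rough shell-only approximation of the same idea (trimming non-speech before transcription), ffmpeg's silenceremove filter can be used. This is a sketch, not what the commenter used, and the thresholds are illustrative:

```shell
# Trim silence before transcription using ffmpeg's silenceremove filter.
# NOTE: this is NOT silero-vad (which is a neural VAD); the -40dB threshold
# and 0.5s stop duration are illustrative and will need tuning per recording.
ffmpeg -i input.wav \
       -af "silenceremove=start_periods=1:stop_periods=-1:stop_duration=0.5:stop_threshold=-40dB" \
       -ar 16000 -ac 1 -c:a pcm_s16le cleaned.wav
```

The output is also resampled to 16 kHz mono PCM, the format whisper.cpp's examples expect.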

leohuang2013 commented 1 year ago

Just tested with OpenAI's Whisper; it does not have this issue.

$> whisper --model base wrongResultWithWhisper.wav

pdw207 commented 1 year ago

This appears to be related to closed issues #471, #477, #508, #612, #719, and #731, and an attempted fix released in v1.3.0.

Here are excerpts of the duplication, seen in a build from main (77eab3f) after release v1.4.2. The full output can be seen here.

Output

…
[00:05:58.000 --> 00:06:08.000]   [ Background noise ]
[00:06:08.000 --> 00:06:18.000]   [ Background noise ]
[00:06:18.000 --> 00:06:28.000]   [ Background noise ]
[00:06:28.000 --> 00:06:38.000]   [ Background noise ]
[00:06:38.000 --> 00:06:48.000]   [ Background noise ]
[00:06:48.000 --> 00:06:58.000]   [ Background noise ]
[00:06:58.000 --> 00:07:08.000]   [ Background noise ] 
[00:07:08.000 --> 00:07:13.000]   [ Background noise ]
[00:07:13.000 --> 00:07:18.000]   [ Background noise ]
[00:07:18.000 --> 00:07:28.000]   [ Background noise ] <- The speaker starts here and, while clearly audible, is not transcribed
[00:07:28.000 --> 00:07:38.000]   [ Background noise ]
[00:07:38.000 --> 00:07:48.000]   [ Background noise ]
[00:07:48.000 --> 00:07:58.000]   [ Background noise ]
[00:07:58.000 --> 00:08:08.000]   [ Background noise ]
…
[00:41:15.000 --> 00:41:16.000]   You picked...  <- There is cross-talk, but this is repeated in the transcription
[00:41:16.000 --> 00:41:17.000]   You picked...
[00:41:17.000 --> 00:41:18.000]   You picked...
[00:41:18.000 --> 00:41:19.000]   You picked...
…
[00:42:16.000 --> 00:42:18.000]   He has never done a single thing.  <- There is minor cross-talk, but this is repeated in the transcription
[00:42:18.000 --> 00:42:20.000]   He has never done a single thing.
[00:42:20.000 --> 00:42:25.000]   He has never done a single thing.
[00:42:25.000 --> 00:42:26.000]   He has never done a single thing.
[00:42:26.000 --> 00:42:27.000]   He has never done a single thing.
[00:42:27.000 --> 00:42:28.000]   He has never done a single thing.
[00:42:28.000 --> 00:42:29.000]   He has never done a single thing.
[00:42:29.000 --> 00:42:30.000]   He has never done a single thing.
[00:42:30.000 --> 00:42:31.000]   He has never done a single thing.
[00:42:31.000 --> 00:42:32.000]   He has never done a single thing.
[00:42:32.000 --> 00:42:33.000]   He has never done a single thing.
…
[00:48:23.000 --> 00:48:25.000]   You don't know how many people died in Russia.
[00:48:25.000 --> 00:48:27.000]   You don't know how many people died in Russia.
[00:48:27.000 --> 00:48:29.000]   You don't know how many people died in Russia.
[00:48:29.000 --> 00:48:31.000]   You don't know how many people died in Russia.
[00:48:31.000 --> 00:48:33.000]   You don't know how many people died in Russia.
…

Steps to Reproduce:

Audio from a US presidential debate:

./models/download-ggml-model.sh base.en
make
curl -o ./samples/us-debates.m4a https://public-bucket-palmar.s3.amazonaws.com/test-files/us-debates.m4a  
ffmpeg -i ./samples/us-debates.m4a -ar 16000 -ac 1 -c:a pcm_s16le ./samples/us-debates.wav
./main -m ./models/ggml-base.en.bin -f ./samples/us-debates.wav -otxt

Output:

whisper_init_from_file_no_state: loading model from './models/ggml-base.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 2
whisper_model_load: mem required  =  310.00 MB (+    6.00 MB per decoder)
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx     =  140.66 MB
whisper_model_load: model size    =  140.54 MB
whisper_init_state: kv self size  =    5.25 MB
whisper_init_state: kv cross size =   17.58 MB
system_info: n_threads = 4 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | COREML = 0 | 
main: processing './samples/us-debates.wav' (119353911 samples, 7459.6 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...

whisper_print_timings:     load time =   102.34 ms
whisper_print_timings:     fallbacks =   8 p /   5 h
whisper_print_timings:      mel time = 10217.33 ms
whisper_print_timings:   sample time = 14363.63 ms / 29991 runs (    0.48 ms per run)
whisper_print_timings:   encode time = 129503.91 ms /   372 runs (  348.13 ms per run)
whisper_print_timings:   decode time = 107696.36 ms / 29985 runs (    3.59 ms per run)
whisper_print_timings:    total time = 262345.28 ms

Debug Output:

Output with WHISPER_DEBUG defined. Using defaults, entropy of 2.40, beam size -1, best of 2.

whisper_full_with_state: decoder  0: score = -0.23921, result_len =   6, avg_logprobs = -0.23921, entropy =  1.79176
whisper_full_with_state: best decoder = 0
[00:07:28.000 --> 00:07:38.000]   [ Background noise ]
seek = 45800, seek_delta = 1000
…
whisper_full_with_state: decoder  0: score = -0.15139, result_len = 149, avg_logprobs = -0.15139, entropy =  2.19991
whisper_full_with_state: decoder  0: failed due to entropy  2.19991 <  2.40000
whisper_full_with_state: decoder  1: score = -0.02645, result_len = 220, avg_logprobs = -0.02645, entropy =  2.44152
whisper_full_with_state: best decoder = 1
[00:42:20.000 --> 00:42:25.000]   He has never done a single thing.
[00:42:25.000 --> 00:42:26.000]   He has never done a single thing.
[00:42:26.000 --> 00:42:27.000]   He has never done a single thing.
[00:42:27.000 --> 00:42:28.000]   He has never done a single thing.
[00:42:28.000 --> 00:42:29.000]   He has never done a single thing.
[00:42:29.000 --> 00:42:30.000]   He has never done a single thing.
[00:42:30.000 --> 00:42:31.000]   He has never done a single thing.
[00:42:31.000 --> 00:42:32.000]   He has never done a single thing.
[00:42:32.000 --> 00:42:33.000]   He has never done a single thing.
[00:42:33.000 --> 00:42:34.000]   He has never done a single thing.
[00:42:34.000 --> 00:42:35.000]   He has never done a single thing.
[00:42:35.000 --> 00:42:36.000]   He has never done a single thing.
[00:42:36.000 --> 00:42:37.000]   He has never done a single thing.
[00:42:37.000 --> 00:42:38.000]   He has never done a single thing.
[00:42:38.000 --> 00:42:39.000]   He has never done a single thing.
[00:42:39.000 --> 00:42:40.000]   He has never done a single thing.
[00:42:40.000 --> 00:42:41.000]   He has never done a single thing.
[00:42:41.000 --> 00:42:42.000]   He has never done a single thing.
[00:42:42.000 --> 00:42:43.000]   He has never done a single thing.
[00:42:43.000 --> 00:42:44.000]   He has never done a single thing.
[00:42:44.000 --> 00:42:45.000]   He has never done a single thing.
[00:42:45.000 --> 00:42:46.000]   He has never done a single thing.
seek = 256600, seek_delta = 2600
…
whisper_full_with_state: decoder  0: score = -0.18104, result_len = 183, avg_logprobs = -0.18104, entropy =  2.62054
whisper_full_with_state: best decoder = 0
[00:48:23.000 --> 00:48:25.000]   You don't know how many people died in Russia.
[00:48:25.000 --> 00:48:27.000]   You don't know how many people died in Russia.
[00:48:27.000 --> 00:48:29.000]   You don't know how many people died in Russia.
[00:48:29.000 --> 00:48:31.000]   You don't know how many people died in Russia.
[00:48:31.000 --> 00:48:33.000]   You don't know how many people died in Russia.
[00:48:33.000 --> 00:48:35.000]   You don't know how many people died in Russia.
[00:48:35.000 --> 00:48:37.000]   You don't know how many people died in Russia.
[00:48:37.000 --> 00:48:39.000]   You don't know how many people died in Russia.
[00:48:39.000 --> 00:48:41.000]   You don't know how many people died in Russia.
[00:48:41.000 --> 00:48:43.000]   You don't know how many people died in Russia.
[00:48:43.000 --> 00:48:45.000]   You don't know how many people died in Russia.
[00:48:45.000 --> 00:48:47.000]   You don't know how many people died in Russia.
[00:48:47.000 --> 00:48:49.000]   You don't know how many people died in Russia.
[00:48:49.000 --> 00:48:51.000]   You don't know how many people died in Russia.
seek = 293100, seek_delta = 2800
pdw207 commented 1 year ago

In response to https://github.com/ggerganov/whisper.cpp/issues/508#issuecomment-1435907929 I experimented with raising the entropy threshold (to 2.8 and 3.5). It avoids specific duplications but does not solve all cases, and I'm not sure I fully understand the trade-offs among all the fine-tuning parameters. I'm also looking for suggestions on beam size.

I am trying to optimize for quality over processing time. Possibly a naive question, but since there are a number of parameters to fine-tune, is there guidance on temperature, fallback_temperature, beam_size, best_of count, and entropy settings to avoid this behavior? Alternatively, are there defaults from OpenAI's implementation that we can mirror, or can a preprocessing stage or transcription strategy (breaking up long audio files) reduce the likelihood of this error? I see a comment in the thread about building with a different optimization level, but I'm not sure whether there is guidance on how to do that or whether it is a recommended strategy.
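One way to explore these trade-offs empirically is a small parameter sweep, writing one transcript per setting so the outputs can be diffed. This is a sketch; the model path, input file, and parameter grid are illustrative:

```shell
# Sweep entropy threshold (-et) and beam size (-bs), saving each transcript
# under a distinct name via -of so the results can be compared side by side.
# Paths and the parameter grid are placeholders.
for et in 2.4 2.8 3.5; do
    for bs in 2 5 8; do
        ./main -m ./models/ggml-base.en.bin -f ./samples/us-debates.wav \
               -et "$et" -bs "$bs" -otxt -of "out-et${et}-bs${bs}"
    done
done
```

Comparing the resulting .txt files (e.g. with diff) makes it easy to see which settings suppress the repeated segments without degrading the rest of the transcript.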

The model is hallucinating. You can improve the behavior by trying -bo 7, or some number larger than the default of 5. The other thing to try is building with a different optimization level: -O3 instead of -O2, or vice versa.

It appears an entropy threshold of 2.8 would have resolved that case, but additional duplicated lines are produced at higher entropy values, and raising the threshold may make the transcription overly cautious elsewhere. I'm also not sure what to make of the "failed due to entropy" message.

...
[01:12:04.960 --> 01:12:05.960]   He blew it. <-  entropy =  2.94588
[01:12:05.960 --> 01:12:06.960]   He blew it.
[01:12:06.960 --> 01:12:07.960]   He blew it.
[01:12:07.960 --> 01:12:08.960]   He blew it.
...
[01:12:11.960 --> 01:12:12.960]   It was a threat. <-  entropy =  2.94588
[01:12:12.960 --> 01:12:13.960]   It was a threat.
[01:12:13.960 --> 01:12:14.960]   It was a threat.
[01:12:14.960 --> 01:12:15.960]   It was a threat.
...
[01:30:06.780 --> 01:30:07.780]   That's not true. <- entropy =  3.18945
[01:30:07.780 --> 01:30:08.780]   That's not true.
[01:30:08.780 --> 01:30:09.780]   That's not true.
[01:30:09.780 --> 01:30:10.780]   That's not true.
...
whisper_full_with_state: decoder  0: score = -0.22626, result_len = 202, avg_logprobs = -0.22626, entropy =  2.90255
whisper_full_with_state: decoder  1: score = -0.19905, result_len = 214, avg_logprobs = -0.19905, entropy =  2.46849
whisper_full_with_state: decoder  1: failed due to entropy  2.46849 <  2.80000
whisper_full_with_state: best decoder = 0
[01:34:05.040 --> 01:34:06.840]   We're moving on to the next one.
[01:34:06.840 --> 01:34:07.840]   We're moving on to the next one.
[01:34:07.840 --> 01:34:08.840]   We're moving on to the next one.
[01:34:08.840 --> 01:34:09.840]   We're moving on to the next one.
[01:34:09.840 --> 01:34:11.840]   We're moving on to the next one.
pdw207 commented 1 year ago

In reference to the audio file used to highlight the issue in https://github.com/ggerganov/whisper.cpp/issues/896#issuecomment-1562283987

@jordibruin I see this audio file performs reasonably well in MacWhisper. Did you face this issue and set a higher entropy threshold or beam size?

@ggerganov any guidance you could provide?

ggerganov commented 1 year ago

@pdw207

https://github.com/ggerganov/whisper.cpp/blob/77eab3fbfe5e5462021d92dd230076bba06eefbc/whisper.cpp#L3329

pdw207 commented 1 year ago

@ggerganov Appreciate the detailed response as those settings did resolve the issue.

leohuang2013 commented 1 year ago

@ggerganov Are those settings correct?

    params.n_max_text_ctx = 64; 
    params.temperature_inc = 0.1f; 
    params.beam_search.beam_size = 5;
    params.entropy_thold = 2.8f;

params is whisper_full_params.

The other settings are as follows:

    params.print_realtime = false;
    params.print_progress = false;
    params.print_timestamps = false;
    params.print_special = false;
    params.translate = false;
    params.language = m_languageCode.c_str();
    params.n_threads = maxThreads;
    params.offset_ms = 0;
    params.no_context = false; // Since we read audio file block by block
    params.single_segment = false;
    params.token_timestamps = true;
    params.progress_callback = internalProgressCallback;
    params.progress_callback_user_data = this;
    params.greedy.best_of = 2;
    params.thold_pt = 0.01f;
    params.thold_ptsum = 0.01f;
    params.no_speech_thold = 0.6f;
    params.logprob_thold = -1.0f;
    params.length_penalty = -1;
    params.new_segment_callback = internalSegmentCallback;
    params.new_segment_callback_user_data = this;
    // suppress tokens, like music, clap, see whisper.cpp:3225
    // Don't set this true, it will affect accuracy. Don't know why
    params.suppress_non_speech_tokens = false;

After changing to the above settings, I still got the same duplicated words.

pdw207 commented 1 year ago

@leohuang2013 Do you have an audio file you can share and steps to reproduce?

leohuang2013 commented 1 year ago

> @leohuang2013 Do you have an audio file you can share and steps to reproduce?

This is the file I used:

wrongResultWithWhisper.wav.zip

eual8 commented 1 year ago

@pdw207

  • Currently the temperature step is set to 0.4. Try to decrease it to 0.1 as in the original Whisper implementation:

https://github.com/ggerganov/whisper.cpp/blob/77eab3fbfe5e5462021d92dd230076bba06eefbc/whisper.cpp#L3329

  • Increase beam size to 5: -bs 5
  • Adjust entropy threshold -et 2.8
  • Reduce max context size -mc 64
  • Use a larger model
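Taken together, the suggestions above correspond to a command line like the following. This is a sketch; the model and input paths are placeholders:

```shell
# Combined flags from the suggestions above:
#   -bs 5   beam size 5
#   -et 2.8 entropy threshold 2.8
#   -mc 64  maximum text context of 64 tokens
# Model path and input file are placeholders.
./main -m ./models/ggml-large.bin -f ./samples/audio.wav -bs 5 -et 2.8 -mc 64
```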

Thanks, this helped me with the large model (ggml-model.bin).

togume commented 11 months ago

Came here to solve this same problem, which I encountered when running large-v3 + Core ML. The transcription got stuck repeating itself at the end of a 2 hr 22 min recording.

I was able to get it unstuck using -bs 5 -et 2.8 -mc 64 and didn't change the temperature.

I'd love to figure out how to process large amounts of audio as efficiently as possible without the model getting stuck repeating itself. I'll keep experimenting; please let me know if anyone has ideas.

togume commented 11 months ago

Update: it also repeats itself with default settings on a small piece of audio in Spanish.

KNWR commented 5 months ago

I was still running into issues with some of the above. As a workaround, I've been using a script to split the audio into smaller chunks. Script here if it helps anyone: https://github.com/ggerganov/whisper.cpp/issues/1851#issuecomment-2119262466
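For reference, a minimal chunking workaround (not the linked script; segment length and filenames are illustrative) can be built from ffmpeg's segment muxer:

```shell
# Split the input into 10-minute WAV chunks in the 16 kHz mono PCM format
# whisper.cpp's examples expect, then transcribe each chunk separately.
# Paths, the 600-second segment length, and the model are placeholders.
ffmpeg -i input.wav -ar 16000 -ac 1 -c:a pcm_s16le \
       -f segment -segment_time 600 chunk_%03d.wav

for f in chunk_*.wav; do
    ./main -m ./models/ggml-base.en.bin -f "$f" -otxt
done
```

Shorter chunks bound how far a repetition loop can run, at the cost of losing cross-chunk context (segments may be cut mid-sentence at the chunk boundaries).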