leohuang2013 opened 1 year ago
Can confirm that recent commits that claimed to resolve the word duplication issues did not resolve them.
I just tried commit https://github.com/ggerganov/whisper.cpp/commit/f19e23fbd108ec3ac458c7a19b31c930719e7a94, which was mentioned in https://github.com/ggerganov/whisper.cpp/issues/612.
I got the same result:

[00:00:44.000 --> 00:00:50.000] This is not certain because Kepler 442B's atmosphere and surface are unknown,
[00:00:50.000 --> 00:00:53.000] but this would be possible.
[00:00:54.000 --> 00:00:59.000] Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
[00:00:59.000 --> 00:01:04.000] but Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
[00:01:04.000 --> 00:01:09.000] so Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
[00:01:09.000 --> 00:01:14.000] but Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
[00:01:14.000 --> 00:01:19.000] so Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
[00:01:19.000 --> 00:01:24.000] but Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
[00:01:24.000 --> 00:01:29.000] so Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
Same problem on a 14-inch M1 Pro MacBook.
This is partly a problem with the model itself.
WhisperHallu deserves attention. And I can confirm that removing the voiceless parts (using silero-vad) is very effective for me.
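For anyone who wants to try the silence-removal idea without pulling in silero-vad, even a naive energy-based trim can help. This is a hypothetical sketch (the frame size and threshold are made-up illustrative values), not a substitute for a real VAD:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Naive energy-based silence removal: a rough stand-in for a real VAD
// such as silero-vad. Frames whose RMS falls below `threshold` are
// dropped. Frame size and threshold are illustrative, not tuned values.
std::vector<float> drop_silent_frames(const std::vector<float>& pcm,
                                      std::size_t frame_size = 480, // 30 ms @ 16 kHz
                                      float threshold = 0.01f) {
    std::vector<float> out;
    for (std::size_t i = 0; i < pcm.size(); i += frame_size) {
        std::size_t end = std::min(i + frame_size, pcm.size());
        double energy = 0.0;
        for (std::size_t j = i; j < end; ++j) {
            energy += pcm[j] * pcm[j];
        }
        const float rms = std::sqrt(energy / double(end - i));
        if (rms >= threshold) {
            out.insert(out.end(), pcm.begin() + i, pcm.begin() + end);
        }
    }
    return out;
}
```

Feeding the trimmed buffer to the transcription instead of the raw file removes the long voiceless stretches that appear to trigger the looping.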
Just tested with OpenAI's whisper. It does not have this issue.
$> whisper --model base wrongResultWithWhisper.wav
This appears to be related to closed issues #471 #477 #508 #612 #719 #731 and an attempted fix released in v1.3.0.
Here are excerpts of the duplication, seen in a build of main (77eab3f) after release v1.4.2. Full output can be seen here.
…
[00:05:58.000 --> 00:06:08.000] [ Background noise ]
[00:06:08.000 --> 00:06:18.000] [ Background noise ]
[00:06:18.000 --> 00:06:28.000] [ Background noise ]
[00:06:28.000 --> 00:06:38.000] [ Background noise ]
[00:06:38.000 --> 00:06:48.000] [ Background noise ]
[00:06:48.000 --> 00:06:58.000] [ Background noise ]
[00:06:58.000 --> 00:07:08.000] [ Background noise ]
[00:07:08.000 --> 00:07:13.000] [ Background noise ]
[00:07:13.000 --> 00:07:18.000] [ Background noise ]
[00:07:18.000 --> 00:07:28.000] [ Background noise ] ← The speaker starts here and, while clearly audible, is not transcribed
[00:07:28.000 --> 00:07:38.000] [ Background noise ]
[00:07:38.000 --> 00:07:48.000] [ Background noise ]
[00:07:48.000 --> 00:07:58.000] [ Background noise ]
[00:07:58.000 --> 00:08:08.000] [ Background noise ]
…
[00:41:15.000 --> 00:41:16.000] You picked... ← There is cross-talk, but this is repeated in the transcription
[00:41:16.000 --> 00:41:17.000] You picked...
[00:41:17.000 --> 00:41:18.000] You picked...
[00:41:18.000 --> 00:41:19.000] You picked...
…
[00:42:16.000 --> 00:42:18.000] He has never done a single thing. ← There is minor cross-talk, but this is repeated in the transcription
[00:42:18.000 --> 00:42:20.000] He has never done a single thing.
[00:42:20.000 --> 00:42:25.000] He has never done a single thing.
[00:42:25.000 --> 00:42:26.000] He has never done a single thing.
[00:42:26.000 --> 00:42:27.000] He has never done a single thing.
[00:42:27.000 --> 00:42:28.000] He has never done a single thing.
[00:42:28.000 --> 00:42:29.000] He has never done a single thing.
[00:42:29.000 --> 00:42:30.000] He has never done a single thing.
[00:42:30.000 --> 00:42:31.000] He has never done a single thing.
[00:42:31.000 --> 00:42:32.000] He has never done a single thing.
[00:42:32.000 --> 00:42:33.000] He has never done a single thing.
…
[00:48:23.000 --> 00:48:25.000] You don't know how many people died in Russia.
[00:48:25.000 --> 00:48:27.000] You don't know how many people died in Russia.
[00:48:27.000 --> 00:48:29.000] You don't know how many people died in Russia.
[00:48:29.000 --> 00:48:31.000] You don't know how many people died in Russia.
[00:48:31.000 --> 00:48:33.000] You don't know how many people died in Russia.
…
Audio from a US presidential debate
./models/download-ggml-model.sh base.en
make
curl -o ./samples/us-debates.m4a https://public-bucket-palmar.s3.amazonaws.com/test-files/us-debates.m4a
ffmpeg -i ./samples/us-debates.m4a -ar 16000 ./samples/us-debates.wav
./main -m ./models/ggml-base.en.bin -f ./samples/us-debates.wav -otxt
whisper_init_from_file_no_state: loading model from './models/ggml-base.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab = 51864
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 512
whisper_model_load: n_text_head = 8
whisper_model_load: n_text_layer = 6
whisper_model_load: n_mels = 80
whisper_model_load: ftype = 1
whisper_model_load: qntvr = 0
whisper_model_load: type = 2
whisper_model_load: mem required = 310.00 MB (+ 6.00 MB per decoder)
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx = 140.66 MB
whisper_model_load: model size = 140.54 MB
whisper_init_state: kv self size = 5.25 MB
whisper_init_state: kv cross size = 17.58 MB
system_info: n_threads = 4 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | COREML = 0 |
main: processing './samples/us-debates.wav' (119353911 samples, 7459.6 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...
whisper_print_timings: load time = 102.34 ms
whisper_print_timings: fallbacks = 8 p / 5 h
whisper_print_timings: mel time = 10217.33 ms
whisper_print_timings: sample time = 14363.63 ms / 29991 runs ( 0.48 ms per run)
whisper_print_timings: encode time = 129503.91 ms / 372 runs ( 348.13 ms per run)
whisper_print_timings: decode time = 107696.36 ms / 29985 runs ( 3.59 ms per run)
whisper_print_timings: total time = 262345.28 ms
Output with WHISPER_DEBUG defined, using the defaults: entropy threshold 2.40, beam size -1, best of 2.
whisper_full_with_state: decoder 0: score = -0.23921, result_len = 6, avg_logprobs = -0.23921, entropy = 1.79176
whisper_full_with_state: best decoder = 0
[00:07:28.000 --> 00:07:38.000] [ Background noise ]
seek = 45800, seek_delta = 1000
…
whisper_full_with_state: decoder 0: score = -0.15139, result_len = 149, avg_logprobs = -0.15139, entropy = 2.19991
whisper_full_with_state: decoder 0: failed due to entropy 2.19991 < 2.40000
whisper_full_with_state: decoder 1: score = -0.02645, result_len = 220, avg_logprobs = -0.02645, entropy = 2.44152
whisper_full_with_state: best decoder = 1
[00:42:20.000 --> 00:42:25.000] He has never done a single thing.
[00:42:25.000 --> 00:42:26.000] He has never done a single thing.
[00:42:26.000 --> 00:42:27.000] He has never done a single thing.
[00:42:27.000 --> 00:42:28.000] He has never done a single thing.
[00:42:28.000 --> 00:42:29.000] He has never done a single thing.
[00:42:29.000 --> 00:42:30.000] He has never done a single thing.
[00:42:30.000 --> 00:42:31.000] He has never done a single thing.
[00:42:31.000 --> 00:42:32.000] He has never done a single thing.
[00:42:32.000 --> 00:42:33.000] He has never done a single thing.
[00:42:33.000 --> 00:42:34.000] He has never done a single thing.
[00:42:34.000 --> 00:42:35.000] He has never done a single thing.
[00:42:35.000 --> 00:42:36.000] He has never done a single thing.
[00:42:36.000 --> 00:42:37.000] He has never done a single thing.
[00:42:37.000 --> 00:42:38.000] He has never done a single thing.
[00:42:38.000 --> 00:42:39.000] He has never done a single thing.
[00:42:39.000 --> 00:42:40.000] He has never done a single thing.
[00:42:40.000 --> 00:42:41.000] He has never done a single thing.
[00:42:41.000 --> 00:42:42.000] He has never done a single thing.
[00:42:42.000 --> 00:42:43.000] He has never done a single thing.
[00:42:43.000 --> 00:42:44.000] He has never done a single thing.
[00:42:44.000 --> 00:42:45.000] He has never done a single thing.
[00:42:45.000 --> 00:42:46.000] He has never done a single thing.
seek = 256600, seek_delta = 2600
…
whisper_full_with_state: decoder 0: score = -0.18104, result_len = 183, avg_logprobs = -0.18104, entropy = 2.62054
whisper_full_with_state: best decoder = 0
[00:48:23.000 --> 00:48:25.000] You don't know how many people died in Russia.
[00:48:25.000 --> 00:48:27.000] You don't know how many people died in Russia.
[00:48:27.000 --> 00:48:29.000] You don't know how many people died in Russia.
[00:48:29.000 --> 00:48:31.000] You don't know how many people died in Russia.
[00:48:31.000 --> 00:48:33.000] You don't know how many people died in Russia.
[00:48:33.000 --> 00:48:35.000] You don't know how many people died in Russia.
[00:48:35.000 --> 00:48:37.000] You don't know how many people died in Russia.
[00:48:37.000 --> 00:48:39.000] You don't know how many people died in Russia.
[00:48:39.000 --> 00:48:41.000] You don't know how many people died in Russia.
[00:48:41.000 --> 00:48:43.000] You don't know how many people died in Russia.
[00:48:43.000 --> 00:48:45.000] You don't know how many people died in Russia.
[00:48:45.000 --> 00:48:47.000] You don't know how many people died in Russia.
[00:48:47.000 --> 00:48:49.000] You don't know how many people died in Russia.
[00:48:49.000 --> 00:48:51.000] You don't know how many people died in Russia.
seek = 293100, seek_delta = 2800
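For context on the "failed due to entropy" lines above: as I understand the heuristic in whisper.cpp, the entropy is computed over the empirical distribution of the recently decoded token ids, so a decoder stuck in a loop concentrates on a few tokens and scores low entropy, which is what the -et threshold rejects. A self-contained sketch of that idea (not the exact code in whisper.cpp):

```cpp
#include <cmath>
#include <map>
#include <vector>

// Sketch of the repetition check behind "failed due to entropy":
// compute the entropy of the empirical distribution of token ids.
// Highly repetitive output concentrates probability mass on a few
// tokens, so the entropy drops below the threshold and the decoder
// result is rejected (triggering a fallback).
double token_entropy(const std::vector<int>& tokens) {
    std::map<int, int> counts;
    for (int t : tokens) counts[t]++;
    double entropy = 0.0;
    for (const auto& [id, c] : counts) {
        const double p = double(c) / tokens.size();
        entropy -= p * std::log(p);
    }
    return entropy;
}
```

A segment that repeats "He has never done a single thing." over and over draws from a small token set and scores well below a varied segment of the same length, which is why raising -et catches more of these loops.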
In response to https://github.com/ggerganov/whisper.cpp/issues/508#issuecomment-1435907929, I experimented with raising the entropy threshold (to 2.8 and 3.5). This avoids specific duplications but does not solve all cases, and I'm not sure I fully understand the tradeoffs among the fine-tuning parameters. I'm also looking for suggestions on beam size.
I am trying to optimize for quality over processing time. This is possibly a naive question, but since there are a number of parameters to fine-tune: is there guidance on the temperature, fallback temperature, beam_size, best_of count, and entropy settings to avoid this behavior? Alternatively, are there defaults from OpenAI's implementation that we can mirror, or can a preprocessing stage or transcription strategy (breaking up long audio files) reduce the likelihood of this error? I saw a comment in the thread about building with a different optimization level, but I'm not sure how to do that or whether it is a recommended strategy.
The model is hallucinating. You may be able to improve the behavior by trying -bo 7 or some number larger than the default of 5. The other thing to try is building with a different optimization level: -O3 instead of -O2, or vice versa.
It appears an entropy threshold of 2.8 would have resolved the case above, but additional duplicated lines still appear with even higher entropy values, and I'm not sure whether raising the threshold further would make the transcription overly cautious. I'm also not sure what the "failed due to entropy" message indicates.
...
[01:12:04.960 --> 01:12:05.960] He blew it. ← entropy = 2.94588
[01:12:05.960 --> 01:12:06.960] He blew it.
[01:12:06.960 --> 01:12:07.960] He blew it.
[01:12:07.960 --> 01:12:08.960] He blew it.
...
[01:12:11.960 --> 01:12:12.960] It was a threat. ← entropy = 2.94588
[01:12:12.960 --> 01:12:13.960] It was a threat.
[01:12:13.960 --> 01:12:14.960] It was a threat.
[01:12:14.960 --> 01:12:15.960] It was a threat.
...
[01:30:06.780 --> 01:30:07.780] That's not true. ← entropy = 3.18945
[01:30:07.780 --> 01:30:08.780] That's not true.
[01:30:08.780 --> 01:30:09.780] That's not true.
[01:30:09.780 --> 01:30:10.780] That's not true.
...
whisper_full_with_state: decoder 0: score = -0.22626, result_len = 202, avg_logprobs = -0.22626, entropy = 2.90255
whisper_full_with_state: decoder 1: score = -0.19905, result_len = 214, avg_logprobs = -0.19905, entropy = 2.46849
whisper_full_with_state: decoder 1: failed due to entropy 2.46849 < 2.80000
whisper_full_with_state: best decoder = 0
[01:34:05.040 --> 01:34:06.840] We're moving on to the next one.
[01:34:06.840 --> 01:34:07.840] We're moving on to the next one.
[01:34:07.840 --> 01:34:08.840] We're moving on to the next one.
[01:34:08.840 --> 01:34:09.840] We're moving on to the next one.
[01:34:09.840 --> 01:34:11.840] We're moving on to the next one.
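Until the decoding itself behaves, one workaround is to collapse runs of identical consecutive segments in post-processing. A sketch, using a hypothetical Segment struct mirroring the (t0, t1, text) triples that ./main prints:

```cpp
#include <string>
#include <vector>

// Hypothetical segment type: the (start, end, text) triples from the
// transcript output. Not a whisper.cpp API type.
struct Segment {
    long t0, t1;      // timestamps in ms
    std::string text;
};

// Post-processing workaround (not a fix for the decoding itself):
// collapse runs of consecutive segments with identical text into one
// segment spanning the whole run.
std::vector<Segment> collapse_repeats(const std::vector<Segment>& segs) {
    std::vector<Segment> out;
    for (const Segment& s : segs) {
        if (!out.empty() && out.back().text == s.text) {
            out.back().t1 = s.t1; // extend the previous segment instead
        } else {
            out.push_back(s);
        }
    }
    return out;
}
```

This obviously also merges genuinely repeated speech, so it is a band-aid rather than a fix.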
In reference to the audio file used to highlight the issue in https://github.com/ggerganov/whisper.cpp/issues/896#issuecomment-1562283987
@jordibruin I see this audio file performs reasonably well in MacWhisper. Did you face this issue and set a higher entropy threshold or beam size?
@ggerganov any guidance you could provide?
@pdw207
- Currently the temperature step is set to 0.4. Try to decrease it to 0.1, as in the original Whisper implementation.
- Increase beam size to 5: -bs 5
- Adjust the entropy threshold: -et 2.8
- Reduce the max context size: -mc 64
- Use a larger model
@ggerganov Appreciate the detailed response as those settings did resolve the issue.
@ggerganov Are those settings correct?
params.n_max_text_ctx = 64;
params.temperature_inc = 0.1f;
params.beam_search.beam_size = 5;
params.entropy_thold = 2.8f;
params is whisper_full_params.
The other settings are as follows:
params.print_realtime = false;
params.print_progress = false;
params.print_timestamps = false;
params.print_special = false;
params.translate = false;
params.language = m_languageCode.c_str();
params.n_threads = maxThreads;
params.offset_ms = 0;
params.no_context = false; // Since we read audio file block by block
params.single_segment = false;
params.token_timestamps = true;
params.progress_callback = internalProgressCallback;
params.progress_callback_user_data = this;
params.greedy.best_of = 2;
params.thold_pt = 0.01f;
params.thold_ptsum = 0.01f;
params.no_speech_thold = 0.6f;
params.logprob_thold = -1.0f;
params.length_penalty = -1;
params.new_segment_callback = internalSegmentCallback;
params.new_segment_callback_user_data = this;
// suppress tokens, like music, clap, see whisper.cpp:3225
// Don't set this to true; it will affect accuracy (not sure why)
params.suppress_non_speech_tokens = false;
After changing to the above settings, I still got the same duplicated words.
@leohuang2013 Do you have an audio file you can share and steps to reproduce?
This is the file I used.
@pdw207
- Currently the temperature step is set to 0.4. Try to decrease it to 0.1, as in the original Whisper implementation.
- Increase beam size to 5: -bs 5
- Adjust the entropy threshold: -et 2.8
- Reduce the max context size: -mc 64
- Use a larger model
Thanks, this helped me with the large model (ggml-model.bin).
Came here to solve the same problem, which I encountered when running large-v3 + CoreML: the transcription got stuck repeating itself at the end of a 2 hr 22 min recording.
I was able to get it unstuck using -bs 5 -et 2.8 -mc 64, without changing the temperature.
I'd love to figure out how to make it as efficient as possible to process large amounts of audio without getting stuck repeating itself. I'll keep experimenting, and please let me know if anyone has any ideas.
Update: it's also repeating itself with defaults on a small piece of audio in Spanish.
I was still running into issues with some of the above. As a workaround, I've been using a script to split the audio into smaller chunks. Script here if it helps anyone: https://github.com/ggerganov/whisper.cpp/issues/1851#issuecomment-2119262466
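The chunking idea amounts to splitting the recording into fixed-length windows with a small overlap, transcribing each chunk independently, and stitching the text back together so the decoder never carries a repetition loop across a long file. A sketch of the boundary computation (the 30 s chunk and 2 s overlap are illustrative values of mine, not the linked script's settings):

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Compute [start, end) sample ranges for overlapping chunks of a long
// recording, at 16 kHz. Each chunk is transcribed independently; the
// overlap lets the stitching step recover words that straddle a seam.
std::vector<std::pair<std::size_t, std::size_t>> chunk_ranges(
        std::size_t n_samples,
        std::size_t chunk   = 30 * 16000,  // 30 s
        std::size_t overlap = 2 * 16000) { // 2 s
    std::vector<std::pair<std::size_t, std::size_t>> ranges;
    std::size_t start = 0;
    while (start < n_samples) {
        const std::size_t end = std::min(start + chunk, n_samples);
        ranges.push_back({start, end});
        if (end == n_samples) break;
        start = end - overlap; // step back so words at the seam aren't lost
    }
    return ranges;
}
```

Stitching the per-chunk text back together still needs de-duplication across the overlap region, which is the fiddly part the linked script handles.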
I used the latest commit (bf2449d) with the model ggml-small.bin on macOS, executing: $> bin/main -m ../models/ggml-small.bin ~/tmp/wrongResultWithWhisper.wav
The output has many duplicated words, as below:

[00:00:33.000 --> 00:00:44.000] To this index, Earth has a rating of 0.829, but Kepler 442B has a rating of 0.836.
[00:00:44.000 --> 00:00:50.000] This is not certain because Kepler 442B's atmosphere and surface are unknown,
[00:00:50.000 --> 00:00:53.000] but this would be possible.
[00:00:54.000 --> 00:00:59.000] Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
[00:00:59.000 --> 00:01:04.000] but Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
[00:01:04.000 --> 00:01:09.000] so Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
[00:01:09.000 --> 00:01:14.000] but Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
[00:01:14.000 --> 00:01:19.000] so Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
[00:01:19.000 --> 00:01:24.000] but Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
[00:01:24.000 --> 00:01:29.000] so Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
[00:01:29.000 --> 00:01:34.000] but Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
[00:01:34.000 --> 00:01:39.000] so Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
[00:01:39.000 --> 00:01:43.000] but Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
[00:01:43.000 --> 00:01:49.000] so Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
[00:01:49.000 --> 00:01:54.000] but Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
[00:01:54.000 --> 00:01:59.000] so Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
[00:01:59.000 --> 00:02:04.000] so Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
[00:02:04.000 --> 00:02:09.000] so Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
[00:02:09.000 --> 00:02:14.000] so Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
[00:02:14.000 --> 00:02:19.000] so Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
[00:02:19.000 --> 00:02:24.000] so Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
[00:02:24.000 --> 00:02:29.000] so Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
[00:02:29.000 --> 00:02:33.000] so Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
[00:02:33.000 --> 00:02:38.000] so Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
[00:02:38.000 --> 00:02:43.000] so Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
[00:02:43.000 --> 00:02:48.000] so Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
I attached a sample wav file: wrongResultWithWhisper.wav.zip