Open WenqingZong opened 9 months ago
Hi,
Thank you for including the actual samples with your report. I was able to reproduce this exact behavior on my own CPU-only setup. Yup, it's a bug. (And no, I don't have a fix.)
Frankly, my recommendation is to forget about candle and use one of the pythonesque Whisper variants instead, if you can. I've had good results with whisper-timestamped and whisper.cpp, for example. Btw, whisper-timestamped also has trouble with these samples without the --suppress-tokens "" trick, though the failure mode isn't exactly the same (different codebase).
If you really, really want to use candle, you might want to consider using a bigger (and quantized?) model. VAD preprocessing might also help some.
NB. Don't take this as criticism of candle; it's a fantastic project, but imho it's not the best choice for an end user right now (for Whisper). If you're a developer or tester wanting to take a deep dive into Rust code, well then it's a different story :D (For example, I think there's this thing called a "repetition penalty"; maybe that's not used by candle?)
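For reference, a repetition penalty (as used in many decoder sampling loops) just rescales the logits of already-generated tokens before the next sampling step. A minimal sketch in plain Rust; the function name and shape are mine, not candle's API:

```rust
// Hypothetical repetition-penalty sketch (illustrative, not candle's API).
// Tokens that have already been generated get their logits scaled down
// before the next argmax/sampling step, making loops less likely.
fn apply_repetition_penalty(logits: &mut [f32], generated: &[usize], penalty: f32) {
    for &tok in generated {
        let l = &mut logits[tok];
        // Dividing a positive logit and multiplying a negative one
        // both reduce that token's probability.
        if *l > 0.0 {
            *l /= penalty;
        } else {
            *l *= penalty;
        }
    }
}
```

A penalty of 1.0 is a no-op; values around 1.1–1.3 are common defaults in other inference stacks.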
vvv Agreed :D
Indeed this seems like a bug, sorry about it. That said things won't improve if problems are not reported and bugs are not fixed so I would encourage people to continue trying it - and ideally submit fixes if they find some culprits. The ecosystem is not going to build itself and the only chance for candle (or any other alternative ML framework) to grow is thanks to users investing some of their time making things better.
Hi, thanks to both of you.
I know there are other alternatives, but our codebase is purely Rust, so we'd like to continue with candle.
As a software engineer myself, I'm well aware that unexpected bugs can arise. I won't blame candle for this; I know this is a great project, enriching the AI/ML ecosystem for Rust.
I'm currently trying to solve this problem. I've noticed that the candle implementation lacks some logic compared to the Python version, e.g. ApplyTimestampRules. I'm adding these parts, but I'm not sure if this will solve the issue.
A possible workaround for this issue would be using VAD to segment the audio into chunks of less than 30 seconds, but that's more of a cover-up than a proper fix, so it won't be my first choice. I might eventually go this way, though, if the "proper fix" turns out to be too complex and time-consuming.
Yes, I just played around with VAD preprocessing, and it definitely helps with some of these samples :D But music or background noise can trip it up, so it's not a panacea by any means.
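For anyone wanting to try the segmentation idea, here's a toy sketch of splitting 16 kHz mono PCM at quiet points so each chunk stays under 30 s. This is a naive energy heuristic, not a real VAD (in practice you'd want something like Silero or WebRTC VAD); all names and constants are illustrative:

```rust
// Toy "VAD" segmentation: split PCM into chunks of at most 30 s,
// cutting at the lowest-energy 20 ms frame in the last 5 s of each window.
const SAMPLE_RATE: usize = 16_000;
const FRAME: usize = SAMPLE_RATE / 50; // 20 ms
const MAX_CHUNK: usize = 30 * SAMPLE_RATE;

fn split_on_silence(pcm: &[f32]) -> Vec<Vec<f32>> {
    let mut chunks = Vec::new();
    let mut start = 0;
    while pcm.len() - start > MAX_CHUNK {
        // Search the last 5 s of the 30 s window for the quietest frame.
        let search_from = start + MAX_CHUNK - 5 * SAMPLE_RATE;
        let search_to = start + MAX_CHUNK;
        let mut best = search_to;
        let mut best_energy = f32::INFINITY;
        let mut i = search_from;
        while i + FRAME <= search_to {
            let energy: f32 = pcm[i..i + FRAME].iter().map(|s| s * s).sum();
            if energy < best_energy {
                best_energy = energy;
                best = i;
            }
            i += FRAME;
        }
        chunks.push(pcm[start..best].to_vec());
        start = best;
    }
    chunks.push(pcm[start..].to_vec());
    chunks
}
```

Cutting at the quietest frame rather than at a hard 30 s boundary avoids splitting in the middle of a word, which is the main failure mode of naive chunking.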
Here's how whisper.cpp deals with the repetitions: https://github.com/ggerganov/whisper.cpp/blob/29511d33c76e7cb9f3a353ad02cded7d5bbc4b16/whisper.cpp#L4948-L4969
They seem to calculate an entropy value over a window of recent tokens and compare it against a threshold.
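As I read that snippet, the idea is roughly: count how often each token appears in a recent window, compute the Shannon entropy of that distribution, and treat a low value as a sign of a repetition loop (which then triggers a fallback re-decode at higher temperature). A rough Rust sketch of the heuristic; the names are mine, not whisper.cpp's API:

```rust
use std::collections::HashMap;

// Shannon entropy (in nats) of the token-frequency distribution in a window.
// A looping decode like [A, B, A, B, ...] concentrates mass on few tokens,
// so its entropy is low; varied text has higher entropy.
fn token_entropy(tokens: &[u32]) -> f64 {
    let mut counts: HashMap<u32, usize> = HashMap::new();
    for &t in tokens {
        *counts.entry(t).or_insert(0) += 1;
    }
    let n = tokens.len() as f64;
    counts
        .values()
        .map(|&c| {
            let p = c as f64 / n;
            -p * p.ln()
        })
        .sum()
}

// Flag a window as "probably a repetition loop" if entropy drops below
// a tunable threshold.
fn looks_repetitive(tokens: &[u32], threshold: f64) -> bool {
    token_entropy(tokens) < threshold
}
```

Something like this could be bolted onto candle's decode loop as a fallback trigger, independent of any timestamp-rule fixes.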
From my brief testing, switching candle to use this commit (the latest when I was using it) seems to resolve the issue. Perhaps #1424 did the trick?
Hi, I don't think it's been fully fixed. I can see the output has improved for samples prodrecall_call.wav and G437-1-segment.wav (they are perfect now), but for sample long_test.wav it became even weirder:
0.0s -- 30.0s: Hi this is Tristan how can I help you this afternoon?
30.0s -- 60.0s: I'm so sorry younger. Let me do the living room count and find out what's going on.
60.0s -- 90.0s: Fine. The first fee showed up 3 months ago. If you can't pay it, I'll pay it.
Also, for many other audio samples, it's outputting random transcriptions.
root@3f5389f502e4:/workspaces/candle/candle-examples/examples/whisper# cargo run --example=whisper --release --features=cuda -- --model=large-v2 --input=/workspaces/candle/data/short_test_16k.wav
Finished release [optimized] target(s) in 0.19s
Running `/workspaces/candle/target/release/examples/whisper --model=large-v2 --input=/workspaces/candle/data/short_test_16k.wav`
loaded wav data: Header { audio_format: 1, channel_count: 2, sampling_rate: 16000, bytes_per_second: 64000, bytes_per_sample: 4, bits_per_sample: 16 }
pcm data loaded 198391
loaded mel: [1, 80, 3000]
latin: 0.7516716
javanese: 0.13717812
faroese: 0.025790816
english: 0.023713117
nynorsk: 0.019385228
0.0s -- 30.0s: USA1N8 and YTK so we're now at SR1764.
(Sorry I cannot share this file as it contains confidential information, but the correct transcription should be: "Ok, so I want to open a bank account and my number is [CONFIDENTIAL], and my last three digits of my bank account are 5, 5, 2 ")
BTW, https://github.com/huggingface/candle/pull/1424 just adds a library function and its tests; it doesn't do anything that actually helps solve this bug.
A similar issue was raised just a few months ago, but it seems it's still not fully solved.
failed_audios.zip
I've attached some audio samples in this issue to help our discussion. Currently, candle's Whisper outputs repeating sentences (sometimes even incomplete words) once the audio file reaches 30s. Here's what happens with the three audio samples I attached:

- Repeats "a little bit of", plus trailing text.
- Repeats "I'm so sorry to hear that", plus a trailing ".com" that comes from nowhere.
- Repeats "I think that's a good idea."

I've been struggling with this bug for a long time and I'm still not sure how to fix it. I'd appreciate it if you can give me some suggestions.