huggingface / candle

Minimalist ML framework for Rust
Apache License 2.0

Whisper outputs repeating sentences #1422

Open WenqingZong opened 9 months ago

WenqingZong commented 9 months ago

A similar issue was raised a few months ago, but it still doesn't seem to be fully solved.

failed_audios.zip

I've attached some audio samples to help the discussion. Currently, candle's whisper outputs repeating sentences (sometimes even incomplete words) when the audio file reaches 30s. Here are the outputs for the three audio samples I attached:

(Repeats "a little bit of", plus a trailing ".")

# cargo run --example whisper --release --features=cuda -- --input=prodrecall_call.wav --timestamps
   Compiling candle-examples v0.3.1 (/workspaces/candle/candle-examples)
    Finished release [optimized] target(s) in 5.61s
     Running `target/release/examples/whisper --input=prodrecall_call.wav --timestamps`
loaded wav data: Header { audio_format: 1, channel_count: 1, sampling_rate: 16000, bytes_per_second: 32000, bytes_per_sample: 2, bits_per_sample: 16 }
pcm data loaded 1913152
loaded mel: [1, 80, 13500]
0.0s -- 30.0s
  0.0s-...: . I'm sorry. I can definitely help you with that. What is your first and last name? Sure. Steve Simmons. Okay. Can you spell that for me? S. P. E. V. E. Last name is
30.0s -- 60.0s
  0.0s-...: .com. Okay. Can I also have your email address? Sure. It's Steve Simmons, just like it's spelled at gmail.com. Okay. I have Steve Simmons, S-P-E-V-E-S-I-M-M-O-N-S at tmail.com. Yes. Okay. Great. I'm going to have a little bit of a little bit of a little bit of a little bit of a little bit of a little bit of a little bit of a little bit of a little bit of a little bit of a little bit of a little bit of a little bit of a little bit of a little bit of a little bit of a little bit of a little bit of a little bit of a little bit of a little bit of a little bit of a little bit of a little bit of a little bit of a little bit of a little bit of a little bit of a little bit of a little bit of a little bit of a little bit of a little bit of a little bit of a little bit of a little bit of a little
60.0s -- 90.0s
  0.0s-...: . I have located UNR system and you purchased the model number 23087, which has been recalled. I'm going to email you a pre-page shipping label to send back the recalled blender and we will be mailing you our latest model. Oh, thank you very much. So, go on the system and I just have your mailing address. Sure, it's 800 North Henderson Road in King of Prussia, Pennsylvania and the zip is one nine four of cents. Okay. I have a
90.0s -- 120.0s
  0.0s-...: . I'm going to take the new blender. So you can expect your new blender to arrive in about four to six business fees. Oh great. Thank you very much. No problem. Is there anything else I could help you with today? No, that's all I needed, but thank you. Thank you. Have a nice day. You too. Bye bye.
120.0s -- 135.0s
  0.0s-...: .

(Repeats "I'm so sorry to hear that", and a trailing ".com" comes from nowhere)

# cargo run --example whisper --release --features=cuda -- --input=long_test.wav --timestamps
    Finished release [optimized] target(s) in 0.22s
     Running `target/release/examples/whisper --input=long_test.wav --timestamps`
loaded wav data: Header { audio_format: 1, channel_count: 1, sampling_rate: 16000, bytes_per_second: 32000, bytes_per_sample: 2, bits_per_sample: 16 }
pcm data loaded 1081864
loaded mel: [1, 80, 9000]
0.0s -- 30.0s
  0.0s-...: . Hi, this is Jason. How can I help you this afternoon? You tell me I'm looking at my new statement and I see that you people have charged me again for the same fees I've been calling about for months now. I'm sick and tired of being told that it's handled one pretty clearly it's not. I'm so sorry to hear that. Let me take a look at your account and find out what's going on. Can you remind me when this started? I'm so sorry to hear that. I'm so sorry to hear that. I'm so sorry to hear that. I'm so sorry to hear that. I'm so sorry to hear that. I'm so sorry to hear that. I'm so sorry to hear that. I'm so sorry to hear that. I'm so sorry to hear that. I'm so sorry to hear that. I'm so sorry to hear that. I'm so sorry to hear that. I'm so sorry to hear that. I'm so sorry to hear that. I'm so sorry to hear that. I'm so sorry to
30.0s -- 60.0s
  0.0s-...: . I completely understand your frustration. Let me see what I can do to help get this taken care of for you right now. I'm glad I was able to get that issue taken care of. It may take up to 24 hours for your account to show the changes, but you shouldn't see any more of these fees in the future. Thanks for working with me to figure it out. I want to apologize again for the inconvenience. Thanks Jason, I appreciate your help.
60.0s -- 90.0s
  0.0s-...: .com

(Repeats "I think that's a good idea.")

# cargo run --example whisper --release --features=cuda -- --input=G437-1-segment.wav --timestamps
    Finished release [optimized] target(s) in 0.22s
     Running `target/release/examples/whisper --input=G437-1-segment.wav --timestamps`
loaded wav data: Header { audio_format: 1, channel_count: 1, sampling_rate: 16000, bytes_per_second: 32000, bytes_per_sample: 2, bits_per_sample: 16 }
pcm data loaded 1911376
loaded mel: [1, 80, 13500]
0.0s -- 30.0s
  0.0s-29.0s: . Okay, so there's two times. The most common one is when you go through a yellow light just about to turn red or maybe did turn red, you tap the roof of your car and you're safe. I don't know. And then the second one is if you're speeding or you're just being a bad person and you see a cop and for whatever reason the cop is a bad thing for you, you tap the roof of your car and everybody in the car has to tap the roof of the car.
30.0s -- 60.0s
  0.0s-...: . I don't do it because I think it's stupid. But my friends have done it and maybe is that just a Colorado thing or something? I don't know. I could just be me. I just haven't done it. Well leave a comment below if you've ever heard of this one before. Dropping on the floor, Michael. Okay. So yeah, what's on your
60.0s -- 90.0s
  0.0s-...: . I also have a theory about this one. The theory about this one is the time frame, the seven year time frame that's the sign that's breaking a mirror. Because, and I could be totally wrong on this, but my theory is that if you break a mirror, a mirror is glass, a mirror, you know, it's a shatter into, you know, millions of pieces. And maybe it's hard to find all of the pieces, maybe cleaning the mess of going to be difficult. So, I think that's a really good idea. I think that's a really good idea. I think that's a really good idea. I think that's a good idea. I think that's a good idea. I think that's a good idea. I think that's a good idea. I think that's a good idea. I think that's a good idea. I think that's a good idea. I think that's a good idea. I think that's a good idea. I think that's a good idea. I think that's a good idea. I think that's
90.0s -- 120.0s
  0.0s-...: . And then when you tell every night's glass and you're like, see you told yourself. I don't know, that's my theory on that one. But again, I think this is another instant, instant bad luck one. As soon as you break the mirror, you know, to clean it up, somebody could get cut, whatever.
120.0s -- 135.0s

I've been struggling with this bug for a long time and I'm still not sure how to fix it. I'd appreciate it if you can give me some suggestions.

misutoneko commented 9 months ago

Hi,

Thank you for including the actual samples with your report. I was able to reproduce this exact behavior on my own CPU-only setup. Yup, it's a bug. (And no, I don't have a fix.)

Frankly, my recommendation is to forget about candle and use one of the pythonesque Whisper variants instead, if you can. I've had good results with whisper-timestamped and whisper.cpp, for example. Btw, whisper-timestamped also has trouble with these samples without the --suppress-tokens "" trick, but it's not exactly the same (because of different codebase).

If you really, really want to use candle, you might want to consider using a bigger (and quantized?) model. VAD preprocessing might also help some.

NB. Don't take this as a criticism against candle, it's a fantastic project but imho it's not the best choice for an end user right now (for Whisper). If you're a developer or tester wanting to take a deep dive into Rust code, well then it's a different story :D (For example, I think there's this thing called "repetition penalty", maybe this is not used by candle?)
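For readers unfamiliar with the "repetition penalty" idea mentioned above, here is a minimal sketch in plain Rust. It is not taken from candle or from any Whisper implementation; the function name `apply_repetition_penalty` and its signature are hypothetical, and whether candle's whisper example applies such a penalty is not established in this thread. The idea is simply to make tokens that were already emitted less likely before the next token is chosen.

```rust
use std::collections::HashSet;

/// Hypothetical helper: lower the logits of tokens that already appear in
/// `history`, so the decoder is less likely to emit them again.
/// Positive logits are divided by `penalty`, negative ones are multiplied,
/// so a `penalty` greater than 1.0 always pushes the score down.
fn apply_repetition_penalty(logits: &mut [f32], history: &[u32], penalty: f32) {
    // Deduplicate so that each token is penalized once, however often it occurred.
    let seen: HashSet<u32> = history.iter().copied().collect();
    for token in seen {
        // Ignore ids that fall outside the vocabulary.
        if let Some(logit) = logits.get_mut(token as usize) {
            if *logit >= 0.0 {
                *logit /= penalty;
            } else {
                *logit *= penalty;
            }
        }
    }
}
```

A penalty of 1.0 leaves the logits unchanged. A penalty that is too large can also suppress legitimate repeats (names, digits), so it would need tuning on real audio.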

vvv Agreed :D

LaurentMazare commented 9 months ago

Indeed, this seems like a bug, sorry about it. That said, things won't improve if problems are not reported and bugs are not fixed, so I would encourage people to continue trying it, and ideally submit fixes if they find some culprits. The ecosystem is not going to build itself, and the only chance for candle (or any other alternative ML framework) to grow is thanks to users investing some of their time in making things better.

WenqingZong commented 9 months ago

Hi, thanks to both of you.

I know there are other alternatives, but our codebase is purely Rust, so we'd like to continue with candle.

As a software engineer myself, I'm well aware that unexpected bugs can arise. I won't blame candle for this; I know this is a great project, enriching the AI/ML ecosystem for Rust.

I'm currently trying to solve this problem. I've noticed that the candle implementation lacks some of the logic found in the Python version, including ApplyTimestampRules. I'm adding these parts, but I'm not sure if this will solve the issue.
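To give a rough idea of what such timestamp rules involve, below is a simplified sketch of just one of them: timestamp tokens must come in pairs. It reflects my understanding of the Python reference decoder and is not a port of it. The function name and the parameters `timestamp_begin` (first timestamp token id) and `eot` (end-of-text token id) are placeholders, and the other rules (for example the initial-timestamp limit and the probability-mass comparison) are left out.

```rust
/// Hypothetical, simplified version of the "timestamps come in pairs" rule.
/// `sampled` holds the tokens generated so far (without the prompt).
/// Ids below `eot` are text tokens; ids at or above `timestamp_begin` are timestamps.
fn apply_timestamp_pairing(logits: &mut [f32], sampled: &[u32], timestamp_begin: u32, eot: u32) {
    let is_ts = |t: u32| t >= timestamp_begin;
    let last_was_ts = sampled.last().map_or(false, |&t| is_ts(t));
    // With fewer than two tokens, the missing one is treated as a timestamp.
    let penultimate_was_ts = sampled.len() < 2 || is_ts(sampled[sampled.len() - 2]);

    if !last_was_ts {
        return; // Nothing to enforce right after a text token.
    }
    if penultimate_was_ts {
        // Two timestamps in a row: the next token must not be a timestamp.
        for l in logits.iter_mut().skip(timestamp_begin as usize) {
            *l = f32::NEG_INFINITY;
        }
    } else {
        // Text followed by one timestamp: the next token cannot be text,
        // so only a timestamp or end-of-text remains possible.
        for l in logits.iter_mut().take(eot as usize) {
            *l = f32::NEG_INFINITY;
        }
    }
}
```

The mask would be applied to the logits of each decoding step before sampling; token ids have to come from the actual tokenizer rather than the placeholder values used when experimenting.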

WenqingZong commented 9 months ago

A possible workaround for this issue would be to use VAD to split the audio into segments shorter than 30 seconds. That is more of a cover-up than a proper fix, so it won't be my first choice, but I might eventually go this way if the "proper fix" turns out to be too complex and time-consuming.
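A minimal sketch of that workaround, assuming a crude energy threshold stands in for a real VAD: the hypothetical `split_at_silence` below cuts the PCM buffer into ranges of at most `max_len` samples, preferring to cut at the last quiet frame inside each window and falling back to a hard cut when no quiet frame is found. At the 16 kHz sampling rate shown in the logs, 30 s corresponds to `max_len = 480_000`. The frame length and threshold are placeholders to tune.

```rust
/// Root-mean-square energy of one frame of samples.
fn rms(frame: &[f32]) -> f32 {
    if frame.is_empty() {
        return 0.0;
    }
    (frame.iter().map(|s| s * s).sum::<f32>() / frame.len() as f32).sqrt()
}

/// Hypothetical helper: split `samples` into (start, end) ranges of at most
/// `max_len` samples, cutting at the start of the last quiet frame if there is one.
fn split_at_silence(
    samples: &[f32],
    max_len: usize,
    frame_len: usize,
    threshold: f32,
) -> Vec<(usize, usize)> {
    assert!(frame_len > 0 && max_len >= frame_len);
    let mut ranges = Vec::new();
    let mut start = 0;
    while start < samples.len() {
        let hard_end = (start + max_len).min(samples.len());
        let mut end = hard_end;
        // Only look for a cut point if audio remains after this window.
        if hard_end < samples.len() {
            // Walk frames backwards from the hard limit; keep at least one
            // frame in the chunk so every chunk is non-empty.
            let mut frame_end = hard_end;
            while frame_end >= start + 2 * frame_len {
                let frame_start = frame_end - frame_len;
                if rms(&samples[frame_start..frame_end]) < threshold {
                    end = frame_start;
                    break;
                }
                frame_end -= frame_len;
            }
        }
        ranges.push((start, end));
        start = end;
    }
    ranges
}
```

Each returned range can then be transcribed on its own. A plain energy threshold treats background noise or music as speech, so a proper VAD model would likely do better on noisy recordings.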

misutoneko commented 9 months ago

Yes, I just played around with VAD preprocessing, it definitely helps with some of these samples :D But music or background noise can trip it up, so it's not a panacea by any means.

misutoneko commented 9 months ago

Here's how whisper.cpp deals with the repetitions: https://github.com/ggerganov/whisper.cpp/blob/29511d33c76e7cb9f3a353ad02cded7d5bbc4b16/whisper.cpp#L4948-L4969

They seem to calculate an entropy value over a number of tokens and compare it against a threshold value.
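A rough Rust sketch of that idea, not a port of the linked whisper.cpp code: count how often each token id occurs in the last `window` tokens, compute the Shannon entropy of that distribution, and flag the segment when the entropy drops below a threshold. The function names, the window size, and the threshold are all placeholders.

```rust
use std::collections::HashMap;

/// Shannon entropy (in nats) of the token-id distribution over the last
/// `window` tokens. A low value means a few tokens dominate, which is what
/// a repetition loop looks like. Returns 0.0 for an empty sequence.
fn token_entropy(tokens: &[u32], window: usize) -> f64 {
    let tail = &tokens[tokens.len().saturating_sub(window)..];
    if tail.is_empty() {
        return 0.0;
    }
    let mut counts: HashMap<u32, usize> = HashMap::new();
    for &t in tail {
        *counts.entry(t).or_insert(0) += 1;
    }
    let n = tail.len() as f64;
    counts
        .values()
        .map(|&c| {
            let p = c as f64 / n;
            -p * p.ln()
        })
        .sum()
}

/// Hypothetical check: treat the segment as degenerate when the entropy of
/// its most recent tokens falls below `threshold`.
fn looks_repetitive(tokens: &[u32], window: usize, threshold: f64) -> bool {
    token_entropy(tokens, window) < threshold
}
```

When the check fires, a decoder could retry the segment with a different temperature or discard the repeated tail; which reaction works best here is an open question.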

NatanFreeman commented 8 months ago

From my brief testing, switching candle to use this commit (the latest when I was using it) seems to resolve the issue. Perhaps #1424 did the trick?

WenqingZong commented 8 months ago

> From my brief testing, switching candle to use this commit (the latest when I was using it) seems to resolve the issue. Perhaps #1424 did the trick?

Hi, I don't think it's been fully fixed. The output has improved for prodrecall_call.wav and G437-1-segment.wav, which are perfect now, but for long_test.wav it became weirder:

0.0s -- 30.0s:  Hi this is Tristan how can I help you this afternoon?
30.0s -- 60.0s:  I'm so sorry younger. Let me do the living room count and find out what's going on.
60.0s -- 90.0s:  Fine. The first fee showed up 3 months ago. If you can't pay it, I'll pay it.

Also, for many other audio samples, it's outputting random transcriptions.

root@3f5389f502e4:/workspaces/candle/candle-examples/examples/whisper# cargo run --example=whisper --release --features=cuda -- --model=large-v2 --input=/workspaces/candle/data/short_test_16k.wav
    Finished release [optimized] target(s) in 0.19s
     Running `/workspaces/candle/target/release/examples/whisper --model=large-v2 --input=/workspaces/candle/data/short_test_16k.wav`
loaded wav data: Header { audio_format: 1, channel_count: 2, sampling_rate: 16000, bytes_per_second: 64000, bytes_per_sample: 4, bits_per_sample: 16 }
pcm data loaded 198391
loaded mel: [1, 80, 3000]
latin: 0.7516716
javanese: 0.13717812
faroese: 0.025790816
english: 0.023713117
nynorsk: 0.019385228
0.0s -- 30.0s:  USA1N8 and YTK so we're now at SR1764.

(Sorry, I cannot share this file as it contains confidential information, but the correct transcription should be: "Ok, so I want to open a bank account and my number is [CONFIDENTIAL], and my last three digits of my bank account are 5, 5, 2".)

BTW, https://github.com/huggingface/candle/pull/1424 just adds a library function and its tests; it does not do anything that actually helps solve this bug.