Closed by jafioti 10 months ago
Yeah, that's unfortunately a little gnarly: both llm and whisper-rs use GGML, which is a C library with no function-name mangling, so the linker has to pick one of the two conflicting implementations (and I believe whisper's is much older). Honestly, I'm surprised it compiled at all!
I would quite like to see an implementation of whisper in Rust, but it would require someone with more free time than me to do it.
Depending on how badly you need it, you could fork whisper-rs and whisper.cpp and rename things so that there are no conflicts, but that's obviously not ideal. For a short-term hacky fix, I'd suggest breaking the whisper-rs code out into a separate application or dynamic library so that the linker never sees both GGML implementations 😦
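If you go the separate-application route, the split can be as small as a second workspace member that is the only crate depending on whisper-rs; the main app then talks to it over stdin/stdout or a socket. The crate name and version below are illustrative, not prescriptive:

```toml
# transcriber/Cargo.toml — a separate binary crate, so the final
# link of the main app never sees whisper.cpp's GGML symbols.
[package]
name = "transcriber"   # hypothetical crate name
version = "0.1.0"
edition = "2021"

[dependencies]
# Only this crate links whisper-rs (and thus whisper.cpp's GGML);
# the main app depends on llm alone and spawns this binary.
whisper-rs = "0.8"     # illustrative version
```

Because each binary gets its own link step, the two GGML implementations never meet in the same linker invocation.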
Candle has a fully Rust-native whisper example, which runs relatively fast. It doesn't support GGML models yet, but that's currently being worked on.
Of course! Do you know if they have any plans to break out the examples into their own libraries?
Well, I actually don't know. I'm currently only focused on helping a bit with the quantization support, but I'd guess they wouldn't be unwilling to split it into libraries.
Happy to report that the candle whisper demo works great! Certainly slower than ggml, but still reasonably fast. I'll close this out since it's not really an issue with this crate in particular.
@jafioti Theoretically, candle should support quantized GGML tensors since yesterday, meaning you could probably recreate whisper.cpp with candle as a backend and get basically the same performance. Currently only q4_0 is supported, but I'm planning to port most of the quantization formats over.
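For context, q4_0 packs weights in blocks of 32 that share a single scale, with each weight stored in 4 bits. Here is a toy pure-Rust sketch of the scheme (illustrative only: real ggml packs two 4-bit values per byte and stores the scale as f16):

```rust
// Toy q4_0-style quantization: 32 weights per block, one scale,
// each weight mapped to a 4-bit code in 0..=15 centered on 8.
const BLOCK: usize = 32;

fn quantize_q4_0(block: &[f32; BLOCK]) -> (f32, [u8; BLOCK]) {
    // Take the element with the largest magnitude and map it to -8.
    let max = block
        .iter()
        .copied()
        .fold(0.0f32, |m, v| if v.abs() > m.abs() { v } else { m });
    let d = max / -8.0; // scale shared by the whole block
    let id = if d != 0.0 { 1.0 / d } else { 0.0 };
    let mut qs = [0u8; BLOCK];
    for (q, &x) in qs.iter_mut().zip(block) {
        // Shift into 0..=15 around the midpoint 8 (truncation of
        // v + 0.5 rounds to nearest for the non-negative v here).
        *q = ((x * id + 8.5) as i32).clamp(0, 15) as u8;
    }
    (d, qs)
}

fn dequantize_q4_0(d: f32, qs: &[u8; BLOCK]) -> [f32; BLOCK] {
    let mut out = [0.0f32; BLOCK];
    for (y, &q) in out.iter_mut().zip(qs) {
        *y = d * (q as f32 - 8.0);
    }
    out
}

fn main() {
    let weights: [f32; BLOCK] = core::array::from_fn(|i| (i as f32 - 16.0) / 10.0);
    let (d, qs) = quantize_q4_0(&weights);
    let restored = dequantize_q4_0(d, &qs);
    let max_err = weights
        .iter()
        .zip(&restored)
        .map(|(a, b)| (a - b).abs())
        .fold(0.0f32, f32::max);
    println!("scale = {d:.4}, max reconstruction error = {max_err:.4}");
    // Each value is off by at most half a quantization step.
    assert!(max_err <= d.abs() / 2.0 + 1e-6);
}
```

The f32/f16 output mentioned above comes from dequantizing inside the matmul kernel, so only the stored weights pay the 4-bit precision cost.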
Is there an example of using the 4-bit quantization? I'm using candle's llama, but when I set the dtype to u8 I get "not implemented" errors.
Take a look at the quantized llama example. Basically, only the matmul operation supports quantized tensors, and it always produces an f32/f16 output, meaning your weights are stored in the quantized format but during inference you can use all candle operations as normal. You can create these QTensors either from a GGML file or from normal f32 tensors by quantizing them.
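For readers finding this later, the flow described above looks roughly like the sketch below. This is my reading of candle's quantized module, not verified against a specific release; check `QTensor::quantize`, `QMatMul::from_qtensor`, and the module paths against the candle version you're using:

```rust
// Sketch only: quantize an f32 weight matrix to q4_0, then matmul
// against it. Only the matmul consumes the quantized tensor; its
// output is an ordinary f32 tensor usable by any candle op.
use candle_core::quantized::{GgmlDType, QMatMul, QTensor};
use candle_core::{Result, Tensor};

fn quantized_matmul(x: &Tensor, weight_f32: &Tensor) -> Result<Tensor> {
    // Store the weights in the quantized format...
    let qweight = QTensor::quantize(weight_f32, GgmlDType::Q4_0)?;
    // ...and run the one operation that understands QTensors.
    let mm = QMatMul::from_qtensor(qweight)?;
    mm.forward(x) // f32 output; downstream ops proceed as usual
}
```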
Hi, I'm trying to use llm in the same project where I'm already using whisper-rs (https://github.com/tazz4843/whisper-rs), and the GGMLs from the two projects seem to be interfering with each other. Could it be because the crates look for the same files, and cargo folds them into the same dependency?
For instance, when I load up a model in llm, I get this error:
thread 'main' panicked at 'called Result::unwrap() on an Err value: InvariantBroken { path: Some("./models/llama-2-7b-chat.ggmlv3.q4_0.bin"), invariant: "226001103 <= 2" }'
When I remove whisper-rs from my project, it compiles and runs fine.
Any ideas how to resolve this? I'd assume I can just rename one of the sys crates, but it doesn't seem to be helping.