rustformers / llm

[Unmaintained, see README] An ecosystem of Rust libraries for working with large language models
https://docs.rs/llm/latest/llm/
Apache License 2.0
6.06k stars 350 forks

Issues using with whisper-rs #408

Closed jafioti closed 10 months ago

jafioti commented 10 months ago

Hi, I'm trying to use llm in a project where I'm already using whisper-rs (https://github.com/tazz4843/whisper-rs), and the GGML copies bundled by the two crates seem to be interfering with each other. Could it be that both crates look for the same files, and Cargo folds them into the same dependency?

For instance, when I load up a model in llm, I get this error: `thread 'main' panicked at 'called Result::unwrap() on an Err value: InvariantBroken { path: Some("./models/llama-2-7b-chat.ggmlv3.q4_0.bin"), invariant: "226001103 <= 2" }'`

When I remove whisper-rs from my project, it compiles and runs fine.

Any ideas on how to resolve this? I assumed I could just rename one of the sys crates, but that doesn't seem to help.
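For reference, the load call is nothing unusual; it's essentially the example from the llm README (path aside):

```rust
// Minimal repro sketch, mirroring the llm README's load example.
let llama = llm::load::<llm::models::Llama>(
    std::path::Path::new("./models/llama-2-7b-chat.ggmlv3.q4_0.bin"),
    llm::TokenizerSource::Embedded,
    Default::default(),
    llm::load_progress_callback_stdout,
)
.unwrap(); // panics with the InvariantBroken error above
```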

philpax commented 10 months ago

Yeah, that's unfortunately a little gnarly because both llm and whisper-rs use GGML - which is a C library with no function name mangling - so the linker has to pick one of the two conflicting implementations (and I believe whisper's is much older). Honestly, I'm surprised it compiled at all!

I would quite like to see an implementation of whisper in Rust, but it would require someone with more free time than me to do it.

Depending on how badly you need it, you could fork whisper-rs and whisper.cpp and rename things so that there are no conflicts, but that's obviously not ideal. For a short-term hacky fix, I'd suggest just breaking out the whisper-rs code into a separate application or dynamic library to ensure that the linker doesn't see both GGML implementations 😦
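For the subprocess route, a minimal sketch (the `transcribe` binary name is hypothetical; it would be a small crate that depends only on whisper-rs):

```rust
use std::process::Command;

/// Hypothetical wrapper: shells out to a separate `transcribe` binary so
/// whisper's GGML copy never enters the same link step as llm's.
fn transcribe(audio_path: &str) -> std::io::Result<String> {
    let output = Command::new("transcribe").arg(audio_path).output()?;
    if !output.status.success() {
        return Err(std::io::Error::new(
            std::io::ErrorKind::Other,
            String::from_utf8_lossy(&output.stderr).into_owned(),
        ));
    }
    Ok(String::from_utf8_lossy(&output.stdout).into_owned())
}
```

The dynamic-library variant is the same idea: only one GGML per linked artifact.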

LLukas22 commented 10 months ago

Candle (https://github.com/huggingface/candle) has a completely Rust-native whisper example, which runs relatively fast. It doesn't support GGML models yet, but that's currently being worked on.

philpax commented 10 months ago

Of course! Do you know if they have any plans to break out the examples into their own libraries?

LLukas22 commented 10 months ago

Well, I actually don't know; I'm currently only focusing on helping a bit with the quantization support. But I'd guess they wouldn't be unwilling to split the examples into libraries.

jafioti commented 10 months ago

Happy to report that the candle whisper demo works great! It's certainly slower than ggml, but still reasonably fast. I'll close this out, since it's not really an issue with this crate in particular.

LLukas22 commented 10 months ago

@jafioti Theoretically, candle has supported quantized GGML tensors since yesterday, meaning you could probably recreate whisper.cpp with candle as a backend and get essentially the same performance. Currently only q4_0 is supported, but I'm planning to port most of the quantization formats over.

jafioti commented 10 months ago

> @jafioti Theoretically, candle has supported quantized GGML tensors since yesterday, meaning you could probably recreate whisper.cpp with candle as a backend and get essentially the same performance. Currently only q4_0 is supported, but I'm planning to port most of the quantization formats over.

Is there an example of using the 4-bit quantization? I'm using candle's llama, but when I set the dtype to u8 I get "not implemented" errors.

LLukas22 commented 10 months ago

> Is there an example of using the 4-bit quantization? I'm using candle's llama, but when I set the dtype to u8 I get "not implemented" errors.

Take a look at the quantized llama example. Basically, only the matmul operation supports quantized tensors, and it always produces an f32/f16 output; your weights are stored in the quantized format, but during inference you can use all candle operations as normal. You can create these QTensors either from a GGML file or from normal f32 tensors by quantizing them.
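For a concrete picture, here's a minimal sketch against candle's `quantized` module (crate `candle-core`; names like `QTensor::quantize`, `QMatMul::from_qtensor`, and `GgmlDType::Q4_0` follow the quantized llama example, and exact signatures may differ between candle versions):

```rust
use candle_core::quantized::{GgmlDType, QMatMul, QTensor};
use candle_core::{Device, Module, Tensor};

fn main() -> candle_core::Result<()> {
    let device = Device::Cpu;

    // Quantize ordinary f32 weights down to q4_0 storage
    // (dims must be a multiple of the 32-element block size).
    let weights = Tensor::randn(0f32, 1.0, (256, 256), &device)?;
    let qweights = QTensor::quantize(&weights, GgmlDType::Q4_0)?;

    // Matmul is the one op that consumes the quantized tensor directly;
    // it produces a regular f32 output.
    let matmul = QMatMul::from_qtensor(qweights)?;
    let xs = Tensor::randn(0f32, 1.0, (1, 256), &device)?;
    let ys = matmul.forward(&xs)?;

    // From here on, `ys` is a normal tensor and all candle ops apply.
    println!("{:?}", ys.shape());
    Ok(())
}
```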