rustformers / llm

[Unmaintained, see README] An ecosystem of Rust libraries for working with large language models
https://docs.rs/llm/latest/llm/
Apache License 2.0

Bloom 176B inference is broken #361

Open HaileyStorm opened 1 year ago

HaileyStorm commented 1 year ago

I have a q4_0 quant of bloomz-176B (created with bloomz.cpp). As with bloomz.cpp, inference on this model is broken. For example, this command:

llm infer -a bloom -m bloomz_q4_0.bin -p "A short story about llamas:"

Gives the error:

invariant broken: 990305134 <= 2 in Some("bloomz_q4_0.bin")

This might be the same thing preventing bloomz.cpp from doing the inference (it works, but seems to skip tokens, more and more as it goes along; see https://github.com/NouamaneTazi/bloomz.cpp/issues/14#issuecomment-1627486803). But it's not clear exactly what the problem is there.

LLukas22 commented 1 year ago

Hm, that's a tricky one. Most likely the hyperparameters of the model diverge from the format defined by llama.cpp, but llm should be able to handle that.

Could you split off the first ~50,000 bytes of the model file and upload them? I'll probably have to look at the hyperparameters of the model file, and that way I don't have to download the whole model.
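
Something like the following standalone Rust sketch does that (the file names are only examples; it's the equivalent of head -c 50000 on the command line):

use std::fs::File;
use std::io::{Read, Write};

fn main() -> std::io::Result<()> {
    // Copy only the first 50,000 bytes (magic, hyperparameters and, presumably,
    // the start of the vocabulary) into a small file that is safe to upload.
    let mut model = File::open("bloomz_q4_0.bin")?;
    let mut header = vec![0u8; 50_000];
    model.read_exact(&mut header)?;
    File::create("bloomz_q4_0_50kb.bin")?.write_all(&header)?;
    Ok(())
}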

HaileyStorm commented 1 year ago

bloomz_q4_0_50kb.zip

Certainly! Here it is. Thanks for taking a look.

philpax commented 1 year ago

Hey there!

Just filling in some details: our implementation is based on bloomz.cpp from a few months ago. It's likely that we've inherited whatever issues it has.

It looks like that quantization comes from a fork whose changes we'd have to duplicate. A quick look at it suggests that it'll require more engineering work, so it might be some time before we can do anything.


As an aside, it needs a lot of RAM for inference, and the inference times seem incredibly slow. This might be hard to test / may be of limited use, and as far as I know, BLOOM's performance is not fantastic compared to other models.

Is there any particular reason why you'd like to use BLOOM?

LLukas22 commented 1 year ago

I took a short look at the hyperparameters and they look valid. This file was probably created by an older version of ggml that hadn't yet adjusted the tensor metadata size, which makes this model incompatible with any newer version of ggml.
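
For context, a GGML-era file stores each tensor as a small metadata header followed by the raw weights, and everything is read sequentially, so a writer and reader that disagree on any field size go out of sync. The sketch below is only a rough approximation of that layout (field order and sizes from memory, not the actual llm loader), to illustrate how a misaligned read ends up treating unrelated bytes as n_dims and tripping an invariant check with a huge value like the 990305134 above:

use std::io::{self, Read};

fn read_i32(reader: &mut impl Read) -> io::Result<i32> {
    let mut buf = [0u8; 4];
    reader.read_exact(&mut buf)?;
    Ok(i32::from_le_bytes(buf))
}

// Approximate per-tensor header: number of dimensions, name length and
// element type as 32-bit little-endian integers, then the dimensions,
// then the name bytes, then the raw weight data. If the file was written
// with different field sizes (e.g. int widened to int64_t on the writer
// side), n_dims is read from the wrong offset and the sanity check fails.
fn read_tensor_header(reader: &mut impl Read) -> io::Result<(Vec<i32>, String, i32)> {
    let n_dims = read_i32(reader)?;
    let name_len = read_i32(reader)?;
    let ftype = read_i32(reader)?;
    let mut dims = Vec::new();
    for _ in 0..n_dims {
        dims.push(read_i32(reader)?);
    }
    let mut name = vec![0u8; name_len as usize];
    reader.read_exact(&mut name)?;
    Ok((dims, String::from_utf8_lossy(&name).into_owned(), ftype))
}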

HaileyStorm commented 1 year ago

I personally would like to test its abilities with tool selection and function calling, both things all the 33-65B models I've tried struggle with, at least in terms of consistency... and eventually I'd like to see how it responds to LoRAs to help with that. I'm also just curious to see its logic abilities, and I'm interested in seeing the open/local community stretch the limits and continue pushing things forward. My interest is also more in BloomChat than Bloomz.

FWIW, this arises from an effort by TheBloke to create quantizations community members can run at home... there are a number of people on his Discord who have expressed interest, and most can't run the GPTQ quant he just made.

You're correct that the quant was created using a fork. Its changes include switching some data types (from int to int64_t or size_t) and increasing RAM allocation. I definitely understand this could be a complex and low-priority issue.

HaileyStorm commented 1 year ago

Ah... the fork used to create the quant is older than the code used as the basis for this implementation, which I assume means this isn't an issue for this repo. The good news is that it should be possible to apply the patch to a later version and create a new quant, which will hopefully work with bloomz.cpp and/or this library. Thanks!

philpax commented 1 year ago

No worries! Let us know how you go - I'd love to hear about scientific study of the BLOOM models. I'll admit I'd written them off after hearing about their initial performance - but perhaps they're hiding some secrets yet 👀

I'll leave this issue open so that people know what to expect if they try using models from that fork.

HaileyStorm commented 1 year ago

Well, we'll see, maybe in the end it's worth writing off ;) I'd like to give it a thorough try though. Will keep you posted... TheBloke indicated he'd try to get a quant from a patched newer version tomorrow, so if that goes smoothly we'll have an update soon.

0x7CFE commented 1 year ago

I have similar issues with the ggml-vicuna-13b-4bit.bin model. I downloaded it quite a while ago, around the time Vicuna was first announced.

Loaded hyperparameters
thread 'main' panicked at 'Failed to load LLaMA model from "../llama.cpp/models/ggml-vicuna-13b-4bit.bin": invariant broken: 1007030857 <= 2 in Some("../llama.cpp/models/ggml-vicuna-13b-4bit.bin")', crates/llm/examples/vicuna-chat.rs:41:9
philpax commented 1 year ago

Unfortunately, there have been quite a few format breaks since then. That should be caught with a meaningful error, but some models predate that versioning scheme. You'll have to download a more recent model - I'd suggest something from TheBloke's collection, which keeps up to date with GGML/llama.cpp/us.
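
For anyone who wants to check what they have before loading it: these files start with a four-byte magic and, for the newer containers, a version number right after it. A minimal sketch of that check (the magic values are quoted from memory from the llama.cpp lineage, so treat them as illustrative rather than authoritative):

use std::fs::File;
use std::io::{self, Read};

// Magic values as used by the llama.cpp family of formats.
const MAGIC_GGML: u32 = 0x6767_6d6c; // legacy, unversioned container
const MAGIC_GGMF: u32 = 0x6767_6d66; // first versioned container
const MAGIC_GGJT: u32 = 0x6767_6a74; // versioned, mmap-aligned container

fn main() -> io::Result<()> {
    // The path is just an example.
    let mut file = File::open("ggml-vicuna-13b-4bit.bin")?;
    let mut buf = [0u8; 4];
    file.read_exact(&mut buf)?;
    match u32::from_le_bytes(buf) {
        MAGIC_GGML => println!("legacy ggml file: no version field, may predate the format breaks"),
        MAGIC_GGMF | MAGIC_GGJT => {
            file.read_exact(&mut buf)?;
            println!("versioned container, version {}", u32::from_le_bytes(buf));
        }
        other => println!("unrecognised magic 0x{other:08x}"),
    }
    Ok(())
}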