Hm, that's a tricky one. Most likely the hyperparameters of the model diverge from the format defined by llama.cpp, but llm should be able to handle that.
Could you split off the first ~50,000 bytes of the model file and upload them? I'll probably have to look at the hyperparameters of the model file, and that way I don't have to download the whole model.
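If it's easier, something like this should do it (just a quick sketch using only the Rust standard library; the file names and the ~50,000-byte cutoff are placeholders, adjust as needed):

```rust
use std::fs::File;
use std::io::{Read, Write};

fn main() -> std::io::Result<()> {
    // Take only the first ~50,000 bytes of the model file (placeholder path).
    let mut header = Vec::with_capacity(50_000);
    File::open("bloomz_q4_0.bin")?
        .take(50_000)
        .read_to_end(&mut header)?;

    // Write them to a separate file that can be attached here.
    File::create("bloomz_q4_0.header.bin")?.write_all(&header)?;
    Ok(())
}
```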
Certainly! Here it is. Thanks for taking a look.
Hey there!
Just filling in some details: our implementation is based on bloomz.cpp from a few months ago, so it's likely we've inherited whatever issues it has.
It looks like that quantization comes from a fork, which we'd have to duplicate. A quick look at it suggests that it'll require more engineering work, so it might be some time before we can do anything.
As an aside, it needs a lot of RAM for inference, and the inference times seem incredibly slow. This might be hard to test / may be of limited use, and as far as I know, BLOOM's performance is not fantastic compared to other models.
Is there any particular reason why you'd like to use BLOOM?
After a short look at the hyperparameters, they look valid. This file was probably created by an older version of ggml that didn't yet adjust the tensor metadata size, which makes this model incompatible with any newer version of ggml.
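For reference, this is roughly how you can tell which container format a GGML file uses just from the first few bytes (a quick sketch; the magic values are the ones I remember from llama.cpp, so double-check them, and the path is a placeholder):

```rust
use std::fs::File;
use std::io::Read;

fn main() -> std::io::Result<()> {
    let mut file = File::open("bloomz_q4_0.bin")?; // placeholder path
    let mut buf = [0u8; 8];
    file.read_exact(&mut buf)?;

    // Legacy containers start with a little-endian u32 magic; the versioned
    // ones follow it with a u32 format version. Unversioned files have other
    // hyperparameter data in those second four bytes instead.
    let magic = u32::from_le_bytes([buf[0], buf[1], buf[2], buf[3]]);
    let version = u32::from_le_bytes([buf[4], buf[5], buf[6], buf[7]]);

    match magic {
        0x6767_6d6c => println!("ggml (unversioned legacy format)"),
        0x6767_6d66 => println!("ggmf, version {version}"),
        0x6767_6a74 => println!("ggjt, version {version}"),
        other => println!("unknown magic {other:#010x}"),
    }
    Ok(())
}
```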
I personally would like to test its abilities with tool selection and function calling, both things all the 33-65B models I've tried struggle with, at least in terms of consistency, and eventually I'd like to see how it responds to LoRAs to help with that. I'm also just curious about its logic abilities, and interested in seeing the open/local community stretch the limits and continue pushing things forward. My interest is also more in BloomChat than Bloomz.
FWIW, this arises from an effort by TheBloke to create quantizations community members can run at home. There are a number of people on his Discord who have expressed interest, and most can't run the GPTQ quant he just made.
You're correct that the quant was created using a fork. It includes changing some data types (from int to int64_t or size_t) and increasing RAM allocation. I definitely understand this could be a complex and low-priority issue.
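For anyone wondering why the int → int64_t change is needed: at 176B scale a single tensor can have more elements than a 32-bit int can hold. A rough back-of-the-envelope check (BLOOM's vocab/hidden sizes here are from memory, so treat them as approximate):

```rust
fn main() {
    // Approximate BLOOM-176B embedding matrix dimensions (from memory,
    // illustration only): vocab_size x hidden_size.
    let vocab_size: i64 = 250_880;
    let hidden_size: i64 = 14_336;

    let elements = vocab_size * hidden_size; // ~3.6 billion
    println!("embedding elements:   {elements}");
    println!("i32::MAX:             {}", i32::MAX);
    println!("fits in a 32-bit int? {}", elements <= i32::MAX as i64);
    // The element count (let alone the byte count) overflows a 32-bit int,
    // which is why the fork widens those fields to int64_t / size_t.
}
```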
Ah ... the fork used to create the quant is older than the code used as the basis for this implementation, which I assume means this isn't an issue for this repo. The good news is that it should be possible to apply the patch to a later version and create a new quant, which will hopefully work with bloomz.cpp and/or this library. Thanks!
No worries! Let us know how you go - I'd love to hear about scientific study of the BLOOM models. I'll admit I'd written them off after hearing about their initial performance - but perhaps they're hiding some secrets yet 👀
I'll leave this issue open so that people know what to expect if they try using models from that fork.
Well, we'll see, maybe in the end it's worth writing off ;) I'd like to give it a thorough try though. Will keep you posted ... TheBloke indicated he'd try to get a quant from a patched newer version tomorrow, so if that goes smoothly we'll have an update soon.
Have similar issues with the ggml-vicuna-13b-4bit.bin model. I downloaded it quite a while ago, around the time Vicuna was first announced.
Loaded hyperparameters
thread 'main' panicked at 'Failed to load LLaMA model from "../llama.cpp/models/ggml-vicuna-13b-4bit.bin": invariant broken: 1007030857 <= 2 in Some("../llama.cpp/models/ggml-vicuna-13b-4bit.bin")', crates/llm/examples/vicuna-chat.rs:41:9
Unfortunately, there have been quite a few format breaks since then. That should be caught with a meaningful error, but some models predate that versioning scheme. You'll have to download a more recent model - I'd suggest something from TheBloke's collection, which keeps up to date with GGML/llama.cpp/us.
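To give a rough idea of why the message looks like that (this is a simplified illustration, not the actual loader code): the loader reads a small integer where it expects a version-style header field and rejects anything above the highest value it knows about. A file written before that field existed has unrelated bytes at that offset, so you see an arbitrary large number instead of a clean "this file is too old" error.

```rust
use std::io::Read;

// Simplified stand-in for the kind of check behind
// "invariant broken: 1007030857 <= 2": read a little-endian u32 where a small
// version-like value is expected and reject anything out of range.
// (Illustrative only; the real loader's fields and offsets differ.)
fn check_version(reader: &mut impl Read) -> Result<u32, String> {
    let mut buf = [0u8; 4];
    reader.read_exact(&mut buf).map_err(|e| e.to_string())?;
    let version = u32::from_le_bytes(buf);
    const MAX_KNOWN: u32 = 2;
    if version > MAX_KNOWN {
        // In a pre-versioning file these four bytes belong to some other
        // field, hence the seemingly random number in the panic message.
        return Err(format!("invariant broken: {version} <= {MAX_KNOWN}"));
    }
    Ok(version)
}

fn main() {
    // Bytes that were never meant to be a version field.
    let old_file_bytes: &[u8] = &[0x49, 0x12, 0x06, 0x3c];
    println!("{:?}", check_version(&mut &old_file_bytes[..]));
    // Prints: Err("invariant broken: 1007030857 <= 2")
}
```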
I have a q4_0 quant of bloomz-176B (created with bloomz.cpp). As with bloomz.cpp, inference on this model is broken. For example, this command:
llm infer -a bloom -m bloomz_q4_0.bin -p "A short story about llamas:"
Gives the error:
invariant broken: 990305134 <= 2 in Some("bloomz_q4_0.bin")
This might be the same thing that's breaking inference in bloomz.cpp (there it runs, but seems to skip tokens, more and more as it goes along; see https://github.com/NouamaneTazi/bloomz.cpp/issues/14#issuecomment-1627486803). But it's not clear exactly what the problem is there.