turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

Support MiniCPM architecture #479

Closed: meigami0 closed this issue 3 months ago

meigami0 commented 4 months ago

First of all, thank you for your work. I've been able to run many models locally and efficiently with exllamav2. Recently, I tried to use exllamav2 0.0.21 for inference and quantization on the MiniCPM-2B-128k model. The model can be loaded and run, but it only returns special tokens or repeating content instead of coherent text. I'm unsure whether the library supports this architecture. I would greatly appreciate any assistance or guidance you can provide on this issue.

exllamav2 Version: 0.0.21 Model: MiniCPM-2B-128k (https://github.com/OpenBMB/MiniCPM)

turboderp commented 4 months ago

There's no explicit support for that architecture, so you'll have been using the fallback mode, which is Llama. It looks like another one of those models that's just Llama with a tiny change to make it incompatible. Here it looks like the residual connections are scaled by 1.4/sqrt(40) and the logits are scaled by 1/9.

It's a small model, though, so I'll give it a look in a little bit to see if that's all it needs.
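
For reference, the deviation from a plain Llama block comes down to two extra multiplications. A minimal sketch (hypothetical names, assuming MiniCPM's published config values scale_depth = 1.4, num_hidden_layers = 40, hidden_size = 2304 and dim_model_base = 256; not exllamav2's actual module code):

```python
import math

# Hypothetical sketch of where MiniCPM deviates from Llama, based on the
# scales mentioned above; not exllamav2's actual implementation.

SCALE_DEPTH = 1.4        # MiniCPM config value (assumption)
NUM_LAYERS = 40
HIDDEN_SIZE = 2304
DIM_MODEL_BASE = 256     # MiniCPM config value (assumption)

RESIDUAL_SCALE = SCALE_DEPTH / math.sqrt(NUM_LAYERS)   # ~0.2214
LOGIT_SCALE = DIM_MODEL_BASE / HIDDEN_SIZE             # 1/9

def decoder_layer(hidden, attn, mlp, attn_norm, mlp_norm):
    # Same as a Llama block, except each sublayer output is scaled
    # before being added back to the residual stream.
    hidden = hidden + RESIDUAL_SCALE * attn(attn_norm(hidden))
    hidden = hidden + RESIDUAL_SCALE * mlp(mlp_norm(hidden))
    return hidden

def logits(hidden, lm_head):
    # Logits are scaled down by hidden_size / dim_model_base = 9.
    return lm_head(hidden) * LOGIT_SCALE
```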

turboderp commented 4 months ago

I'm a little conflicted and kinda want to boycott models like this. It comes down to a few redundant scaling parameters that could have been baked into the weights, making this just another rebranded Llama architecture. I did add it, though, with the latest commit in the dev branch.
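
For what it's worth, baking those scales into the checkpoint would be a small offline rewrite of the weights. A rough sketch under the usual HF Llama key layout (the key names and the tied-embedding caveat are assumptions, and this is not something exllamav2 or its conversion script does):

```python
import math

def bake_minicpm_scales(state_dict, num_layers=40, scale_depth=1.4,
                        hidden_size=2304, dim_model_base=256):
    """Fold MiniCPM's extra scales into the weights so the checkpoint
    behaves like plain Llama. Sketch only; key names are assumed."""
    residual_scale = scale_depth / math.sqrt(num_layers)
    logit_scale = dim_model_base / hidden_size
    for i in range(num_layers):
        # Scaling the output projections of attention and MLP is equivalent
        # to scaling each sublayer's output before the residual add.
        for key in (f"model.layers.{i}.self_attn.o_proj.weight",
                    f"model.layers.{i}.mlp.down_proj.weight"):
            state_dict[key] = state_dict[key] * residual_scale
    # Scaling lm_head is equivalent to scaling the logits. If lm_head is
    # tied to the input embeddings, it would have to be untied first.
    state_dict["lm_head.weight"] = state_dict["lm_head.weight"] * logit_scale
    return state_dict
```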

meigami0 commented 3 months ago

Thank you for your response and update. Here is my feedback after trying version 0.1.3:

The FP16 version of the model loads and runs inference normally with the FP16 and FP8 cache modes. In Q4 cache mode, however, it throws an error: "AssertionError: Cannot create Q4 cache. num_key_value_heads * head_dim does not split into blocks of 512". This seems to be because 36 * 64 = 2304 is not a multiple of 512.

After quantizing the model with convert.py, it cannot run inference properly (it returns special tokens or repeating content). These are the conversion commands:

python .\convert.py -i "MiniCPM-2B-128k_safetensors" -o "tmp" -nr -om "measurement.json"
python .\convert.py -i "MiniCPM-2B-128k_safetensors" -o "tmp" -nr -m "measurement.json" -cf "6bpw" -b 6

I also tried the -rs 4 parameter, but it doesn't seem to have any effect.
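
For reference, the arithmetic behind that assertion:

```python
# Values taken from the error above: 36 k/v heads of dimension 64.
num_key_value_heads = 36
head_dim = 64
kv_dim = num_key_value_heads * head_dim   # 2304
print(kv_dim % 512)                       # 256 -> 2304 does not split into 512-blocks
```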

turboderp commented 3 months ago

The Q4 issue is indeed because the k/v dimension isn't a multiple of the quant blocksize. The latest commit (dev branch) should address this (and allow any dimension to work in theory).

I'm not sure what's up with your quantized version, though. It seems to be working fine here. I've uploaded a pair of quants here you can maybe compare to.

| Model  | wiki2 ppl |
|--------|-----------|
| fp16   | 7.7215    |
| 4.0bpw | 7.8274    |
| 6.0bpw | 7.7374    |

One thing to note is that this is one of those models that needs a BOS token or it will just output gibberish:

Hello, my name is ae x', 50000000000000000000000000 500000000000 5000000050512222222222222222222222222222222222222222222222222222222222222222222222
<s> Hello, my name is Myra, my ideal match is someone who likes to explore, but also has a gentle and calm disposition. I like to get to know someone before getting into a relationship. I am looking for someone who wants to have a relationship that has a future and not just be someone's one night stand. I have no kids, and no pets. I am currently working on getting a job. I have a cat named Salem, but he does not live with me. He is a rescue from a shelter in Austin. He is sweet and playful, but sometimes lazy. He does like to play with the cat toys that I give him
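
For anyone hitting the same gibberish, a minimal sketch of enabling the BOS token with the base generator (names follow exllamav2's example scripts around 0.1.x; treat the exact signatures, in particular add_bos, as assumptions to verify against the repo's examples):

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config("MiniCPM-2B-128k-exl2")   # hypothetical local model dir
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)
tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)
settings = ExLlamaV2Sampler.Settings()

# The important part: prepend the BOS token, otherwise the model degenerates
# into repeated tokens as in the first sample above.
print(generator.generate_simple("Hello, my name is", settings, 150, add_bos=True))
```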

However, this should be the same behavior regardless of quantization. Possibly your quants are somehow broken, even though the command lines for converting look correct, so I'm not sure what gives. Try comparing to one of the models I uploaded, or maybe elaborate on how you're getting the broken output (I've only done some limited testing).

meigami0 commented 3 months ago

Thanks for the feedback. I re-prepared the model in safetensors format and tried quantizing it using the measurement.json you uploaded, and this time the result was normal. Next, I regenerated the measurement.json locally and quantized again, and the result was still normal. So the issue was likely with the previously converted safetensors model, even though it could be loaded and used for inference normally. In any case, thank you for your patient guidance!