turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

[BUG] Quantization of Qwen return garbage #621

Open fahadh4ilyas opened 1 month ago

fahadh4ilyas commented 1 month ago

OS

Linux

GPU Library

CUDA 12.x

Python version

3.10

Pytorch version

2.4.0

Model

No response

Describe the bug

I quantized my own Qwen 7B model, and the returned token is always 60021. Here is the config file of my Qwen model:

{
  "_name_or_path": "models/qwen2-7B",
  "architectures": [
    "Qwen2ForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "eos_token_id": 151645,
  "hidden_act": "silu",
  "hidden_size": 3584,
  "initializer_range": 0.02,
  "intermediate_size": 18944,
  "max_position_embeddings": 131072,
  "max_window_layers": 28,
  "model_type": "qwen2",
  "num_attention_heads": 28,
  "num_hidden_layers": 28,
  "num_key_value_heads": 4,
  "rms_norm_eps": 1e-06,
  "rope_theta": 1000000.0,
  "sliding_window": 131072,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.41.2",
  "use_cache": true,
  "use_sliding_window": false,
  "vocab_size": 152064
}

Reproduction steps

Here are the parameters I used for quantization:

python convert.py -i models/myQwen-7B-HF -o models/myQwen-7B-EXL2/ -b 8 -hb 8 -l 16384 -ml 16384

Expected behavior

Generation after quantization works correctly.

Logs

No response

Additional context

No response

Acknowledgements

fahadh4ilyas commented 1 month ago

Additional note: after I tried quantizing the Qwen model released by the developers (Qwen/Qwen2-7B-Instruct), the result was also garbage. So the problem is not with my model.

turboderp commented 1 month ago

I've been unable to reproduce this with v0.2.1 and the original Qwen2 model. It quantizes and runs inference correctly here.

Are you using Q4 cache by any chance? Qwen2-7B specifically fails with Q4 cache, but Q6 and Q8 should be fine:

| Model    | Quant | Cache | pass@1 | pass@10 | Wikitext 5x1k |
|----------|-------|-------|--------|---------|---------------|
| Qwen2-7B | FP16  | Q4    | 19.74% | 46.34%  | 40.72         |
| Qwen2-7B | FP16  | Q6    | 61.65% | 81.70%  | 15.20         |
| Qwen2-7B | FP16  | Q8    | 62.37% | 81.09%  | 15.18         |
| Qwen2-7B | FP16  | FP16  | 61.16% | 82.31%  | 15.16         |
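
For reference, a minimal sketch of running the quantized model with a Q8 cache instead of Q4, loosely following exllamav2's dynamic-generator example. The class names (ExLlamaV2Cache_Q8, ExLlamaV2DynamicGenerator) and keyword arguments should be checked against the installed version; the model path is the output directory from this issue.

```python
# Hedged sketch: load the EXL2 quant with a Q8 cache instead of Q4.
# Class names follow recent exllamav2 releases; verify against your installed version.

from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache_Q8, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator

config = ExLlamaV2Config("models/myQwen-7B-EXL2")  # output directory from this issue
model = ExLlamaV2(config)

# Q8 cache: per the table above, Q6/Q8 (and FP16) hold up for Qwen2-7B, unlike Q4.
cache = ExLlamaV2Cache_Q8(model, max_seq_len=16384, lazy=True)
model.load_autosplit(cache)

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2DynamicGenerator(model=model, cache=cache, tokenizer=tokenizer)

print(generator.generate(prompt="Hello, my name is", max_new_tokens=32))
```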
fahadh4ilyas commented 1 month ago

Could you test it using my exllamav2_hf wrapper from #606? I keep getting garbage answers when running inference with it on this model, but when inferencing another model it works just fine.

turboderp commented 1 month ago

I looked into it and managed to reproduce the problem with the HF wrapper code.

It seems the issue is with attention. Since you're supplying a mask, that disables both the Flash Attention and SDPA code paths, as they only support causal attention. ExLlama then falls back on matmul attention, which runs in half precision. That isn't an issue for most models, but Qwen2-7B specifically gets occasional overflows on some layers, and then inference breaks. I think it's down to the unusual normalization of the keys/queries, which is also related to why this model doesn't like the Q4 cache mode.

Regardless, SDPA can take an arbitrary mask; Torch just won't use its efficient kernels internally, but it should still avoid the overflows. I've enabled that in the latest commit on the dev branch, and it seems to be working with your wrapper.
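
As an illustration of that code path (not exllamav2's internal implementation), here is a small standalone sketch of SDPA with an arbitrary boolean mask. The head count and head size loosely match Qwen2-7B's config above; all tensors and mask values are made up.

```python
# Standalone sketch (not exllamav2 internals): SDPA with an arbitrary boolean mask.
# Head count/size loosely match Qwen2-7B (28 heads of 128); all values are made up.
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

bsz, n_heads, q_len, kv_len, head_dim = 2, 28, 8, 128, 128
q = torch.randn(bsz, n_heads, q_len, head_dim, dtype=dtype, device=device)
k = torch.randn(bsz, n_heads, kv_len, head_dim, dtype=dtype, device=device)
v = torch.randn(bsz, n_heads, kv_len, head_dim, dtype=dtype, device=device)

# Boolean mask: True = attend, False = masked out (e.g. left padding in a batch).
attn_mask = torch.ones(bsz, 1, q_len, kv_len, dtype=torch.bool, device=device)
attn_mask[0, :, :, :16] = False  # pretend the first row has 16 padded positions

# With an explicit non-causal mask Torch may not pick the flash backend, but the
# softmax is still handled internally instead of via a manual fp16 matmul path.
out = F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)
print(out.shape)  # torch.Size([2, 28, 8, 128])
```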

fahadh4ilyas commented 1 month ago

Wait, do you mean that if I set the input_mask parameter, flash attention won't be used? Then how can I generate a batch of texts without input_mask?

turboderp commented 1 month ago

Yes, flash-attn doesn't support input masks. And if you want to supply a rectangular input IDs tensor whose rows encode sequences of different lengths, padding with a mask is the only way to do it. Otherwise you'd have to start with the shortest input, generate at a batch size of 1 until it reaches the length of the 2nd shortest input, then continue at bsz 2, and so on. That way the input is always rectangular and you never have to mask out padding, but it's very inefficient.
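
To make the padding/masking relationship concrete, here is a hedged sketch in plain PyTorch (not the exllamav2 generator API) of building a rectangular, left-padded batch and the boolean mask that goes with it; the token IDs and pad ID are placeholders.

```python
# Hedged sketch in plain PyTorch (not the exllamav2 generator API): build a
# rectangular, left-padded batch of input IDs plus the matching boolean mask.
import torch

pad_id = 0  # placeholder pad token id for the example
prompts = [
    [101, 2054, 2003],             # 3 tokens
    [101, 2129, 2024, 2017, 102],  # 5 tokens
]
max_len = max(len(p) for p in prompts)

# Left-pad so the most recent token of every row sits at the same (last) position.
input_ids = torch.full((len(prompts), max_len), pad_id, dtype=torch.long)
attention_mask = torch.zeros((len(prompts), max_len), dtype=torch.bool)
for i, p in enumerate(prompts):
    input_ids[i, max_len - len(p):] = torch.tensor(p)
    attention_mask[i, max_len - len(p):] = True

# The mask is what tells attention to ignore the pad positions; without it, the
# only alternative is the grow-the-batch-row-by-row scheme described above.
print(input_ids)
print(attention_mask)
```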

flash-attn does have a "varlen" mode, but it's not efficient either since it requires the cache to be contiguous, so you have to constantly rebuild it (copy the whole thing in VRAM) to make space for new keys/values for every token generated.
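
For completeness, a rough sketch of the varlen layout being described, assuming flash-attn is installed and a CUDA GPU is available: sequences are packed back to back into one contiguous tensor with cu_seqlens marking the boundaries, which is why the cache has to stay contiguous in this mode.

```python
# Rough sketch of flash-attn's "varlen" layout, assuming flash-attn is installed and
# a CUDA GPU is available. Sequences are packed back to back into one contiguous
# tensor, with cu_seqlens marking the boundaries.
import torch
from flash_attn import flash_attn_varlen_func

n_heads, head_dim = 28, 128        # loosely matching Qwen2-7B's attention shape
seq_lens = [3, 5, 2]               # three sequences of different lengths
total = sum(seq_lens)

# Cumulative sequence lengths: [0, 3, 8, 10]
cu = [0]
for length in seq_lens:
    cu.append(cu[-1] + length)
cu_seqlens = torch.tensor(cu, dtype=torch.int32, device="cuda")

q = torch.randn(total, n_heads, head_dim, dtype=torch.float16, device="cuda")
k = torch.randn(total, n_heads, head_dim, dtype=torch.float16, device="cuda")
v = torch.randn(total, n_heads, head_dim, dtype=torch.float16, device="cuda")

out = flash_attn_varlen_func(
    q, k, v,
    cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
    max_seqlen_q=max(seq_lens), max_seqlen_k=max(seq_lens),
    causal=True,
)
print(out.shape)  # (total_tokens, n_heads, head_dim)
```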

The alternative is to use paged attention with a flat cache. This however is only compatible with flash-attn.

The SDPA approach at least allows for Torch to switch to a more efficient backend at some later time if flash-attn ever supports masking. There has been some work on this with https://github.com/Dao-AILab/flash-attention/pull/617, but it's not finished yet, apparently.

Thireus commented 1 month ago

> I've been unable to reproduce this with v0.2.1 and the original Qwen2 model. It quantizes and runs inference correctly here.
>
> Are you using Q4 cache by any chance? Qwen2-7B specifically fails with Q4 cache, but Q6 and Q8 should be fine: […]

@turboderp, would you have similar metrics for other models?

DocShotgun commented 1 month ago

> I've been unable to reproduce this with v0.2.1 and the original Qwen2 model. It quantizes and runs inference correctly here. Are you using Q4 cache by any chance? Qwen2-7B specifically fails with Q4 cache, but Q6 and Q8 should be fine: […]
>
> @turboderp, would you have similar metrics for other models?

https://github.com/turboderp/exllamav2/blob/master/doc/qcache_eval.md

Downtown-Case commented 1 month ago

@Thireus I tested the 2024 Command-R here:

https://old.reddit.com/r/LocalLLaMA/comments/1f6ijye/commandr_35b_q4q6q8_cache_perplexity_mmlu/

And that's a model with extremely "compressed" attention, where 110K of Q4 context only takes around 4 GB. I think Qwen2 was just an extreme outlier, and the new Qwen 2.5 doesn't behave like that.