turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

There is a problem when quantizing Qwen2 #509

Closed 1Q18LAqakl closed 2 months ago

1Q18LAqakl commented 3 months ago

There is a problem when quantizing Qwen2: even with 80 GB of video memory, it still reports a memory overflow. Please tell me what to do. Is there a problem with the code, or does Qwen really need that much memory? If so, can you tell me how much video memory is needed to quantize Qwen2 normally? (Sorry, I don't speak English well, so I'm using machine translation.)

turboderp commented 3 months ago

It doesn't need a lot of video memory, no. Qwen2-72B shouldn't require more than 24 GB of VRAM.

It needs some amount of system memory, though. Also not a huge amount, but it could be more than 32 GB.

Could you clarify which model you're trying to convert and a few more details about the hardware?

1Q18LAqakl commented 3 months ago

I'm sorry, what I meant to say was system memory, but I didn't know how to express it. I tested it: quantizing Qwen2 to EXL2 needs 100 GB+ of system memory. Is it normal for it to use that much?

1Q18LAqakl commented 3 months ago

What I want to convert is Qwen_Qwen2-72B-Instruct, to 6.0 bpw.

turboderp commented 3 months ago

100 GB+ is not normal, no. Is this on Windows? If so, it might be a safetensors issue. Or maybe you're using an extremely large amount of calibration data?

There's an EXL2 version here, by the way.

1Q18LAqakl commented 3 months ago

I used 999 rows of calibration data, on Windows; that's probably why it needs so much memory. Thank you for your help. I would also like to ask: does the calibration data have a corrective effect on the model's ethical review?

turboderp commented 3 months ago

The default, built-in dataset has 115 rows for quantization and 19 for measurement. So 999 is a very large dataset.

The calibration dataset does not directly affect the alignment of the model. It's not a finetuning method. It provides reference points to determine which features are more important (and thus need higher precision) and to reveal redundancies.

Overall I don't recommend using a custom calibration set unless you're experimenting with quantization on a technical level, for instance if you're trying to push towards the 2bpw limit in a very narrow domain.
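For reference, a typical conversion run that just uses the built-in calibration set looks something like this (a sketch; the paths are placeholders, and the exact flags should be double-checked against `convert.py -h`):

```sh
# Quantize to 6.0 bpw with the built-in calibration set
# (115 quantization rows + 19 measurement rows by default).
python convert.py -i /path/to/Qwen2-72B-Instruct -o /path/to/work_dir \
    -cf /path/to/Qwen2-72B-Instruct-6.0bpw-exl2 -b 6.0
```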

1Q18LAqakl commented 3 months ago

Well, I'm just a rookie. I wanted to circumvent censorship via the calibration dataset, and I thought the more data there was, the greater the chance of circumventing the model's censorship. It seems I failed. Anyway, thank you for your help.

turboderp commented 3 months ago

The best approach to decensoring right now is probably orthogonalization (the idea that refusal is mostly represented as a single direction in the model's latent space, which can be suppressed). There's a popular version of Llama3-70B (and EXL2 6.0bpw here) with the suppression baked in, but I haven't found anything so far for Qwen2-72B. I would be on the lookout for that.

I'm also hoping to add something similar to ExLlamaV2 that would apply to all models at inference time, but I'm not sure about the feasibility yet. Soon. Maybe.
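For anyone curious, the core idea is simple: estimate a "refusal" direction in the residual stream and project it out of the hidden states, i.e. h' = h − (h·d̂)d̂. A minimal PyTorch sketch (hypothetical helper, not ExLlamaV2 API; estimating the direction itself is the hard part):

```python
import torch

def suppress_direction(hidden: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Remove the component of each hidden state along `direction`.

    hidden:    (..., d_model) activations from a transformer layer
    direction: (d_model,) estimated refusal direction (hypothetical input)
    """
    d = direction / direction.norm()       # unit vector d-hat
    proj = (hidden @ d).unsqueeze(-1) * d  # component of hidden along d-hat
    return hidden - proj                   # project onto the orthogonal complement
```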

1Q18LAqakl commented 3 months ago

Well, thank you, brother, I wish you a happy life!