Closed 1Q18LAqakl closed 2 months ago
It doesn't need a lot of video memory, no. Qwen2-72B shouldn't require more than 24 GB of VRAM.
It needs some amount of system memory, though. Also not a huge amount but it could be more than 32 GB.
Could you clarify which model you're trying to convert and a few more details about the hardware?
I'm sorry, what I meant was system memory; I wasn't sure how to express it. In my test, quantizing Qwen2 to EXL2 needed 100 GB+ of system memory. Is it normal for it to use that much?
The model I want to convert is Qwen_Qwen2-72B-Instruct, to 6.0bpw.
100 GB+ is not normal, no. Is this on Windows? It might be a safetensors issue, if that's the case. Or maybe you're using an extremely large amount of calibration data?
There's an EXL2 version here, by the way.
I used 999 rows of calibration data, on Windows, so that's probably why it needed so much memory. Thank you for your help. I'd also like to ask: does the calibration data have any corrective effect on the model's ethical review (censorship)?
The default, builtin dataset has 115 rows for quantization and 19 for measurement. So 999 is a very large dataset.
The calibration dataset does not directly affect the alignment of the model. It's not a finetuning method. It provides reference points to determine which features are more important (and thus need higher precision) and to reveal redundancies.
Overall I don't recommend using a custom calibration set unless you're experimenting with quantization on a technical level, for instance if you're trying to push towards the 2bpw limit in a very narrow domain.
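To make the "reference points" idea concrete, here's a toy numpy sketch (not exllamav2's actual algorithm, just an illustration): per-channel activation statistics from calibration inputs decide which input channels of a weight matrix get a finer quantization step.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy layer: 8 input channels -> 4 outputs, plus a small calibration batch.
W = rng.normal(size=(8, 4))
calib = rng.normal(size=(115, 8))   # 115 rows, like the builtin dataset
calib[:, 0] *= 10.0                 # channel 0 carries much larger activations

# Importance proxy: mean squared activation per input channel.
importance = (calib ** 2).mean(axis=0)

# Give the most important half of the channels a finer quantization step.
order = np.argsort(importance)[::-1]
step = np.full(8, 0.5)              # coarse step by default
step[order[:4]] = 0.05              # fine step for important channels

# Quantize each row of W with its channel's step, measure output error.
W_q = np.round(W / step[:, None]) * step[:, None]
err_adaptive = np.mean((calib @ W - calib @ W_q) ** 2)

# Compare against a single middling uniform step for every channel.
W_u = np.round(W / 0.275) * 0.275
err_uniform = np.mean((calib @ W - calib @ W_u) ** 2)

print(err_adaptive < err_uniform)   # adaptive allocation wins on this data
```

The point is just that the calibration set steers *where* precision goes; it never changes what the weights say, which is why it isn't a finetuning or alignment method.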
Well, I'm just a rookie. I was hoping to circumvent the censorship through the calibration dataset, and I assumed that the more data there was, the better the chance of bypassing the model's censorship. It seems I was wrong. Anyway, thank you for your help.
The best approach to decensoring right now is probably orthogonalization (the idea that refusal is mostly represented as a single direction in the model's latent space, which can be suppressed). There's a popular version of Llama3-70B (and EXL2 6.0bpw here) with the suppression baked in, but I haven't found anything so far for Qwen2-72B. I would be on the lookout for that.
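The core of that orthogonalization trick is plain linear algebra: given a unit "refusal direction" r, every weight matrix that writes into the residual stream has its component along r projected out, so the model can no longer move hidden states in that direction. A minimal numpy sketch (r is random here; in practice it's extracted by contrasting activations on refused vs. answered prompts):

```python
import numpy as np

rng = np.random.default_rng(1)

d_model = 16
r = rng.normal(size=d_model)
r /= np.linalg.norm(r)              # unit "refusal direction"

# A weight matrix whose output is added to the residual stream.
W = rng.normal(size=(d_model, d_model))

# Orthogonalize: W_abl = W - r (r^T W), removing the output component along r.
W_abl = W - np.outer(r, r @ W)

# Any input now produces an output orthogonal to r (up to float error).
x = rng.normal(size=d_model)
y = W_abl @ x
print(abs(np.dot(y, r)))
```

Because the projection is baked into the weights, the edited model runs at full speed with no extra inference-time cost.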
I'm also hoping to add something similar to ExLlamaV2 that would apply to all models at inference time, but I'm not sure about the feasibility yet. Soon. Maybe.
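An inference-time variant of the same idea (a hypothetical sketch, not an existing ExLlamaV2 feature) would instead project the refusal direction out of the hidden states as each layer produces them:

```python
import numpy as np

def suppress_direction(hidden, r):
    """Remove the component of each hidden-state row along direction r."""
    r = r / np.linalg.norm(r)           # normalize to a unit vector
    return hidden - np.outer(hidden @ r, r)

rng = np.random.default_rng(2)
r = rng.normal(size=16)                 # stand-in refusal direction
hidden = rng.normal(size=(4, 16))       # batch of 4 hidden states

out = suppress_direction(hidden, r)
print(np.allclose(out @ (r / np.linalg.norm(r)), 0.0))  # True
```

The appeal is that it would work on any loaded model without re-quantizing, at the cost of one small projection per layer per forward pass.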
Well, thank you, brother, I wish you a happy life!
There is a problem when quantizing Qwen2: 80 GB of video memory is not enough, and it still reports a memory overflow. Please tell me what to do. Is there a problem with the code, or does Qwen really need that much memory? If so, can you tell me how much video memory is needed to quantize Qwen2 normally? (Sorry, I don't speak English well, so I'm using machine translation.)