sgsdxzy opened this issue 2 months ago
Are the quants limited to 4-bit only? Do they get smaller compared to quantizing the regular model? I know the paper said they lose some PPL. If so, this might be helpful for all the huge models being released now.
No, you can use any quant (AutoGPTQ, AutoAWQ, exl2) at any bpw for the weights. After the rotation the model's weights keep their original shape, but their distribution should be smoother, with fewer outliers. The problem is that PPL doesn't seem to improve for exl2 quants; it does improve a bit for AutoGPTQ.
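As a rough illustration of why this works (a minimal sketch, not the AutoQuarot API; the dimensions, the planted outlier, and the use of `scipy.linalg.hadamard` are all assumptions for the demo): multiplying a weight matrix by an orthogonal Hadamard matrix keeps its shape, and since the inverse rotation can be fused into the neighboring layer the network's output is unchanged, while outlier channels get spread across all channels:

```python
import torch
from scipy.linalg import hadamard

d = 512                                  # hidden size; a power of two for hadamard()
W = torch.randn(d, d)
W[:, 0] *= 50.0                          # plant one outlier channel

H = torch.tensor(hadamard(d), dtype=torch.float32) / d ** 0.5  # orthogonal rotation
W_rot = W @ H                            # rotated weights, same shape as W

print(W_rot.shape == W.shape)                           # True: shape unchanged
print((W.abs().max() / W.abs().mean()).item())          # large outlier ratio
print((W_rot.abs().max() / W_rot.abs().mean()).item())  # noticeably smaller
```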
QuaRot on the weights does not consistently improve PPL for exl2 quants. QuaRot on the KV cache does improve PPL for fp8/q4 KV cache.
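For the KV cache the same trick can be applied online: quantize keys/values in a rotated basis and rotate back when reading the cache. A minimal sketch under assumed shapes, with a toy per-token q4 quantizer (this is not the fused-kernel implementation mentioned in the next comment):

```python
import torch
from scipy.linalg import hadamard

head_dim = 128                           # assumed power-of-two head size
H = torch.tensor(hadamard(head_dim), dtype=torch.float32) / head_dim ** 0.5

def q4_quantize(x):
    # toy symmetric per-token int4 quantization, for illustration only
    scale = x.abs().amax(dim=-1, keepdim=True) / 7.0
    return torch.round(x / scale).clamp_(-8, 7), scale

def q4_dequantize(q, scale):
    return q * scale

k = torch.randn(4, 8, head_dim)          # (tokens, heads, head_dim)
k[..., 0] *= 20.0                        # plant an outlier channel

q, scale = q4_quantize(k @ H)            # quantize in the rotated basis
k_rec = q4_dequantize(q, scale) @ H.T    # rotate back on the way out

q0, s0 = q4_quantize(k)                  # baseline: quantize directly
print((k - q4_dequantize(q0, s0)).abs().mean())  # direct q4 error
print((k - k_rec).abs().mean())                  # rotated q4 error (lower)
```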
Funnily enough, I was just working on that, though I fused it into the real-time quantization kernels.
How to enable QuaRot for weight quantization:
```bash
pip install git+https://github.com/sgsdxzy/AutoQuarot.git
```
```python
import torch
import transformers
import auto_quarot

model_path = "path/to/original/model"            # set to your model
rotated_model_path = "path/to/rotated/model"     # where to save the result

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16
)
qrmodel = auto_quarot.AutoQuarotForCausalLM.from_transformers(model)
qrmodel.fuse_layer_norms()                       # fold norm scales so the rotation commutes
qrmodel.rotate_model("hadamard", device=0)       # apply the Hadamard rotation on GPU 0
qrmodel.model.save_pretrained(rotated_model_path)
```
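The rotated checkpoint is saved as a regular Transformers model, so it can be quantized afterwards with the usual tools. A sketch with AutoGPTQ (paths and the single calibration sample are placeholders; use a real calibration set in practice):

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

rotated_model_path = "path/to/rotated/model"      # placeholder

tokenizer = AutoTokenizer.from_pretrained(rotated_model_path)
examples = [tokenizer("The quick brown fox jumps over the lazy dog.")]

quantize_config = BaseQuantizeConfig(bits=4, group_size=128)
model = AutoGPTQForCausalLM.from_pretrained(rotated_model_path, quantize_config)
model.quantize(examples)                          # run GPTQ calibration
model.save_quantized("path/to/quantized/model")
```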