turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

Simple QuaRot proof of concept. #407

Open sgsdxzy opened 2 months ago

sgsdxzy commented 2 months ago

How to enable QuaRot for weight quantization:

  1. Install AutoQuarot: pip install git+https://github.com/sgsdxzy/AutoQuarot.git
  2. Convert the fp16 model to QuaRot-rotated weights:

         import torch
         import transformers
         import auto_quarot

         # Load the original fp16 model, fuse the layer norms, rotate the weights with a
         # Hadamard transform, and save the rotated (still fp16) model for later quantization.
         model = transformers.AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16)
         qrmodel = auto_quarot.AutoQuarotForForCausalLM.from_transformers(model)
         qrmodel.fuse_layer_norms()
         qrmodel.rotate_model("hadamard", device=0)
         qrmodel.model.save_pretrained(rotated_model_path)

  3. Use exllamav2 to quantize/run the rotated model; it will be recognized automatically. A sketch of loading the result is shown below.
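For step 3, here is a minimal sketch of loading and generating from the rotated model after it has been quantized to EXL2 (e.g. with exllamav2's convert.py). The path variable and sampling settings are placeholders, not part of the original instructions:

    from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
    from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

    config = ExLlamaV2Config()
    config.model_dir = rotated_exl2_path        # placeholder: directory holding the quantized rotated model
    config.prepare()

    model = ExLlamaV2(config)
    cache = ExLlamaV2Cache(model, lazy=True)    # cache is allocated during autosplit loading
    model.load_autosplit(cache)
    tokenizer = ExLlamaV2Tokenizer(config)

    generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)
    settings = ExLlamaV2Sampler.Settings()
    settings.temperature = 0.8

    print(generator.generate_simple("QuaRot-rotated models load like any other:", settings, 64))
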
Ph0rk0z commented 2 months ago

Are the quants limited to 4-bit only? Do they get smaller compared to quantizing the regular model? I know the paper said they lose some PPL. If so, this might be helpful for all the huge models being released now.

sgsdxzy commented 2 months ago

No, you can use any quant (AutoGPTQ, AutoAWQ, exl2) with any bpw for the weights. After the rotation the model's weights keep their original shape, but they should be smoother, with fewer outliers. The problem is that the ppl doesn't seem to improve for exl2 quants; it does improve a bit for AutoGPTQ.
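
As a toy illustration (not AutoQuarot's actual implementation) of why the shapes stay the same: rotating a weight matrix by an orthogonal Hadamard matrix, while rotating the incoming activations the same way, leaves the layer's output unchanged but redistributes outliers across channels:

    import torch

    torch.manual_seed(0)
    d = 8                                    # hidden size (power of two, so a Hadamard matrix exists)

    # Build a normalized Hadamard matrix H with H @ H.T == I (Sylvester construction)
    H = torch.tensor([[1.0]])
    while H.shape[0] < d:
        H = torch.cat([torch.cat([H, H], dim=1), torch.cat([H, -H], dim=1)], dim=0)
    H /= d ** 0.5

    W = torch.randn(16, d)                   # a linear layer's weight, shape (out_features, in_features)
    x = torch.randn(4, d)                    # a batch of activations

    W_rot = W @ H                            # rotated weight: same shape as W
    x_rot = x @ H                            # activations rotated by the same matrix

    y = x @ W.T                              # original layer output
    y_rot = x_rot @ W_rot.T                  # x H (W H)^T = x H H^T W^T = x W^T
    print(torch.allclose(y, y_rot, atol=1e-5))   # True: same function, same shapes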

sgsdxzy commented 2 months ago

QuaRot of weights does not consistently improve ppl for exl2 quants. QuaRot of kv cache improves ppl for fp8/q4 kv cache.
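
For reference, a minimal sketch of how the fp8 and q4 cache modes are selected through exllamav2's Python API, assuming a recent exllamav2 version that ships the Q4 cache and that `model` is an ExLlamaV2 instance built from a config as in the earlier example (the rotated-cache variant itself is not shown here):

    from exllamav2 import ExLlamaV2Cache, ExLlamaV2Cache_8bit, ExLlamaV2Cache_Q4

    # Pick one cache variant and pass it to load_autosplit:
    cache = ExLlamaV2Cache_Q4(model, lazy=True)      # 4-bit quantized kv cache
    # cache = ExLlamaV2Cache_8bit(model, lazy=True)  # fp8 kv cache
    # cache = ExLlamaV2Cache(model, lazy=True)       # default fp16 kv cache

    model.load_autosplit(cache)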

turboderp commented 2 months ago

Funnily enough, I was just working on that. I fused it into the realtime quantization kernels, though.