ilur98 / DGQ

Official Code For Dual Grained Quantization: Efficient Fine-Grained Quantization for LLM
MIT License

How to properly run W8A8? #2

Open casper-hansen opened 8 months ago

casper-hansen commented 8 months ago

Hi @ilur98, thanks for your great work on this repository. I am attempting to modify your work to support W8A8, as I found that static W4A8 gives too large a quantization error.

I am running into some trouble with the modification. The quantization works and the model is saved. However, when I try to load it, the QuantLinear shapes do not match. Do you have any idea how to fix this so that it can run W8A8 in inference mode (--inference_mod)?

# generate quantized model
python -m dgq.entry \
    TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T \
    wikitext2 \
    --wt_fun search \
    --act_fun static \
    --groupsize 128 \
    --wbits 8 \
    --kvquant \
    --save_safetensors model_w4w8.safetensors \
    --nsamples 32 \
    --smoothquant

# evaluate quantized model
python -m dgq.entry \
    TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T \
    wikitext2 \
    --wt_fun search \
    --act_fun static \
    --groupsize 128 \
    --wbits 8 \
    --kvquant \
    --load model_w4w8.safetensors \
    --eval \
    # --inference_mod
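
For reference, here is a quick way to dump the stored qweight shapes from the saved checkpoint, since that is what seems to mismatch on load. This is only a sketch: it assumes the packed weights are saved under keys containing "qweight", and uses the file name from the commands above.

# inspect_qweights.py -- print the shape of every qweight tensor in the checkpoint
from safetensors import safe_open

with safe_open("model_w4w8.safetensors", framework="pt") as f:
    for key in f.keys():
        if "qweight" in key:
            print(key, tuple(f.get_tensor(key).shape))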
ilur98 commented 8 months ago

Could you give me the shape of the qweight tensors in your model? That would help me judge where the problem is. For INT4 weights, I need to pack two INT4 values together into one INT8. Maybe this packing is not suitable for the W8 case.
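
To illustrate the packing (just a rough sketch, not the exact code in the QuantLinear kernels): two INT4 values share one INT8, so a packed W4 qweight has half as many columns as the unpacked matrix, while a W8 qweight needs no packing and keeps the full shape. If the loading code always assumes the packed (halved) shape, an 8-bit checkpoint will not fit.

import torch

def pack_int4(w_q: torch.Tensor) -> torch.Tensor:
    # Pack pairs of 4-bit values along the last dim into one byte each,
    # halving the last dimension. 8-bit weights would skip this step entirely.
    assert w_q.shape[-1] % 2 == 0
    lo = w_q[..., 0::2] & 0x0F          # even columns -> low nibble
    hi = (w_q[..., 1::2] & 0x0F) << 4   # odd columns  -> high nibble
    return lo | hi                      # uint8, two INT4 values per byte

w4 = torch.randint(0, 16, (4096, 4096), dtype=torch.uint8)
print(pack_int4(w4).shape)   # torch.Size([4096, 2048]) -- packed W4
# A W8 qweight for the same layer would stay torch.Size([4096, 4096]).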