ilur98 / DGQ

Official Code For Dual Grained Quantization: Efficient Fine-Grained Quantization for LLM
MIT License

How to properly run W8A8? #2

Open casper-hansen opened 8 months ago

casper-hansen commented 8 months ago

Hi @ilur98, thanks for your great work on this repository. I am attempting to modify your work to support W8A8, as I found that static W4A8 gives too large a quantization error.

I am running into some trouble with the modification. The quantization works and the model is saved. However, when I try to load it, the QuantLinear shapes do not match. Do you have any idea how to fix this so that it can run W8A8 in inference mode (--inference_mod)?

# generate quantized model
python -m dgq.entry \
    TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T \
    wikitext2 \
    --wt_fun search \
    --act_fun static \
    --groupsize 128 \
    --wbits 8 \
    --kvquant \
    --save_safetensors model_w4w8.safetensors \
    --nsamples 32 \
    --smoothquant

# evaluate quantized model
python -m dgq.entry \
    TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T \
    wikitext2 \
    --wt_fun search \
    --act_fun static \
    --groupsize 128 \
    --wbits 8 \
    --kvquant \
    --load model_w4w8.safetensors \
    --eval \
    # --inference_mod
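
For reference, here is a quick way to dump the stored qweight shapes from the saved checkpoint, since that is what seems to mismatch on load. This is only a sketch: it assumes the packed weights are saved under keys containing "qweight", and uses the file name from the commands above.

# inspect_qweights.py -- print the shape of every qweight tensor in the checkpoint
from safetensors import safe_open

with safe_open("model_w4w8.safetensors", framework="pt") as f:
    for key in f.keys():
        if "qweight" in key:
            print(key, tuple(f.get_tensor(key).shape))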
ilur98 commented 8 months ago

Could you give me the shape of the qweight tensors in your model? That would help me judge where the problem is. For INT4 weights, I need to pack two INT4 values together into one INT8. Maybe this packing is not suitable for the W8 case.
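
To illustrate the packing (just a rough sketch, not the exact code in the QuantLinear kernels): two INT4 values share one INT8, so a packed W4 qweight has half as many columns as the unpacked matrix, while a W8 qweight needs no packing and keeps the full shape. If the loading code always assumes the packed (halved) shape, an 8-bit checkpoint will not fit.

import torch

def pack_int4(w_q: torch.Tensor) -> torch.Tensor:
    # Pack pairs of 4-bit values along the last dim into one byte each,
    # halving the last dimension. 8-bit weights would skip this step entirely.
    assert w_q.shape[-1] % 2 == 0
    lo = w_q[..., 0::2] & 0x0F          # even columns -> low nibble
    hi = (w_q[..., 1::2] & 0x0F) << 4   # odd columns  -> high nibble
    return lo | hi                      # uint8, two INT4 values per byte

w4 = torch.randint(0, 16, (4096, 4096), dtype=torch.uint8)
print(pack_int4(w4).shape)   # torch.Size([4096, 2048]) -- packed W4
# A W8 qweight for the same layer would stay torch.Size([4096, 4096]).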