Closed: cyLi-Tiger closed this issue 3 months ago
Hi @cyLi-Tiger, if you would like to evaluate the accuracy alone, you can just use lmquant to evaluate the fake-quant model. To evaluate wikitext-2 perplexity, you can use the command
python -m lmquant.llm.run configs/llm.yaml configs/qoq/gchn.yaml --model-name path_to_my_checkpoints --smooth-xw-alpha 0.3 --smooth-xw-beta 0.7
To evaluate zero-shot accuracy, you can use the command
python -m lmquant.llm.run configs/llm.yaml configs/qoq/gchn.yaml --model-name path_to_my_checkpoints --smooth-xw-alpha 0.3 --smooth-xw-beta 0.7 --eval-evaluator lm_eval --eval-tasks zero-shot
which will automatically add wikitext, hellaswag, piqa, winogrande, arc_easy, and arc_challenge to the evaluation tasks using lm_eval.
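For context, the wikitext-2 perplexity that lmquant reports is the standard metric: the exponential of the mean per-token negative log-likelihood. A minimal sketch of the metric itself (the NLL values here are toy numbers, not real model outputs):

```python
import math

def perplexity(nlls, n_tokens):
    # Perplexity is exp of the average per-token negative log-likelihood (in nats).
    return math.exp(sum(nlls) / n_tokens)

# Toy example: 4 tokens with hypothetical per-token NLLs.
nlls = [2.0, 1.5, 2.5, 2.0]
print(perplexity(nlls, len(nlls)))  # exp(2.0) ≈ 7.389
```

Lower perplexity means the (fake-)quantized model assigns higher probability to the held-out text, so comparing this number between the bf16 and W4A8 models isolates the accuracy impact of quantization.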
Thanks for your prompt reply! @synxlin
Taking PPL evaluation in QoQ as an example (here), the KV cache is unused, right? The scripts you provided above can't evaluate accuracy under KV4. Please correct me if I've missed something in your code, and thank you again!
Another question: per-group W4A8 with progressive quantization has scales mapping int4 back to int8. What if I want to use per-channel W4A8? How do I dequantize the weight from int4 to int8 before the GEMM, since we convert the weight from bf16 to int4 directly and don't have such a scale between int4 and int8?
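To make the question concrete, here is a toy sketch of the two-level (progressive) scheme, with hypothetical values and simple round-to-nearest; lmquant's actual rounding and clipping may differ:

```python
def quantize(x, scale, qmin, qmax):
    # Round-to-nearest with clipping to the target integer range.
    q = round(x / scale)
    return max(qmin, min(qmax, q))

# Stage 1: bf16 weight -> int8 using a floating-point per-channel scale s1.
w, s1 = 0.93, 0.01
q8 = quantize(w, s1, -128, 127)   # 93
# Stage 2: int8 -> int4 using an integer per-group scale s2.
s2 = 16
q4 = quantize(q8, s2, -8, 7)      # round(93 / 16) = 6
# Before the int8 GEMM, the int4 value is dequantized back to the int8 domain:
w8 = q4 * s2                      # 96
```

In per-channel W4A8 there is no stage 2: the weight goes from bf16 to int4 with a single floating-point scale, which is exactly why the int4-to-int8 scale the question asks about does not exist in that configuration.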
Hi @cyLi-Tiger.
We modify the apply_rotary_pos_emb function in the Attention forward here and register the quantization hook to quantize the KV cache here. In the Huggingface transformers package, the KV cache will directly append/concat the results of apply_rotary_pos_emb for further computation (see here and here). For the per-channel case, the scale between int4 and int8 is simply torch.ones (identity).
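The idea of the hook is to fake-quantize K and V right after RoPE, before they enter the cache. A library-free sketch of symmetric per-token int4 fake quantization (an assumed scheme for illustration; QoQ's actual KV4 quantizer granularity, asymmetry, and group size may differ):

```python
def fake_quant_int4(vec):
    # Symmetric int4 fake quantization of one token's K or V vector:
    # scale maps max |v| to 7, then quantize and immediately dequantize.
    amax = max(abs(v) for v in vec) or 1.0
    scale = amax / 7.0
    q = [max(-8, min(7, round(v / scale))) for v in vec]
    return [qi * scale for qi in q]

# Toy key vector; the dequantized result stays close to the original.
k = [0.7, -0.33, 0.12, 0.0]
k_q = fake_quant_int4(k)
```

Running this inside a forward hook on the attention module lets the rest of the model consume the quantization-degraded KV values, so the PPL/zero-shot numbers reflect KV4 error without needing QServe's fused kernels.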
All my questions are well answered, thanks!
Thanks for your great work!
I use lmquant to generate the checkpoints for Qwen1.5-72b-chat with
python -m lmquant.llm.run configs/llm.yaml configs/qoq/gchn.yaml --model-name path_to_my_checkpoints --smooth-xw-alpha 0.3 --smooth-xw-beta 0.7
and use run_e2e.sh in QServe to generate tokens, but the results seem wrong. Is there a way to do end-to-end inference without QServe and evaluate the accuracy of the W4A8KV4 algorithm alone?