Closed: cyLi-Tiger closed this issue 3 months ago
Hi @cyLi-Tiger, if you would like to evaluate the accuracy alone, you can just use lmquant to evaluate the fake-quant model. To evaluate wikitext-2 perplexity, you can use the command
python -m lmquant.llm.run configs/llm.yaml configs/qoq/gchn.yaml --model-name path_to_my_checkpoints --smooth-xw-alpha 0.3 --smooth-xw-beta 0.7
To evaluate zero-shot accuracy, you can use the command
python -m lmquant.llm.run configs/llm.yaml configs/qoq/gchn.yaml --model-name path_to_my_checkpoints --smooth-xw-alpha 0.3 --smooth-xw-beta 0.7 --eval-evaluator lm_eval --eval-tasks zero-shot
which will automatically add wikitext, hellaswag, piqa, winogrande, arc_easy, and arc_challenge to the evaluation tasks using lm_eval.
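For context, the wikitext-2 perplexity that lmquant reports is the standard metric: the exponential of the mean per-token negative log-likelihood. A minimal sketch of the metric itself (the NLL values here are toy numbers, not real model outputs):

```python
import math

def perplexity(nlls, n_tokens):
    # Perplexity is exp of the average per-token negative log-likelihood (in nats).
    return math.exp(sum(nlls) / n_tokens)

# Toy example: 4 tokens with hypothetical per-token NLLs.
nlls = [2.0, 1.5, 2.5, 2.0]
print(perplexity(nlls, len(nlls)))  # exp(2.0) ≈ 7.389
```

Lower perplexity means the (fake-)quantized model assigns higher probability to the held-out text, so comparing this number between the bf16 and W4A8 models isolates the accuracy impact of quantization.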
Thanks for your prompt reply! @synxlin
Taking PPL evaluation in QoQ as an example (here), the KV cache is unused, right? The scripts you provided above can't evaluate accuracy under KV4. Please correct me if I've missed something in your code, and thank you again!
Another question: per-group W4A8 with progressive quantization has scales mapping int4 back to int8. What if I want to use per-channel W4A8? How do I dequantize the weight from int4 to int8 before the GEMM, since we convert the weight from bf16 to int4 directly and don't have such a scale between int4 and int8?
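To make the question concrete, here is a toy sketch of the two-level (progressive) scheme, with hypothetical values and simple round-to-nearest; lmquant's actual rounding and clipping may differ:

```python
def quantize(x, scale, qmin, qmax):
    # Round-to-nearest with clipping to the target integer range.
    q = round(x / scale)
    return max(qmin, min(qmax, q))

# Stage 1: bf16 weight -> int8 using a floating-point per-channel scale s1.
w, s1 = 0.93, 0.01
q8 = quantize(w, s1, -128, 127)   # 93
# Stage 2: int8 -> int4 using an integer per-group scale s2.
s2 = 16
q4 = quantize(q8, s2, -8, 7)      # round(93 / 16) = 6
# Before the int8 GEMM, the int4 value is dequantized back to the int8 domain:
w8 = q4 * s2                      # 96
```

In per-channel W4A8 there is no stage 2: the weight goes from bf16 to int4 with a single floating-point scale, which is exactly why the int4-to-int8 scale the question asks about does not exist in that configuration.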
Hi @cyLi-Tiger.
We modify the apply_rotary_pos_emb function in the Attention forward here and register the quantization hook to quantize the KV cache here. In the Huggingface transformers package, the KV cache will directly append/concat the results of apply_rotary_pos_emb for further computation (see here and here). For the per-channel case, the scale between int4 and int8 is simply torch.ones (identity).
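The idea of the hook is to fake-quantize K and V right after RoPE, before they enter the cache. A library-free sketch of symmetric per-token int4 fake quantization (an assumed scheme for illustration; QoQ's actual KV4 quantizer granularity, asymmetry, and group size may differ):

```python
def fake_quant_int4(vec):
    # Symmetric int4 fake quantization of one token's K or V vector:
    # scale maps max |v| to 7, then quantize and immediately dequantize.
    amax = max(abs(v) for v in vec) or 1.0
    scale = amax / 7.0
    q = [max(-8, min(7, round(v / scale))) for v in vec]
    return [qi * scale for qi in q]

# Toy key vector; the dequantized result stays close to the original.
k = [0.7, -0.33, 0.12, 0.0]
k_q = fake_quant_int4(k)
```

Running this inside a forward hook on the attention module lets the rest of the model consume the quantization-degraded KV values, so the PPL/zero-shot numbers reflect KV4 error without needing QServe's fused kernels.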
All my questions are well answered, thanks!
Thanks for your great work!
I use lmquant to generate the checkpoints for Qwen1.5-72b-chat with
python -m lmquant.llm.run configs/llm.yaml configs/qoq/gchn.yaml --model-name path_to_my_checkpoints --smooth-xw-alpha 0.3 --smooth-xw-beta 0.7
and use run_e2e.sh in QServe to generate tokens, but the results seem wrong. Is there a way to do end-to-end inference without QServe and evaluate the accuracy of the W4A8KV4 algorithm alone?