intel / neural-compressor

SOTA low-bit LLM quantization (INT8/FP8/INT4/FP4/NF4) & sparsity; leading model compression techniques on TensorFlow, PyTorch, and ONNX Runtime
https://intel.github.io/neural-compressor/
Apache License 2.0

How to evaluate AWQ? #1980

Open chunniunai220ml opened 1 month ago

chunniunai220ml commented 1 month ago

https://github.com/intel/neural-compressor/blob/master/docs/source/quantization_weight_only.md#examples

how to set eval_func?

https://github.com/intel/neural-compressor/blob/master/examples/3.x_api/pytorch/nlp/huggingface_models/language-modeling/quantization/weight_only/run_clm_no_trainer.py

It seems there is no AWQ quantization there, only RTN and GPTQ. Also, as the README.md says, weight-only quantization is fake quantization, so why save the qmodel (user_model.save(args.output_dir))?

Kaihui-intel commented 1 month ago

Hello @chunniunai220ml, thanks for your interest in Intel(R) Neural Compressor. https://github.com/intel/neural-compressor/blob/master/docs/source/quantization_weight_only.md#examples describes the 2.x API. The 2.x example link is https://github.com/intel/neural-compressor/tree/master/examples/pytorch/nlp/huggingface_models/language-modeling/quantization/llm
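For reference, a minimal sketch of weight-only AWQ with the 2.x API and a user-defined eval_func, assuming an already-loaded Hugging Face model (`user_model`) and a calibration dataloader (`calib_dataloader`); the evaluation body (`compute_accuracy`) is a hypothetical placeholder, not part of the example script:

```python
import torch
from neural_compressor import PostTrainingQuantConfig, quantization

def eval_func(model):
    # Placeholder: run your own evaluation (e.g. lm-eval-harness) here and
    # return a single float score; fit() treats higher as better.
    model.eval()
    with torch.no_grad():
        return compute_accuracy(model)  # hypothetical helper

conf = PostTrainingQuantConfig(
    approach="weight_only",
    op_type_dict={
        ".*": {  # apply the same weight-only setting to all matched ops
            "weight": {
                "bits": 4,
                "group_size": 128,
                "scheme": "asym",
                "algorithm": "AWQ",
            },
        },
    },
)

# calib_dataloader is assumed to yield tokenized calibration batches.
q_model = quantization.fit(
    user_model,
    conf,
    calib_dataloader=calib_dataloader,
    eval_func=eval_func,
)
```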

chunniunai220ml commented 4 weeks ago


Thanks for your reply. I followed the 2.x example link, with the following bash script:

```bash
python -u run_clm_no_trainer.py \
    --model $model_path \
    --dataset ${DATASET_NAME} \
    --approach weight-only \
    --output_dir ${tuned_checkpoint} \
    --quantize \
    --batch_size ${batch_size} \
    --woq_algo AWQ \
    --calib_iters 128 \
    --woq_group_size 128 \
    --woq_bits 4 \
    --tasks hellaswag \
    --accuracy
```

At https://github.com/intel/neural-compressor/blob/master/examples/pytorch/nlp/huggingface_models/language-modeling/quantization/llm/run_clm_no_trainer.py#L355, it seems to evaluate the original model instead of the qmodel. If I want to evaluate the qmodel, can I just modify #L355 as:

```python
q_model.eval()
eval_args = LMEvalParser(
    model="hf",
    user_model=q_model,  # instead of user_model
    tokenizer=tokenizer,
    batch_size=args.batch_size,
    tasks=args.tasks,
)
```

As the README.md says, weight-only quantization is based on fake quantization, so why save the qmodel in #L338? I think the qmodel weights' dtype is not INT4 in storage. Also, run_clm_no_trainer.py only supports CPU; where is the multi-GPU support code?

Kaihui-intel commented 4 weeks ago

Sure, the q_model needs to be exported as a compressed model: https://github.com/intel/neural-compressor/blob/master/docs/source/quantization_weight_only.md#export-compressed-model
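For reference, a minimal sketch of that export step under the 2.x API, assuming q_model is the object returned by quantization.fit() and that export_compressed_model() returns the packed torch model; the optional export arguments are documented in the link above:

```python
import torch

# Pack the fake-quantized (fp32-stored) weights into a real low-bit checkpoint.
compressed_model = q_model.export_compressed_model()
torch.save(compressed_model.state_dict(), "saved_results/compressed_model.pt")
```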

You can refer to https://github.com/intel/intel-extension-for-transformers/tree/v1.5/examples/huggingface/pytorch/text-generation/quantization (v1.5) to quantize an INT4 model; it has this compressed-model export integrated. It also includes GPU scripts.

As for the 3.x API, please stay tuned.

chunniunai220ml commented 4 weeks ago


Does it work well on an NVIDIA V100? The README.md seems to describe only the Intel GPU installation.

Besides, when running on CPU, it is strange that the process always gets killed for no apparent reason after processing several blocks.

Kaihui-intel commented 4 weeks ago

I suggest you try the 3.x API; there, q_model is already the exported compressed model.

We will soon update the 3.x example, which supports automatic device detection: https://github.com/intel/neural-compressor/tree/kaihui/woq_3x_eg. However, we haven't tested the performance on NVIDIA GPUs.

On the dev branch: https://github.com/intel/neural-compressor/tree/kaihui/woq_3x_eg/examples/3.x_api/pytorch/nlp/huggingface_models/language-modeling/quantization/weight_only
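For reference, a hedged sketch of the 3.x prepare/convert flow for AWQ, assuming a loaded Hugging Face model (`user_model`); the AWQConfig fields shown, the calibration helper `run_fn`, and the save call are assumptions to verify against the dev-branch example above:

```python
import torch
from neural_compressor.torch.quantization import AWQConfig, prepare, convert

quant_config = AWQConfig(bits=4, group_size=128, use_sym=False)

# AWQ needs example inputs so the decoder blocks can be traced during preparation.
example_inputs = torch.ones([1, 512], dtype=torch.long)
user_model = prepare(model=user_model, quant_config=quant_config, example_inputs=example_inputs)

# run_fn is a placeholder calibration loop that feeds a few batches through the model.
run_fn(user_model)

user_model = convert(user_model)
user_model.save("saved_results")  # assumed save entry point in the 3.x example
```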

chunniunai220ml commented 4 weeks ago


I checked out the kaihui/woq_3x_eg branch and ran:

```bash
CUDA_VISIBLE_DEVICES="2" python run_clm_no_trainer.py \
    --model $model_path \
    --woq_algo AWQ \
    --woq_bits 4 \
    --woq_group_size 128 \
    --calib_iters 128 \
    --woq_scheme asym \
    --quantize \
    --batch_size 1 \
    --tasks wikitext \
    --accuracy
```

with the model loaded via AutoModelForCausalLM.from_pretrained(device='cuda'). In neural-compressor/neural_compressor/torch/algorithms/weight_only/awq.py line 240, in block_calibration, model(*args, **kwargs) is called with inputs on CPU, so this error is reported: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
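For reference, a minimal workaround sketch for the mismatch described above (not from the example script): move the calibration batches to the model's device before they are fed in; `user_model`, `calib_dataloader`, and the batch key are placeholders:

```python
import torch

device = next(user_model.parameters()).device  # cuda:0 when the model lives on GPU

def run_fn(model):
    # Placeholder calibration loop: send each batch to the model's device so
    # block_calibration's model(*args, **kwargs) sees CUDA tensors, not CPU ones.
    with torch.no_grad():
        for i, batch in enumerate(calib_dataloader):
            if i >= 128:  # matches --calib_iters 128
                break
            model(batch["input_ids"].to(device))
```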

But there is another bug in eval:

```
from intel_extension_for_transformers.transformers.llm.evaluation.lm_eval import evaluate, LMEvalParser
  File "/*/anaconda3/lib/python3.11/site-packages/intel_extension_for_transformers/transformers/__init__.py", line 19, in <module>
    from .config import (
  File "/8/anaconda3/lib/python3.11/site-packages/intel_extension_for_transformers/transformers/config.py", line 21, in <module>
    from neural_compressor.conf.config import (
ModuleNotFoundError: No module named 'neural_compressor.conf'
```

Also, how do I load saved_results/quantmodel.pt to evaluate it?
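For reference, a hedged sketch of how the 2.x LLM example reloads a saved weight-only checkpoint; the `weight_only` flag, the `get_user_model()` helper, and the directory layout are assumptions to verify against the version of run_clm_no_trainer.py you are using:

```python
import os
from neural_compressor.utils.pytorch import load

# Rebuild the fp32 model first (the example does this via get_user_model()),
# then restore the quantized weights from the saved checkpoint directory.
user_model, tokenizer = get_user_model()  # helper defined in the example script
user_model = load(
    os.path.abspath(os.path.expanduser("saved_results")),
    user_model,
    weight_only=True,  # assumed flag for weight-only checkpoints
)
user_model.eval()
```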

pengxin99 commented 2 weeks ago

Hi @chunniunai220ml, trying an older version such as 2.6 may solve this issue: ModuleNotFoundError: No module named 'neural_compressor.conf'.