Open chunniunai220ml opened 1 month ago
Hello @chunniunai220ml, thanks for your interest in Intel(R) Neural Compressor. The document at https://github.com/intel/neural-compressor/blob/master/docs/source/quantization_weight_only.md#examples describes the 2.x API. The 2.x example is at https://github.com/intel/neural-compressor/tree/master/examples/pytorch/nlp/huggingface_models/language-modeling/quantization/llm
Thanks for your reply. I followed the 2.x example link with the following bash script:

    python -u run_clm_no_trainer.py \
        --model $model_path \
        --dataset ${DATASET_NAME} \
        --approach weight-only \
        --output_dir ${tuned_checkpoint} \
        --quantize \
        --batch_size ${batch_size} \
        --woq_algo AWQ \
        --calib_iters 128 \
        --woq_group_size 128 \
        --woq_bits 4 \
        --tasks hellaswag \
        --accuracy

At https://github.com/intel/neural-compressor/blob/master/examples/pytorch/nlp/huggingface_models/language-modeling/quantization/llm/run_clm_no_trainer.py#L355, the script seems to evaluate only the original model instead of q_model. If I want to evaluate q_model, can I just modify #L355 as follows?

    q_model.eval()
    eval_args = LMEvalParser(
        model="hf",
        user_model=q_model,  # user_model
        tokenizer=tokenizer,
        batch_size=args.batch_size,
        tasks=args.tasks,
    )
As the readme.md says, weight-only quantization is based on fake quantization, so why save q_model at #L338? I think the q_model weights are not stored as INT4. Also, run_clm_no_trainer.py only supports CPU; where is the multi-GPU support code?
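To make the storage question concrete, here is a minimal pure-Python sketch of the difference between fake quantization and packed INT4 storage. It is an illustration under stated assumptions, not Neural Compressor's implementation: `fake_quantize_group` and `pack_int4` are hypothetical helper names, and the scheme shown is a generic asymmetric min/scale quantizer over a single group.

```python
import struct


def fake_quantize_group(weights, bits=4):
    """Asymmetric quantize-dequantize of one weight group (illustrative only).

    Returns (codes, dequantized, scale): integer codes in [0, 2**bits - 1],
    the float values a fake-quantized model would hold, and the step size.
    """
    qmax = (1 << bits) - 1
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / qmax or 1.0  # constant group: any nonzero scale works
    codes = [round((w - w_min) / scale) for w in weights]
    dequantized = [c * scale + w_min for c in codes]
    return codes, dequantized, scale


def pack_int4(codes):
    """Pack 4-bit codes two per byte, low nibble first (hypothetical layout)."""
    padded = list(codes) + [0] * (len(codes) % 2)
    return bytes(lo | (hi << 4) for lo, hi in zip(padded[::2], padded[1::2]))


weights = [0.12, -0.53, 0.98, -0.07, 0.33, -0.91, 0.45, 0.60]
codes, fake, scale = fake_quantize_group(weights)

# Fake-quantized weights are still floats, so they occupy float32 storage.
fp32_bytes = len(struct.pack(f"{len(fake)}f", *fake))
# Packing the integer codes is what actually shrinks the storage.
packed = pack_int4(codes)

print("codes:", codes)
print("fake-quantized storage:", fp32_bytes, "bytes")
print("packed INT4 storage:", len(packed), "bytes (plus scale and minimum per group)")
```

In this sketch the fake-quantized values take at most 16 distinct levels per group but remain floats; only a separate packing step reduces the stored size.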
Sure, the q_model needs to be exported as a compressed model: https://github.com/intel/neural-compressor/blob/master/docs/source/quantization_weight_only.md#export-compressed-model
You can refer to https://github.com/intel/intel-extension-for-transformers/tree/v1.5/examples/huggingface/pytorch/text-generation/quantization (v1.5) to quantize an INT4 model; it integrates this compressed-model export and also includes GPU scripts.
Stay tuned for the 3.x API.
Does it work well on an NVIDIA V100? The readme.md seems to describe only the Intel GPU installation.
Besides, when running on CPU, it is strange that the process is always killed for no apparent reason after processing several blocks.
I suggest you try the 3.x API, where q_model is the exported compressed model.
We will soon update the 3.x example, which supports automatic device detection: https://github.com/intel/neural-compressor/tree/kaihui/woq_3x_eg However, we haven't tested the performance on NVIDIA GPUs.
I checked out the kaihui/woq_3x_eg branch and ran:

    CUDA_VISIBLE_DEVICES="2" python run_clm_no_trainer.py \
        --model $model_path \
        --woq_algo AWQ \
        --woq_bits 4 \
        --woq_group_size 128 \
        --calib_iters 128 \
        --woq_scheme asym \
        --quantize \
        --batch_size 1 \
        --tasks wikitext \
        --accuracy

With AutoModelForCausalLM.from_pretrained(device='cuda'), at neural-compressor/neural_compressor/torch/algorithms/weight_only/awq.py, line 240, in block_calibration: model(*args, **kwargs), the inputs are on the CPU, so this error is reported: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
But there is another bug in eval:
from intel_extension_for_transformers.transformers.llm.evaluation.lm_eval import evaluate, LMEvalParser
File "/*/anaconda3/lib/python3.11/site-packages/intel_extension_for_transformers/transformers/__init__.py", line 19, in <module>
Also, how do I load saved_results/quantmodel.pt for evaluation?
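The device mismatch reported above (cuda:0 vs cpu) is the usual symptom of calling a model whose parameters are on one device with inputs on another. A generic workaround is to move every input to the model's device before the call. The sketch below is not the library's fix: `move_to_device` and `call_on_model_device` are hypothetical helpers, and they only assume that the model exposes `parameters()` and that tensors expose `.to(device)`, as PyTorch modules and tensors do.

```python
def move_to_device(obj, device):
    """Recursively move anything with a .to() method, nested in dicts,
    lists, or tuples, to `device`. Other objects are returned unchanged."""
    if hasattr(obj, "to"):
        return obj.to(device)
    if isinstance(obj, dict):
        return {key: move_to_device(value, device) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(move_to_device(value, device) for value in obj)
    return obj


def call_on_model_device(model, *args, **kwargs):
    """Run model(*args, **kwargs) with all inputs moved to the device of the
    model's first parameter (assumes the model lives on a single device)."""
    device = next(model.parameters()).device
    return model(*move_to_device(args, device), **move_to_device(kwargs, device))
```

With PyTorch, a call such as `model(*args, **kwargs)` would become `call_on_model_device(model, *args, **kwargs)`. Whether that is appropriate inside the calibration loop is an assumption that depends on how the library manages block placement.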
Hi @chunniunai220ml, trying an older version such as 2.6 may solve this issue:
ModuleNotFoundError: No module named 'neural_compressor.conf'
https://github.com/intel/neural-compressor/blob/master/docs/source/quantization_weight_only.md#examples
How do I set eval_func?
https://github.com/intel/neural-compressor/blob/master/examples/3.x_api/pytorch/nlp/huggingface_models/language-modeling/quantization/weight_only/run_clm_no_trainer.py
It seems there is no AWQ quantization there, just RTN and GPTQ. And as the readme.md says, weight-only quantization is fake quantization, so why save q_model (user_model.save(args.output_dir))?