intel / neural-compressor

SOTA low-bit LLM quantization (INT8/FP8/INT4/FP4/NF4) & sparsity; leading model compression techniques on TensorFlow, PyTorch, and ONNX Runtime
https://intel.github.io/neural-compressor/
Apache License 2.0

How to load quantized LLM and do inference? #1776

Closed · 0400H closed this issue 4 months ago

0400H commented 4 months ago

I quantized the meta-llama/Llama-2-7b-hf model via this example and got the GPTQ-quantized outputs:

saved_results
    |- best_model.pt (26G)  gptq_config.json (2.9G)  qconfig.json

  1. I have not found a tutorial for loading the quantized LLM model. Could you give me some help?

  2. After loading the quantized LLM, I will use optimum-benchmark to run some benchmarks.

  3. Why is the quantized model (26G) bigger than the unquantized model (12.6G)?

xiguiw commented 4 months ago

Hi @0400H,

Thanks for trying neural-compressor.

> I quantized the meta-llama/Llama-2-7b-hf model via this example and got the GPTQ-quantized outputs:
>
> saved_results
>     |- best_model.pt (26G)  gptq_config.json (2.9G)  qconfig.json
>
> 1. I have not found a tutorial for loading the quantized LLM model. Could you give me some help?

The run_clm_no_trainer.py example quantizes, stores, and loads the model.

https://github.com/intel/neural-compressor/blob/v2.6.dev0/examples/pytorch/nlp/huggingface_models/language-modeling/quantization/llm/run_clm_no_trainer.py#L336 neural-compressor saves the model with q_model.save(args.output_dir).

https://github.com/intel/neural-compressor/blob/v2.6.dev0/examples/pytorch/nlp/huggingface_models/language-modeling/quantization/llm/run_clm_no_trainer.py#L343 loads the quantized model from output_dir:

from neural_compressor.utils.pytorch import load
user_model = load(os.path.abspath(os.path.expanduser(args.output_dir)))
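For reference, a minimal sketch of loading the saved model and running inference, mirroring the call above. The prompt and generation settings are illustrative, and depending on the INC version and quantization approach, load() may also expect the original FP32 model as a second argument:

import os
import torch
from transformers import AutoTokenizer
from neural_compressor.utils.pytorch import load

output_dir = "./saved_results"  # folder holding best_model.pt / qconfig.json

# Restore the quantized model exactly as the example does.
user_model = load(os.path.abspath(os.path.expanduser(output_dir)))

# The loaded object is expected to behave like the original HF causal-LM model.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
inputs = tokenizer("Hello, my name is", return_tensors="pt")
with torch.no_grad():
    output_ids = user_model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))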

> 2. After loading the quantized LLM, I will use optimum-benchmark to run some benchmarks.
> 3. Why is the quantized model (26G) bigger than the unquantized model (12.6G)?

Could you show your full command and log? The output directory also contains intermediate data; which file did you check as the 'quantized model'? For LLMs, we recommend weight-only quantization.
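For reference, a minimal weight-only quantization sketch with the INC 2.x Python API. RTN is used here because it needs no calibration data; GPTQ additionally requires a calibration dataloader, and the op_type_dict values below are assumptions based on the weight-only quantization docs:

from transformers import AutoModelForCausalLM
from neural_compressor import PostTrainingQuantConfig, quantization

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Weight-only: only Linear weights are quantized to 4 bit, activations stay in floating point.
conf = PostTrainingQuantConfig(
    approach="weight_only",
    op_type_dict={
        ".*": {  # apply to all matching op types
            "weight": {
                "bits": 4,
                "group_size": 128,
                "scheme": "asym",
                "algorithm": "RTN",
            },
        },
    },
)
q_model = quantization.fit(model, conf)
q_model.save("./saved_results")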

xin3he commented 4 months ago

Hi @0400H, welcome to Intel Neural Compressor. The model you saved is a 32-bit fake-quantized model, which is why it is larger than the FP16 original (roughly 7B parameters at 4 bytes each is about 26G, versus about 13G at 2 bytes each). Executing q_model.export_compressed_model() is required to get the same packed parameters as AutoGPTQ.

https://github.com/intel/neural-compressor/blob/7b8aec00d0c09bd499076457b68903229e09b803/docs/source/quantization_weight_only.md?plain=1#L135-L138
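A minimal sketch of that export step, assuming q_model is the object returned by quantization.fit in the example; the output file name is arbitrary:

import torch

# Pack the FP32 fake-quantized weights into real low-bit tensors
# (AutoGPTQ-style packed parameters), then persist the result.
compressed_model = q_model.export_compressed_model()
torch.save(compressed_model.state_dict(), "llama2_7b_woq_compressed.pt")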

xin3he commented 4 months ago

Intel Neural Compressor does not provide leading inference performance for weight-only quantization; it aims to support popular model compression techniques across all mainstream deep learning frameworks at the Python level. For SOTA performance, we provide another easy-to-use repo, intel-extension-for-transformers, which builds on Intel Neural Compressor and provides a transformers-like API. Welcome to try it~
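For anyone landing here, a rough sketch of that transformers-like API in intel-extension-for-transformers. The class and argument names are taken from its README at the time and may have changed, so treat them as assumptions to verify:

from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Drop-in replacement for the HF class; 4-bit weight-only quantization
# is applied while loading.
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)

inputs = tokenizer("Hello, my name is", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))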

0400H commented 4 months ago

> The model you saved is a 32-bit fake-quantized model. Executing q_model.export_compressed_model() is required to get the same packed parameters as AutoGPTQ.
>
> https://github.com/intel/neural-compressor/blob/7b8aec00d0c09bd499076457b68903229e09b803/docs/source/quantization_weight_only.md?plain=1#L135-L138

> • After loading the quantized LLM, I will use optimum-benchmark to run some benchmarks.
> • Why is the quantized model (26G) bigger than the unquantized model (12.6G)?

@xiguiw

python run_clm_no_trainer.py \
    --model meta-llama/Llama-2-7b-hf \
    --dataset NeelNanda/pile-10k \
    --seed 0 \
    --quantize \
    --approach weight_only \
    --woq_algo GPTQ \
    --woq_bits 4 \
    --woq_scheme asym \
    --woq_group_size 128 \
    --gptq_pad_max_length 2048 \
    --gptq_use_max_length

0400H commented 4 months ago

> The model you saved is a 32-bit fake-quantized model. Executing q_model.export_compressed_model() is required to get the same packed parameters as AutoGPTQ.
>
> https://github.com/intel/neural-compressor/blob/7b8aec00d0c09bd499076457b68903229e09b803/docs/source/quantization_weight_only.md?plain=1#L135-L138

> Intel Neural Compressor does not provide leading inference performance for weight-only quantization; it aims to support popular model compression techniques across all mainstream deep learning frameworks at the Python level. For SOTA performance, we provide another easy-to-use repo, intel-extension-for-transformers, which builds on Intel Neural Compressor and provides a transformers-like API. Welcome to try it~

@xin3he Thanks for your explanation of "saved_results"; I have switched to Intel Extension for Transformers.