Closed: @0400H closed this issue 4 months ago.
Hi @0400H,
Thanks for trying neural-compressor.
I quantized the meta-llama/Llama-2-7b-hf model via this example and got the GPTQ-quantized outputs:
saved_results/: best_model.pt (26G), gptq_config.json (2.9G), qconfig.json
- But I have not found a tutorial on loading the quantized LLM model. Could you give me some help? The run_clm_no_trainer.py example quantizes, stores, and loads the model.
- https://github.com/intel/neural-compressor/blob/v2.6.dev0/examples/pytorch/nlp/huggingface_models/language-modeling/quantization/llm/run_clm_no_trainer.py#L336 — neural-compressor saves the model: `q_model.save(args.output_dir)`
- https://github.com/intel/neural-compressor/blob/v2.6.dev0/examples/pytorch/nlp/huggingface_models/language-modeling/quantization/llm/run_clm_no_trainer.py#L343 — it loads the quantized model from the output_dir: `from neural_compressor.utils.pytorch import load; user_model = load(os.path.abspath(os.path.expanduser(args.output_dir)))`
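A small stdlib-only note on the load call above: the example expands and absolutizes `output_dir` before passing it to `load`, so both relative and `~`-style paths work. A sketch (the directory name here is hypothetical, chosen to match the saved_results layout above):

```python
import os

# Hypothetical output directory, as in the example's saved_results layout.
output_dir = "~/saved_results"

# expanduser resolves "~", abspath makes the path absolute; the result is
# the form that neural_compressor's load(...) receives in the example.
resolved = os.path.abspath(os.path.expanduser(output_dir))

print(os.path.isabs(resolved))  # True
```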
- After loading the quantized LLM, I will use optimum-benchmark to run some benchmarks.
- Why is the quantized model (26G) bigger than the unquantized model (12.6G)?
Could you share your command details and log? Those files contain intermediate data. Which model file did you check as the 'quantized model'? For LLM quantization, we recommend weight-only quantization.
Hi @0400H, welcome to Intel Neural Compressor.
The model you saved is a 32-bit fake-quantized model. Executing `q_model.export_compressed_model()`
is required to get the same packed parameters as AutoGPTQ.
Intel Neural Compressor does not aim at peak performance for weight-only quantization; it aims to support popular model compression techniques across all mainstream deep learning frameworks at the Python level. For SOTA performance, we provide another easy-to-use repo, intel-extension-for-transformers, which inherits from Intel Neural Compressor and provides Transformers-like APIs. Welcome to try it!
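The size gap asked about above follows directly from this: a fake-quantized checkpoint stores every 4-bit code dequantized back to float32, so it is as large as the original fp32 weights until `export_compressed_model()` packs the codes. A stdlib-only sketch of the roughly 8x difference (toy numbers, not INC's actual storage format):

```python
import struct

# Pretend these are 4-bit GPTQ weight codes (each in 0..15).
codes = [3, 15, 0, 7, 12, 1, 9, 4]

# Fake quantization keeps each weight as a dequantized float32 value:
fake_quant_bytes = len(codes) * struct.calcsize("f")  # 4 bytes per weight

# Packed storage fits eight 4-bit codes into one 32-bit word:
packed = 0
for i, c in enumerate(codes):
    packed |= (c & 0xF) << (4 * i)
packed_bytes = struct.calcsize("I")

print(fake_quant_bytes, packed_bytes)  # 32 4 -> packing is 8x smaller

# Unpacking recovers the codes losslessly:
unpacked = [(packed >> (4 * i)) & 0xF for i in range(len(codes))]
assert unpacked == codes
```

This is why best_model.pt (26G) can exceed the fp16 checkpoint (12.6G): fp32 fake-quantized weights are twice the size of fp16 ones, plus the saved auxiliary data.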
@xiguiw
```shell
python run_clm_no_trainer.py \
    --model meta-llama/Llama-2-7b-hf \
    --dataset NeelNanda/pile-10k \
    --seed 0 \
    --quantize \
    --approach weight_only \
    --woq_algo GPTQ \
    --woq_bits 4 \
    --woq_scheme asym \
    --woq_group_size 128 \
    --gptq_pad_max_length 2048 \
    --gptq_use_max_length
```
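For context on the flags above, `--woq_bits 4 --woq_scheme asym --woq_group_size 128` means each group of 128 weights gets its own scale and zero point. A hypothetical, stdlib-only sketch of asymmetric group-wise quantization (illustrative only; INC's actual GPTQ kernel additionally uses Hessian-based error compensation):

```python
def quantize_group(weights, bits=4):
    """Asymmetric min/max quantization of one weight group."""
    qmax = (1 << bits) - 1                        # 15 for 4-bit
    wmin, wmax = min(weights), max(weights)
    scale = (wmax - wmin) / qmax or 1.0           # guard against zero range
    zero = round(-wmin / scale)                   # zero point maps wmin -> 0
    q = [min(qmax, max(0, round(w / scale) + zero)) for w in weights]
    deq = [(v - zero) * scale for v in q]         # dequantize for inference
    return q, deq, scale

# Toy stand-in for one group of 128 weights.
group = [0.12, -0.50, 0.33, 0.00, 0.25]
q, deq, scale = quantize_group(group)
print(q)  # all codes fall in 0..15
```

Because the zero point shifts the range asymmetrically, `asym` represents min and max exactly, which usually helps skewed weight distributions compared with symmetric schemes.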
@xin3he Thanks for your explanation about "saved_results"; I have switched to Intel Extension for Transformers.