intel / neural-compressor

SOTA low-bit LLM quantization (INT8/FP8/INT4/FP4/NF4) & sparsity; leading model compression techniques on TensorFlow, PyTorch, and ONNX Runtime
https://intel.github.io/neural-compressor/
Apache License 2.0

No size reduction for weight-only quantization #1302

Closed: visheratin closed this issue 10 months ago

visheratin commented 1 year ago

Hi!

I'm trying to quantize a ViT ONNX model using examples from the repo and the HF model cards. My code is as follows:

from neural_compressor import PostTrainingQuantConfig, quantization
import os

model_dir = "/path/to/model"
model = "image.onnx"

# Weight-only RTN quantization: 4-bit asymmetric weights, group size 32, applied to all op types.
config = PostTrainingQuantConfig(
    domain="nlp",
    approach="weight_only",
    calibration_sampling_size=[8],
    op_type_dict={
        ".*": {
            "weight": {
                "bits": 4,
                "algorithm": ["RTN"],
                "scheme": ["asym"],
                "group_size": 32,
            }
        }
    },
)
q_model = quantization.fit(
    os.path.join(model_dir, model),
    config,
)
q_model_name = "image_int4.onnx"
q_model.save(os.path.join(model_dir, q_model_name))

The output looks fine:

2023-10-08 13:37:24 [INFO] Start auto tuning.
2023-10-08 13:37:24 [INFO] Quantize model without tuning!
2023-10-08 13:37:24 [INFO] Quantize the model with default configuration without evaluating the model.                To perform the tuning process, please either provide an eval_func or provide an                    eval_dataloader an eval_metric.
2023-10-08 13:37:24 [INFO] Adaptor has 5 recipes.
2023-10-08 13:37:24 [INFO] 0 recipes specified by user.
2023-10-08 13:37:24 [INFO] 3 recipes require future tuning.
2023-10-08 13:37:24 [INFO] *** Initialize auto tuning
2023-10-08 13:37:24 [INFO] {
2023-10-08 13:37:24 [INFO]     'PostTrainingQuantConfig': {
2023-10-08 13:37:24 [INFO]         'AccuracyCriterion': {
2023-10-08 13:37:24 [INFO]             'criterion': 'relative',
2023-10-08 13:37:24 [INFO]             'higher_is_better': True,
2023-10-08 13:37:24 [INFO]             'tolerable_loss': 0.01,
2023-10-08 13:37:24 [INFO]             'absolute': None,
2023-10-08 13:37:24 [INFO]             'keys': <bound method AccuracyCriterion.keys of <neural_compressor.config.AccuracyCriterion object at 0x7f0774cc7ee0>>,
2023-10-08 13:37:24 [INFO]             'relative': 0.01
2023-10-08 13:37:24 [INFO]         },
2023-10-08 13:37:24 [INFO]         'approach': 'post_training_weight_only',
2023-10-08 13:37:24 [INFO]         'backend': 'default',
2023-10-08 13:37:24 [INFO]         'calibration_sampling_size': [
2023-10-08 13:37:24 [INFO]             8
2023-10-08 13:37:24 [INFO]         ],
2023-10-08 13:37:24 [INFO]         'device': 'cpu',
2023-10-08 13:37:24 [INFO]         'diagnosis': False,
2023-10-08 13:37:24 [INFO]         'domain': 'nlp',
2023-10-08 13:37:24 [INFO]         'example_inputs': None,
2023-10-08 13:37:24 [INFO]         'excluded_precisions': [
2023-10-08 13:37:24 [INFO]         ],
2023-10-08 13:37:24 [INFO]         'framework': 'onnxruntime',
2023-10-08 13:37:24 [INFO]         'inputs': [
2023-10-08 13:37:24 [INFO]         ],
2023-10-08 13:37:24 [INFO]         'model_name': '',
2023-10-08 13:37:24 [INFO]         'ni_workload_name': 'quantization',
2023-10-08 13:37:24 [INFO]         'op_name_dict': None,
2023-10-08 13:37:24 [INFO]         'op_type_dict': {
2023-10-08 13:37:24 [INFO]             '.*': {
2023-10-08 13:37:24 [INFO]                 'weight': {
2023-10-08 13:37:24 [INFO]                     'bits': [
2023-10-08 13:37:24 [INFO]                         4
2023-10-08 13:37:24 [INFO]                     ],
2023-10-08 13:37:24 [INFO]                     'algorithm': [
2023-10-08 13:37:24 [INFO]                         'RTN'
2023-10-08 13:37:24 [INFO]                     ],
2023-10-08 13:37:25 [INFO]                     'scheme': [
2023-10-08 13:37:25 [INFO]                         'asym'
2023-10-08 13:37:25 [INFO]                     ],
2023-10-08 13:37:25 [INFO]                     'group_size': [
2023-10-08 13:37:25 [INFO]                         32
2023-10-08 13:37:25 [INFO]                     ]
2023-10-08 13:37:25 [INFO]                 }
2023-10-08 13:37:25 [INFO]             }
2023-10-08 13:37:25 [INFO]         },
2023-10-08 13:37:25 [INFO]         'outputs': [
2023-10-08 13:37:25 [INFO]         ],
2023-10-08 13:37:25 [INFO]         'quant_format': 'default',
2023-10-08 13:37:25 [INFO]         'quant_level': 'auto',
2023-10-08 13:37:25 [INFO]         'recipes': {
2023-10-08 13:37:25 [INFO]             'smooth_quant': False,
2023-10-08 13:37:25 [INFO]             'smooth_quant_args': {
2023-10-08 13:37:25 [INFO]             },
2023-10-08 13:37:25 [INFO]             'layer_wise_quant': False,
2023-10-08 13:37:25 [INFO]             'layer_wise_quant_args': {
2023-10-08 13:37:25 [INFO]             },
2023-10-08 13:37:25 [INFO]             'fast_bias_correction': False,
2023-10-08 13:37:25 [INFO]             'weight_correction': False,
2023-10-08 13:37:25 [INFO]             'gemm_to_matmul': True,
2023-10-08 13:37:25 [INFO]             'graph_optimization_level': None,
2023-10-08 13:37:25 [INFO]             'first_conv_or_matmul_quantization': True,
2023-10-08 13:37:25 [INFO]             'last_conv_or_matmul_quantization': True,
2023-10-08 13:37:25 [INFO]             'pre_post_process_quantization': True,
2023-10-08 13:37:25 [INFO]             'add_qdq_pair_to_weight': False,
2023-10-08 13:37:25 [INFO]             'optypes_to_exclude_output_quant': [
2023-10-08 13:37:25 [INFO]             ],
2023-10-08 13:37:25 [INFO]             'dedicated_qdq_pair': False,
2023-10-08 13:37:25 [INFO]             'rtn_args': {
2023-10-08 13:37:25 [INFO]             },
2023-10-08 13:37:25 [INFO]             'awq_args': {
2023-10-08 13:37:25 [INFO]             },
2023-10-08 13:37:25 [INFO]             'gptq_args': {
2023-10-08 13:37:25 [INFO]             },
2023-10-08 13:37:25 [INFO]             'teq_args': {
2023-10-08 13:37:25 [INFO]             }
2023-10-08 13:37:25 [INFO]         },
2023-10-08 13:37:25 [INFO]         'reduce_range': None,
2023-10-08 13:37:25 [INFO]         'TuningCriterion': {
2023-10-08 13:37:25 [INFO]             'max_trials': 100,
2023-10-08 13:37:25 [INFO]             'objective': [
2023-10-08 13:37:25 [INFO]                 'performance'
2023-10-08 13:37:25 [INFO]             ],
2023-10-08 13:37:25 [INFO]             'strategy': 'basic',
2023-10-08 13:37:25 [INFO]             'strategy_kwargs': None,
2023-10-08 13:37:25 [INFO]             'timeout': 0
2023-10-08 13:37:25 [INFO]         },
2023-10-08 13:37:25 [INFO]         'use_bf16': True
2023-10-08 13:37:25 [INFO]     }
2023-10-08 13:37:25 [INFO] }
2023-10-08 13:37:25 [WARNING] [Strategy] Please install `mpi4py` correctly if using distributed tuning; otherwise, ignore this warning.
2023-10-08 13:37:25 [WARNING] Graph optimization level is automatically set to ENABLE_EXTENDED. You can use 'recipe' argument in 'PostTrainingQuantConfig'to overwrite it
2023-10-08 13:37:27 [INFO] Do not evaluate the baseline and quantize the model with default configuration.
2023-10-08 13:37:27 [INFO] Quantize the model with default config.
2023-10-08 13:37:30 [INFO] |******Mixed Precision Statistics******|
2023-10-08 13:37:30 [INFO] +---------+-------+-----------+--------+
2023-10-08 13:37:30 [INFO] | Op Type | Total |  A32W4G32 |  FP32  |
2023-10-08 13:37:30 [INFO] +---------+-------+-----------+--------+
2023-10-08 13:37:30 [INFO] |  MatMul |   85  |     73    |   12   |
2023-10-08 13:37:30 [INFO] +---------+-------+-----------+--------+
2023-10-08 13:37:30 [INFO] Pass quantize model elapsed time: 3501.01 ms
2023-10-08 13:37:30 [INFO] Save tuning history to /home/alexvish/src/laion-nllb-repo/train/nc_workspace/2023-10-08_12-45-42/./history.snapshot.
2023-10-08 13:37:31 [INFO] [Strategy] Found the model meets accuracy requirements, ending the tuning process.
2023-10-08 13:37:31 [INFO] Specified timeout or max trials is reached! Found a quantized model which meet accuracy goal. Exit.
2023-10-08 13:37:31 [INFO] Save deploy yaml to /home/alexvish/src/laion-nllb-repo/train/nc_workspace/2023-10-08_12-45-42/deploy.yaml

But the resulting file is about 200 KB smaller than the original. When I explore the model using Netron, the weights are still float32 and the values are the same as in the original model:

[Screenshot: Netron view showing the MatMul weights still stored as float32 with unchanged values]

The model (gzipped) can be found here.

Could you please help me to understand how to use weight-only quantization properly?
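
To quantify the size observation above, a minimal sanity-check sketch (reusing model_dir, model, and q_model_name from the code snippet earlier in this comment) would be:

import os

# Compare on-disk sizes of the original and the "quantized" ONNX files.
for name in (model, q_model_name):
    path = os.path.join(model_dir, name)
    print(f"{name}: {os.path.getsize(path) / 1e6:.2f} MB")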

mengniwang95 commented 1 year ago

Hi @visheratin, due to a limitation of the ONNX Runtime INT4 kernel at the start of weight-only quantization development, we only quantize the weight of MatMul to INT4 and then dequantize it back to FP32 to validate weight-only accuracy. If you want to use the INT4 kernel, please install the latest onnxruntime 1.16.0 and install INC from the mengni/1.16 branch; then you will see UINT8 weights in MatMulFpQ4 ops.
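
As a rough way to verify the re-quantized model (a sketch assuming the onnx Python package is installed and using the output filename from the first snippet; this is not part of the INC API), one can count the MatMulFpQ4 nodes and inspect the dtype of their weight initializers:

import onnx
from onnx import TensorProto

# Load the weight-only quantized model and look for MatMulFpQ4 nodes.
m = onnx.load("image_int4.onnx")
q4_nodes = [n for n in m.graph.node if n.op_type == "MatMulFpQ4"]
print(f"MatMulFpQ4 nodes: {len(q4_nodes)}")

# Print the data type of each initializer feeding the first few MatMulFpQ4 nodes
# (expected to be UINT8 for the packed weights rather than FLOAT).
init_dtype = {init.name: init.data_type for init in m.graph.initializer}
for node in q4_nodes[:3]:
    for inp in node.input:
        if inp in init_dtype:
            print(inp, TensorProto.DataType.Name(init_dtype[inp]))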

visheratin commented 1 year ago

Thank you, @mengniwang95! The quantization worked.

But now, when I try to run the quantized model, I get the following error:

2023-10-09 10:58:31.283093401 [E:onnxruntime:, sequential_executor.cc:514 ExecuteKernel] Non-zero status code returned while running MatMulFpQ4 node. Name:'/vision_model/encoder/layers.0/self_attn/q_proj/MatMul_Q4' Status Message: /onnxruntime_src/onnxruntime/contrib_ops/cpu/matmul_fpq4.cc:55 virtual onnxruntime::common::Status onnxruntime::contrib::MatMulFpQ4::Compute(onnxruntime::OpKernelContext*) const buf_size > 0 was false. Operator MatMulFpQ4 not yet supported on this hardware platform.

The code I'm using is a regular ONNX session run:

import os

import onnxruntime

# model_dir, q_model_name, and image_inputs are defined in the earlier snippets.
session_options = onnxruntime.SessionOptions()
session_options.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL
ort_session = onnxruntime.InferenceSession(
    os.path.join(model_dir, q_model_name), session_options, providers=["CPUExecutionProvider"]
)
ort_inputs = {
    "pixel_values": image_inputs["pixel_values"].numpy(),
}
image_outs = ort_session.run(None, ort_inputs)

I have ONNX Runtime version 1.16.0 installed:

> pip list | grep onnxruntime
onnxruntime                  1.16.0               
onnxruntime-extensions       0.4.2                
onnxruntime-gpu              1.16.0               
onnxruntime-tools            1.7.0                
zjc664656505 commented 1 year ago

I got the same issue:

failed:Node (/query_key_value/MatMul_Q4) Op (MatMulFpQ4) [ShapeInferenceError] 4b quantization not yet supported on this hardware platform!

@mengniwang95 Could you help me deal with this issue?

My onnxruntime version is 1.16.0.

Update

This issue seems not directly related to Neural Compressor but to onnxruntime. I have posted a more detailed issue in the onnxruntime repo: issue_link. If anyone knows how to solve this issue, please help us with it.

Thanks!

mengniwang95 commented 1 year ago

Hi @visheratin @zjc664656505, please try inference on a CPU platform; it seems this op can only run on CPU for now. "On a CPU platform" means CPU-only hardware, not merely uninstalling onnxruntime-gpu or using only the CPUExecutionProvider.

visheratin commented 1 year ago

After upgrading to ONNX Runtime 1.16.1, the model loads but now fails when running:

[ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Non-zero status code returned while running MatMulFpQ4 node. Name:'/vision_model/encoder/layers.0/self_attn/q_proj/MatMul_Q4' Status Message: /onnxruntime_src/onnxruntime/contrib_ops/cpu/matmul_fpq4.cc:55 virtual onnxruntime::common::Status onnxruntime::contrib::MatMulFpQ4::Compute(onnxruntime::OpKernelContext*) const buf_size > 0 was false. Operator MatMulFpQ4 not yet supported on this hardware platform.

Here is the Google Colab to reproduce the issue. I used a CPU machine, i.e. CPU-only hardware.

mengniwang95 commented 1 year ago

Hi @visheratin, I am sorry, I can't open Google Drive due to some restrictions.

I also ran some tests in my local environment; below are my results:

ort 1.16.0, CPU machine: can load and run
ort 1.16.1, GPU machine: can't load

Could you share your INT4 model so I can check it?

visheratin commented 1 year ago

Here is the link to the model - https://drive.google.com/uc?export=download&id=1wDhmp2iVXFDcLvILSRMdHcKYKx955Obu

I tried both 1.16.0 and 1.16.1; the error is the same.

The code to reproduce:

from PIL import Image
from transformers import CLIPProcessor
import onnxruntime
import requests

# Preprocess a sample image with the CLIP image processor.
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
processor = processor.image_processor
image_path = "https://huggingface.co/spaces/jjourney1125/swin2sr/resolve/main/samples/butterfly.jpg"
image = Image.open(requests.get(image_path, stream=True).raw)
image_inputs = processor(images=image, return_tensors="pt")

# Run the INT4-quantized model with the CPU execution provider.
session_options = onnxruntime.SessionOptions()
session_options.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL
ort_session = onnxruntime.InferenceSession("model.onnx", session_options, providers=["CPUExecutionProvider"])
ort_inputs = {
    "pixel_values": image_inputs["pixel_values"].numpy(),
}
image_outs = ort_session.run(None, ort_inputs)

zjc664656505 commented 1 year ago

I'm also facing the same issue. I have uninstalled onnxruntime-gpu and my conda environment only has onnxruntime==1.16.0 for now. If possible, may I know what hardware platform you are currently running the INT4-quantized model on?

My current hardware platform is a 2.6 GHz 6-Core Intel Core i7.

Thanks!

@mengniwang95

zjc664656505 commented 1 year ago

I think the reason MatMulFpQ4 is not working is that it is currently supported only on a limited set of CPU platforms, and onnxruntime has not officially released full support for it. I have posted an issue to their repo but have not heard anything back yet.

mengniwang95 commented 1 year ago

Hi @visheratin, I can run your code with your model successfully.

My hardware platform is an Intel(R) Xeon(R) Platinum 8260L CPU @ 2.40GHz. @zjc664656505

zjc664656505 commented 1 year ago

Hi @mengniwang95, may I ask whether you installed onnxruntime through pip or built it manually in your local environment? With the onnxruntime installed through pip, I'm seeing:

onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Non-zero status code returned while running MatMulFpQ4 node. Name:'/vision_model/encoder/layers.0/self_attn/q_proj/MatMul_Q4' Status Message: /onnxruntime_src/onnxruntime/contrib_ops/cpu/matmul_fpq4.cc:55 virtual onnxruntime::common::Status onnxruntime::contrib::MatMulFpQ4::Compute(onnxruntime::OpKernelContext*) const buf_size > 0 was false. Operator MatMulFpQ4 not yet supported on this hardware platform.

This was tested in an isolated conda environment where onnxruntime-gpu is not installed and the onnxruntime version is 1.16.0. My hardware platform is an AMD Ryzen Threadripper 3970X 32-Core Processor. Is it possible that this issue is due to the hardware platform? For example, onnxruntime may not currently support the MatMulFpQ4 operator on AMD platforms.

mengniwang95 commented 1 year ago

Hi @zjc664656505, I installed onnxruntime through pip.

I read the source code of onnxruntime and found it requires the processor to support AVX512 core features:
https://github.com/microsoft/onnxruntime/blob/209b6dbd975efbc792b5ca9ae1dd74b828559148/onnxruntime/contrib_ops/cpu/matmul_fpq4.cc#L55C21-L55C21
https://github.com/microsoft/onnxruntime/blob/209b6dbd975efbc792b5ca9ae1dd74b828559148/onnxruntime/core/mlas/lib/q4_dq.cpp#L37C18-L37C18
https://github.com/microsoft/onnxruntime/blob/209b6dbd975efbc792b5ca9ae1dd74b828559148/onnxruntime/core/mlas/lib/platform.cpp#L401
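
For anyone checking their own machine, a rough Linux-only sketch is to look for AVX512 feature flags in /proc/cpuinfo (the exact set of features MLAS gates on may differ from the ones listed here):

# Rough check for AVX512 support on Linux; flag names are illustrative, not the exact MLAS requirements.
with open("/proc/cpuinfo") as f:
    cpu_flags = f.read()
for feature in ("avx512f", "avx512bw", "avx512vl", "avx512vnni"):
    print(feature, "present" if feature in cpu_flags else "missing")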

zjc664656505 commented 1 year ago

Interesting. It seems that onnxruntime currently places a strict hardware restriction on the INT4 operation. I will double-check my hardware specification. Thank you so much @mengniwang95!

chensuyue commented 10 months ago

Please re-open the issue if you still have it.

hhxxttxsh commented 9 months ago

> Hi @visheratin, due to a limitation of the ONNX Runtime INT4 kernel at the start of weight-only quantization development, we only quantize the weight of MatMul to INT4 and then dequantize it back to FP32 to validate weight-only accuracy. If you want to use the INT4 kernel, please install the latest onnxruntime 1.16.0 and install INC from the mengni/1.16 branch; then you will see UINT8 weights in MatMulFpQ4 ops.

Hi @mengniwang95, the mengni/1.16 branch is gone now. I saw your commit in the main branch; will that suffice for running with the INT4 kernel?