Hi @visheratin, due to the limitations of the INT4 ORT kernel at the start of weight-only quantization development, we only quantize the weights of MatMul to INT4 and then dequantize them to FP32 to validate weight-only accuracy. If you want to use the INT4 kernel, please install the latest onnxruntime 1.16.0 and install INC from the mengni/1.16 branch; then you will see UINT8 weights in MatMulFpQ4 ops.
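For reference, the weight-only flow on that branch might look roughly like the sketch below. The config keys follow INC's weight-only examples; the exact keys, the group_size, and the file paths are illustrative and may differ between INC versions.
# Sketch: INT4 weight-only quantization of an ONNX model with INC (illustrative config).
from neural_compressor import PostTrainingQuantConfig, quantization
conf = PostTrainingQuantConfig(
    approach="weight_only",
    op_type_dict={
        ".*": {  # matched op types (here effectively MatMul)
            "weight": {
                "bits": 4,          # INT4 weights
                "group_size": 32,   # per-group quantization
                "scheme": "sym",
                "algorithm": "RTN", # round-to-nearest
            },
        },
    },
)
q_model = quantization.fit("model.onnx", conf)  # path to the FP32 ONNX model (placeholder)
q_model.save("model_int4.onnx")                 # quantized model containing MatMulFpQ4 ops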
Thank you, @mengniwang95! The quantization worked.
But now, when I try to run the quantized model, I get the following error:
2023-10-09 10:58:31.283093401 [E:onnxruntime:, sequential_executor.cc:514 ExecuteKernel] Non-zero status code returned while running MatMulFpQ4 node. Name:'/vision_model/encoder/layers.0/self_attn/q_proj/MatMul_Q4' Status Message: /onnxruntime_src/onnxruntime/contrib_ops/cpu/matmul_fpq4.cc:55 virtual onnxruntime::common::Status onnxruntime::contrib::MatMulFpQ4::Compute(onnxruntime::OpKernelContext*) const buf_size > 0 was false. Operator MatMulFpQ4 not yet supported on this hardware platform.
The code I'm using is a regular ONNX session run:
import os
import onnxruntime
session_options = onnxruntime.SessionOptions()
session_options.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL
# model_dir and q_model_name point to the quantized model; image_inputs comes from the CLIP image processor
ort_session = onnxruntime.InferenceSession(os.path.join(model_dir, q_model_name), session_options, providers=['CPUExecutionProvider'])
ort_inputs = {
    "pixel_values": image_inputs['pixel_values'].numpy(),
}
image_outs = ort_session.run(None, ort_inputs)
I have ONNX Runtime version 1.16.0 installed:
> pip list | grep onnxruntime
onnxruntime 1.16.0
onnxruntime-extensions 0.4.2
onnxruntime-gpu 1.16.0
onnxruntime-tools 1.7.0
I got the same issue:
failed:Node (/query_key_value/MatMul_Q4) Op (MatMulFpQ4) [ShapeInferenceError] 4b quantization not yet supported on this hardware platform!
@mengniwang95 Could you help me deal with this issue?
My onnxruntime version is 1.16.0.
This issue seems not directly related to Neural Compressor but to onnxruntime. I have posted a more detailed issue in the onnxruntime repository: issue_link. If anyone knows how to solve it, please help us.
Thanks!
Hi @visheratin @zjc664656505, please try inference on a CPU platform; it seems this op can only run on CPU for now. "On a CPU platform" means CPU-only hardware, not just uninstalling onnxruntime-gpu or using only the CPUExecutionProvider.
After upgrading to ONNX Runtime 1.16.1, the model loads but then fails when running:
[ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Non-zero status code returned while running MatMulFpQ4 node. Name:'/vision_model/encoder/layers.0/self_attn/q_proj/MatMul_Q4' Status Message: /onnxruntime_src/onnxruntime/contrib_ops/cpu/matmul_fpq4.cc:55 virtual onnxruntime::common::Status onnxruntime::contrib::MatMulFpQ4::Compute(onnxruntime::OpKernelContext*) const buf_size > 0 was false. Operator MatMulFpQ4 not yet supported on this hardware platform.
Here is the Google Colab to reproduce. I used a CPU machine, which implies CPU-only hardware.
Hi @visheratin, I am sorry, but I can't open Google Drive due to some access restrictions.
I also did some tests in my local environment, and below are my results:
ort 1.16.0, CPU machine: can load and run
ort 1.16.1, GPU machine: can't load
Could you share your INT4 model so I can check it?
Here is the link to the model - https://drive.google.com/uc?export=download&id=1wDhmp2iVXFDcLvILSRMdHcKYKx955Obu
I tried both 1.16.0 and 1.16.1; the error is the same.
The code to reproduce:
from transformers import AutoTokenizer, CLIPProcessor
import requests
from PIL import Image
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
processor = processor.image_processor
image_path = "https://huggingface.co/spaces/jjourney1125/swin2sr/resolve/main/samples/butterfly.jpg"
image = Image.open(requests.get(image_path, stream=True).raw)
image_inputs = processor(images=image, return_tensors="pt")
import onnxruntime
session_options = onnxruntime.SessionOptions()
session_options.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL
ort_session = onnxruntime.InferenceSession("model.onnx", session_options, providers=['CPUExecutionProvider'])
ort_inputs = {
    "pixel_values": image_inputs['pixel_values'].numpy(),
}
image_outs = ort_session.run(None, ort_inputs)
I'm also facing the same issue. I have uninstalled onnxruntime-gpu, and my conda environment only has onnxruntime==1.16.0 now. If possible, may I know which hardware platform you are running the INT4 quantized model on?
My current hardware platform is 2.6 GHz 6-Core Intel Core i7.
Thanks!
@mengniwang95
I think the reason MatMulFpQ4 is not working is that it is currently only supported on a limited set of CPU platforms, and onnxruntime has not officially released full support for it. I have filed an issue in their GitHub repo but have not heard anything back yet.
Hi @visheratin , I can run your code with your model successfully.
My hardware platform is Intel(R) Xeon(R) Platinum 8260L CPU @ 2.40GHz @zjc664656505
Hi @mengniwang95, may I ask whether you installed onnxruntime through pip or built it manually in your local environment? Using the onnxruntime installed through pip, I'm getting:
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Non-zero status code returned while running MatMulFpQ4 node. Name:'/vision_model/encoder/layers.0/self_attn/q_proj/MatMul_Q4' Status Message: /onnxruntime_src/onnxruntime/contrib_ops/cpu/matmul_fpq4.cc:55 virtual onnxruntime::common::Status onnxruntime::contrib::MatMulFpQ4::Compute(onnxruntime::OpKernelContext*) const buf_size > 0 was false. Operator MatMulFpQ4 not yet supported on this hardware platform.
This was tested in an isolated conda environment where onnxruntime-gpu is not installed and the onnxruntime version is 1.16.0. My hardware platform is an AMD Ryzen Threadripper 3970X 32-Core Processor. Is it possible that this issue is due to the hardware platform? For example, onnxruntime may not currently support the AMD platform for the MatMulFpQ4 operator.
Hi @zjc664656505, I installed onnxruntime through pip.
I read the source code of onnxruntime and found that it requires the processor to support AVX512 core features:
https://github.com/microsoft/onnxruntime/blob/209b6dbd975efbc792b5ca9ae1dd74b828559148/onnxruntime/contrib_ops/cpu/matmul_fpq4.cc#L55C21-L55C21
https://github.com/microsoft/onnxruntime/blob/209b6dbd975efbc792b5ca9ae1dd74b828559148/onnxruntime/core/mlas/lib/q4_dq.cpp#L37C18-L37C18
https://github.com/microsoft/onnxruntime/blob/209b6dbd975efbc792b5ca9ae1dd74b828559148/onnxruntime/core/mlas/lib/platform.cpp#L401
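One quick way to check on Linux whether the CPU exposes those features is to look at /proc/cpuinfo. The exact flag set required is an assumption based on the linked platform.cpp; avx512f/bw/dq/vl is the usual "AVX512 core" group.
# Check /proc/cpuinfo (Linux x86) for the assumed "AVX512 core" flag set.
flags = set()
with open("/proc/cpuinfo") as f:
    for line in f:
        if line.startswith("flags"):
            flags.update(line.split(":", 1)[1].split())
            break
required = {"avx512f", "avx512bw", "avx512dq", "avx512vl"}  # assumed AVX512 core features
missing = required - flags
print("AVX512 core supported:", not missing, "| missing:", sorted(missing))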
Interesting. It seems that the current onnxruntime imposes a strict hardware restriction for the INT4 operation. I will double-check my hardware specification. Thank you so much, @mengniwang95!
Please re-open the issue if you still have it.
Hi @mengniwang95, the mengni/1.16 branch is gone now. I saw your commit in the main branch; is that going to be sufficient for running with the INT4 kernel?
Hi!
I'm trying to quantize a ViT model to ONNX using examples from the repo and HF model cards. My code is as follows:
The output looks fine:
But the resulting file is about 200 KB smaller than the original. When I explore the model using Netron, the weights are still float32 and the values are the same as in the original model:
The model (gzipped) can be found here.
Could you please help me to understand how to use weight-only quantization properly?
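One way to confirm whether a saved model was actually converted (MatMulFpQ4 nodes with packed UINT8 weights) or still contains plain FP32 MatMuls, without opening Netron, is a short script like the sketch below; "model.onnx" is a placeholder path.
# Sketch: count MatMulFpQ4 nodes and check initializer dtypes in an ONNX model.
import onnx
from onnx import TensorProto
model = onnx.load("model.onnx")  # placeholder path
op_counts = {}
for node in model.graph.node:
    op_counts[node.op_type] = op_counts.get(node.op_type, 0) + 1
print("MatMulFpQ4 nodes:", op_counts.get("MatMulFpQ4", 0))
print("MatMul nodes:", op_counts.get("MatMul", 0))
# Packed INT4 weights appear as UINT8 initializers; un-quantized weights stay FLOAT.
n_uint8 = sum(1 for t in model.graph.initializer if t.data_type == TensorProto.UINT8)
n_fp32 = sum(1 for t in model.graph.initializer if t.data_type == TensorProto.FLOAT)
print(f"UINT8 initializers: {n_uint8}, FP32 initializers: {n_fp32}")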