I quantized a simple CNN model in PyTorch and converted it to ONNX. When I tested the runtime of the int8 model and the fp32 model on CPU, the int8 model was slower. Here is my code:
Google Colab
The inference time of each model over 1000 runs:
FP32: 10.28 secs
INT8: 15.52 secs
Can somebody explain this problem, please?
To reproduce
Here is my shortened ipynb:
!pip install onnx onnxruntime
import torch
import torch.nn as nn
class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()
        self.conv = nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1)
        self.conv1 = nn.Conv2d(16, 16, kernel_size=3, stride=2, padding=1)
        self.conv2 = nn.Conv2d(16, 16, kernel_size=3, stride=2, padding=1)
        self.conv3 = nn.Conv2d(16, 16, kernel_size=3, stride=2, padding=1)
        self.conv4 = nn.Conv2d(16, 16, kernel_size=3, stride=2, padding=1)
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        x = self.conv(x)
        x = self.conv1(x)
        x = self.conv2(x)
        x = self.conv3(x)
        x = self.conv4(x)
        x = self.dequant(x)
        return x
model_fp32 = Model().to('cpu')
input_fp32 = torch.randn(4, 3, 640, 640).to('cpu')
# model must be set to eval mode for static quantization logic to work
model_fp32.eval()
model_fp32.qconfig = torch.ao.quantization.get_default_qconfig('x86')
print(model_fp32)
model_fp32_prepared = torch.ao.quantization.prepare(model_fp32)
# calibrate the prepared model to determine quantization parameters for activations
# in a real world setting, the calibration would be done with a representative dataset
model_fp32_prepared(input_fp32)
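# (Shortened here: in a full run, calibration would loop over a representative
# dataset rather than a single random batch, e.g.
#     for batch in calibration_loader:
#         model_fp32_prepared(batch)
# where calibration_loader is a hypothetical DataLoader of real inputs.)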
model_int8 = torch.ao.quantization.convert(model_fp32_prepared)
print(model_int8)
# run the model, relevant calculations will happen in int8
res = model_int8(input_fp32)
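# Optional sanity check (not in the original notebook): the dequantized int8
# output should stay close to the fp32 output
with torch.no_grad():
    max_err = (model_fp32(input_fp32) - res).abs().max()
print("max abs quantization error:", max_err.item())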
import torch.onnx

# Export the FP32 model to ONNX as the baseline for the comparison
input_shape = (1, 3, 640, 640)         # input shape of the model
input_names = ["input"]                # names for input nodes
output_names = ["output"]              # names for output nodes
onnx_filename = "original_model.onnx"  # filename for the FP32 ONNX model

# Provide an example input as a dummy input for tracing
dummy_input = torch.randn(input_shape)

torch.onnx.export(model_fp32, dummy_input, onnx_filename,
                  input_names=input_names, output_names=output_names)
print("Model saved as", onnx_filename)
import onnxruntime
import numpy as np
import time

# Load the FP32 ONNX model
onnx_model = onnxruntime.InferenceSession("original_model.onnx")

# Create random input data
input_data = np.random.rand(1, 3, 640, 640).astype(np.float32)

# Time 1000 runs of the FP32 model
start_time = time.time()
for _ in range(1000):
    output = onnx_model.run(None, {"input": input_data})
end_time = time.time()

inference_time = end_time - start_time
print("FP32 inference time:", inference_time, "seconds")
# Load the quantized ONNX model
onnx_qt_model = onnxruntime.InferenceSession("quantized_model.onnx")

# Time 1000 runs of the INT8 model
start_time = time.time()
for _ in range(1000):
    output = onnx_qt_model.run(None, {"input": input_data})
end_time = time.time()

inference_time = end_time - start_time
print("INT8 inference time:", inference_time, "seconds")
Urgency: I have a project deadline.
Platform: Windows
OS Version: 10
ONNX Runtime Installation: Released Package
ONNX Runtime Version or Commit ID: 1.17.1
ONNX Runtime API: Python
Architecture: X64
Execution Provider: Default CPU
Execution Provider Library Version: No response
Model File: None
Is this a quantized model? Yes