microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

[Performance] INT8 quantized model runs slower than FP32 model #20052

Open minhhotboy9x opened 5 months ago

minhhotboy9x commented 5 months ago

Describe the issue

I quantized a simple CNN model in PyTorch and converted it to ONNX. When I tested the runtime of the INT8 model and the FP32 model on CPU, the INT8 model was slower. Here is my code: Google Colab. The inference time for 1000 runs of each model:

FP32: 10.28 secs
INT8: 15.52 secs

Can somebody explain this problem, please?
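To see what the exporter actually produced (which largely determines CPU speed), the operator types in each exported graph can be counted with the onnx package. A minimal sketch, using the file names exported in the code below:

import onnx
from collections import Counter

# Count the operator types in each exported graph; the quantized graph is
# expected to contain quantization-related nodes (e.g. QuantizeLinear /
# DequantizeLinear) rather than only plain Conv.
for path in ["original_model.onnx", "quantized_model.onnx"]:
    graph = onnx.load(path).graph
    print(path, Counter(node.op_type for node in graph.node))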

To reproduce

Here is my shortened ipynb:

!pip install onnx onnxruntime
import torch
import torch.nn as nn
import torchvision.models as models
class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.quant = torch.quantization.QuantStub()
        # self.model = YOLO('yolov8n_relu.pt').model.model[0]
        self.conv = nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1)
        self.conv1 = nn.Conv2d(16, 16, kernel_size=3, stride=2, padding=1)
        self.conv2 = nn.Conv2d(16, 16, kernel_size=3, stride=2, padding=1)
        self.conv3 = nn.Conv2d(16, 16, kernel_size=3, stride=2, padding=1)
        self.conv4 = nn.Conv2d(16, 16, kernel_size=3, stride=2, padding=1)
        # self.act = nn.ReLU()
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        x = self.conv(x)
        x = self.conv1(x)
        x = self.conv2(x)
        x = self.conv3(x)
        x = self.conv4(x)
        x = self.dequant(x)
        return x

model_fp32 = Model().to('cpu')
input_fp32 = torch.randn(4, 3, 640, 640).to('cpu')
# model must be set to eval mode for static quantization logic to work
model_fp32.eval()

model_fp32.qconfig = torch.ao.quantization.get_default_qconfig('x86')
print(model_fp32)
model_fp32_prepared = torch.ao.quantization.prepare(model_fp32)

# calibrate the prepared model to determine quantization parameters for activations
# in a real world setting, the calibration would be done with a representative dataset

model_fp32_prepared(input_fp32)
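# In a real setting the calibration above would loop over a representative
# dataset rather than a single random tensor, e.g. (sketch only; calib_loader
# is a hypothetical DataLoader yielding (image, label) batches):
#   with torch.no_grad():
#       for calib_images, _ in calib_loader:
#           model_fp32_prepared(calib_images)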
model_int8 = torch.ao.quantization.convert(model_fp32_prepared)

print(model_int8)
# run the model, relevant calculations will happen in int8
res = model_int8(input_fp32)

import torch.onnx

# Export the FP32 and quantized models to ONNX format
input_shape = (1, 3, 640, 640)  # Input shape of the model
input_names = ["input"]  # Names for input nodes
output_names = ["output"]  # Names for output nodes
onnx_filename = "original_model.onnx"  # Filename for the ONNX model

# Provide example inputs as dummy inputs for tracing
dummy_input = torch.randn(input_shape)

# Export the FP32 model to ONNX
torch.onnx.export(model_fp32, dummy_input, onnx_filename, input_names=input_names, output_names=output_names)

print("Model saved as", onnx_filename)
onnx_qt_filename = "quantized_model.onnx"
torch.onnx.export(model_int8, dummy_input, onnx_qt_filename, input_names=input_names, output_names=output_names)
print("Quantized model saved as", onnx_qt_filename)
import onnxruntime
import numpy as np
import time

# Load the FP32 ONNX model
onnx_model = onnxruntime.InferenceSession("original_model.onnx")

# Create random input data
input_data = np.random.rand(1, 3, 640, 640).astype(np.float32)

# Run the model
start_time = time.time()
for _ in range(1000):
    output = onnx_model.run(None, {"input": input_data})  # Replace "input" with the actual input name of your model
end_time = time.time()

# Calculate the inference time
inference_time = end_time - start_time
print("Inference time:", inference_time, "seconds")
# Load the ONNX model
onnx_qt_model = onnxruntime.InferenceSession("quantized_model.onnx")

# Run the model
start_time = time.time()
for _ in range(1000):
    output = onnx_qt_model.run(None, {"input": input_data})  # Replace "input" with the actual input name of your model
end_time = time.time()

# Calculate the inference time
inference_time = end_time - start_time
print("Inference time:", inference_time, "seconds")

Urgency

I have a project deadline.

Platform

Windows

OS Version

10

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.17.1

ONNX Runtime API

Python

Architecture

X64

Execution Provider

Default CPU

Execution Provider Library Version

No response

Model File

None

Is this a quantized model?

Yes

github-actions[bot] commented 4 months ago

This issue has been automatically marked as stale due to inactivity and will be closed in 30 days if no further activity occurs. If further support is needed, please provide an update and/or more details.