Open rosrad opened 4 years ago
I also get lower performance for the quantized version, running single threaded on CPU (2950X). This is with onnxruntime 1.5.2 installed with pip3, running on Debian Linux.
import os

# Set before importing onnxruntime so OpenMP is limited to one thread.
os.environ["OMP_NUM_THREADS"] = "1"

import time

import numpy as np
import onnxruntime


def benchmark(model_path):
    # Force single-threaded, sequential execution with all graph optimizations.
    options = onnxruntime.SessionOptions()
    options.inter_op_num_threads = 1
    options.intra_op_num_threads = 1
    options.execution_mode = onnxruntime.ExecutionMode.ORT_SEQUENTIAL
    options.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL
    options.optimized_model_filepath = ""
    session = onnxruntime.InferenceSession(model_path, sess_options=options)
    input_name = session.get_inputs()[0].name
    outputs = None
    total = 0.0
    runs = 5
    input_data = np.zeros((1, 224, 224, 3), np.float32)
    for i in range(runs):
        start = time.perf_counter()
        outputs = session.run([], {input_name: input_data})
        end = (time.perf_counter() - start) * 1000
        total += end
        print(f"{end:.2f}ms")
    total /= runs
    # Mean over all runs. The results below were printed from `end`,
    # so their "Avg" line shows the last run's time.
    print(f"Avg: {total:.2f}ms")


print("Unquantized")
benchmark("resnet50_v1.onnx")
print("Quantized")
benchmark("resnet50_v1_quant.onnx")
Results:
Unquantized
117.04ms
118.80ms
118.28ms
118.15ms
122.48ms
Avg: 122.48ms
Quantized
140.96ms
140.86ms
139.76ms
139.81ms
139.58ms
Avg: 139.58ms
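Note that each "Avg" line above equals the last timing in its block rather than the mean of the five runs. A minimal sketch of a timing helper that skips warm-up runs and reports the mean and median is shown here. The names `time_runs` and `summarize` and the default run counts are illustrative assumptions, not part of the original script.

```python
import statistics
import time


def time_runs(fn, runs=20, warmup=3):
    """Call fn() repeatedly and return the timed runs in milliseconds.

    Warm-up calls are executed but not timed, so one-off costs such as
    lazy initialization do not skew the result.
    """
    for _ in range(warmup):
        fn()
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append((time.perf_counter() - start) * 1000)
    return timings


def summarize(timings):
    """Return (mean, median) of a list of timings."""
    return statistics.mean(timings), statistics.median(timings)


if __name__ == "__main__":
    # Stand-in workload. With a real session this could be, for example:
    #   lambda: session.run(None, {input_name: input_data})
    timings = time_runs(lambda: sum(range(10_000)), runs=5, warmup=1)
    mean_ms, median_ms = summarize(timings)
    print(f"Mean: {mean_ms:.2f}ms, median: {median_ms:.2f}ms")
```

Reporting the median alongside the mean makes a single slow run (such as the 122.48ms outlier above) easier to spot.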
This is still the case with 1.6.0 as well.
Has anyone solved this?
Running the calibration and quantization script from the E2E example for both the mobilenet and resnet models results in a ValueError.
Output for Resnet:
python run.py --input_model resnet50-v1-9.onnx --output_model resnet50-v1-9.quant.onnx --calibrate_dataset ./test_images/
Calibrated,quantized parameters calculated and returned.
Warning: The original model opset version is 9, which does not support quantization. Please update the model to opset >= 11. Updating the model automatically to opset 11. Please verify the quantized model.
Traceback (most recent call last):
  File "run.py", line 112, in <module>
    main()
  File "run.py", line 101, in main
    quantize_static(input_model_path, output_model_path, dr)
  File "/home/ubuntu/.python3_pytorch16/lib/python3.8/site-packages/onnxruntime/quantization/quantize.py", line 186, in quantize_static
    quantizer.quantize_model()
  File "/home/ubuntu/.python3_pytorch16/lib/python3.8/site-packages/onnxruntime/quantization/onnx_quantizer.py", line 290, in quantize_model
    op_quantizer.quantize()
  File "/home/ubuntu/.python3_pytorch16/lib/python3.8/site-packages/onnxruntime/quantization/operators/matmul.py", line 69, in quantize
    self.quantizer.quantize_inputs(node, [0, 1])
  File "/home/ubuntu/.python3_pytorch16/lib/python3.8/site-packages/onnxruntime/quantization/onnx_quantizer.py", line 822, in quantize_inputs
    quantize_input_nodes = self._get_quantize_input_nodes(node, input_index, self.input_qType)
  File "/home/ubuntu/.python3_pytorch16/lib/python3.8/site-packages/onnxruntime/quantization/onnx_quantizer.py", line 585, in _get_quantize_input_nodes
    raise ValueError(
ValueError: Quantization parameters are not specified for param flatten_473.In static mode quantization params for inputs and outputs of nodes to be quantized are required.
Output for mobilenet:
python run.py --input_model mobilenetv2-7.onnx --output_model mobilenetv2-7.quant.onnx --calibrate_dataset ./test_images/
Calibrated,quantized parameters calculated and returned.
Warning: The original model opset version is 10, which does not support node fusions. Please update the model to opset >= 11 for better performance.
Traceback (most recent call last):
  File "run.py", line 112, in <module>
    main()
  File "run.py", line 101, in main
    quantize_static(input_model_path, output_model_path, dr)
  File "/home/ubuntu/.python3_pytorch16/lib/python3.8/site-packages/onnxruntime/quantization/quantize.py", line 186, in quantize_static
    quantizer.quantize_model()
  File "/home/ubuntu/.python3_pytorch16/lib/python3.8/site-packages/onnxruntime/quantization/onnx_quantizer.py", line 290, in quantize_model
    op_quantizer.quantize()
  File "/home/ubuntu/.python3_pytorch16/lib/python3.8/site-packages/onnxruntime/quantization/operators/matmul.py", line 69, in quantize
    self.quantizer.quantize_inputs(node, [0, 1])
  File "/home/ubuntu/.python3_pytorch16/lib/python3.8/site-packages/onnxruntime/quantization/onnx_quantizer.py", line 822, in quantize_inputs
    quantize_input_nodes = self._get_quantize_input_nodes(node, input_index, self.input_qType)
  File "/home/ubuntu/.python3_pytorch16/lib/python3.8/site-packages/onnxruntime/quantization/onnx_quantizer.py", line 585, in _get_quantize_input_nodes
    raise ValueError(
ValueError: Quantization parameters are not specified for param 472.In static mode quantization params for inputs and outputs of nodes to be quantized are required.
Are these versions of the ONNX models not yet supported for quantization?
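Both runs above warn about the model's opset (9 and 10) before failing. As a rough sketch, the opset can be checked and upgraded ahead of quantization with the `onnx` package's version converter. This only addresses the opset warnings; nothing in the thread shows that it resolves the ValueError itself. The helper names `default_opset` and `upgrade_opset` are hypothetical, and the target of 11 is taken from the warning text.

```python
def default_opset(opset_imports):
    """Return the default-domain opset from (domain, version) pairs, or None."""
    for domain, version in opset_imports:
        # The default ONNX operator set is registered as "" or "ai.onnx".
        if domain in ("", "ai.onnx"):
            return version
    return None


def upgrade_opset(input_path, output_path, target=11):
    """Write a copy of the model converted to `target` opset if it is older.

    Returns the opset the model ends up with. If the model already meets
    the target, nothing is written and the current opset is returned.
    """
    # Imported lazily so default_opset() can be used without onnx installed.
    import onnx
    from onnx import version_converter

    model = onnx.load(input_path)
    current = default_opset((o.domain, o.version) for o in model.opset_import)
    if current is not None and current >= target:
        return current
    converted = version_converter.convert_version(model, target)
    onnx.checker.check_model(converted)
    onnx.save(converted, output_path)
    return target
```

A call such as `upgrade_opset("resnet50-v1-9.onnx", "resnet50-v1-9.opset11.onnx")` would then produce the file to pass to the quantization script; the output file name here is only an example.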
@kanthagirish-rit, a change was made in the master branch to support those models. Could you please try our nightly build ("pip install -i https://test.pypi.org/simple/ ort-nightly") or build from source?
@yufenglee Thanks for the quick update. The fix works for both mobilenetv2 and resnet50! With this fix I no longer see slower inference times for the quantized versions. Is this final, or are more changes coming for the slowness issue?
@rosrad and @emilianavt, this fix gives me better inference times for the quantized version of mobilenetv2 (around 60 ms) than for the unquantized version (100 ms) on a Raspberry Pi 4. Do try this fix.
With ort-nightly I get matching inference times for the quantized and unquantized resnet50. A dynamically quantized version of the model from #5586 still experiences a significant slowdown (18ms to 37ms), but since it is dynamically quantized it may best be left for a separate issue.
@kanthagirish-rit, yes, we are still working on improving the quantization performance. @emilianavt, dynamic quantization is not recommended for CNN models.
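For readers who want to try static quantization on a CNN as suggested, here is a minimal sketch using `quantize_static` with a calibration data reader. The helper names `calibration_feeds`, `quantize_cnn_static`, and `ListReader` are hypothetical, and the `activation_type` and `weight_type` keyword arguments are assumed to be available in the installed onnxruntime version; check the signature of your release before relying on them.

```python
def calibration_feeds(input_name, samples):
    """Wrap each sample in the {input_name: array} dict a session expects."""
    return [{input_name: sample} for sample in samples]


def quantize_cnn_static(fp32_path, int8_path, feeds):
    """Statically quantize the model at fp32_path using `feeds` for calibration."""
    # Imported lazily: onnxruntime is only needed when actually quantizing.
    from onnxruntime.quantization import (
        CalibrationDataReader,
        QuantType,
        quantize_static,
    )

    class ListReader(CalibrationDataReader):
        """Returns the given feeds one by one, then None to signal the end."""

        def __init__(self, items):
            self._items = iter(items)

        def get_next(self):
            return next(self._items, None)

    quantize_static(
        fp32_path,
        int8_path,
        ListReader(feeds),
        activation_type=QuantType.QUInt8,  # assumption: uint8 activations
        weight_type=QuantType.QInt8,       # assumption: int8 weights
    )


# Example call (file and input names are placeholders):
#   feeds = calibration_feeds("input", list_of_preprocessed_images)
#   quantize_cnn_static("model.onnx", "model.quant.onnx", feeds)
```

The calibration samples should be representative, preprocessed inputs; an all-zero tensor like the one used for timing would give meaningless activation ranges.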
Using a build of e9d03983fc63828f84a9dcc8538392e1b22b46bb, which includes the fix for #5586 (thank you!), I have been able to run the model from that issue with static quantization instead of dynamic quantization. I still get a noticeable slowdown:
Unquantized
18.19ms
18.15ms
17.83ms
17.76ms
18.15ms
Avg: 18.15ms
Quantized
32.85ms
31.37ms
31.11ms
31.37ms
31.38ms
Avg: 31.38ms
The slowdown still exists in 1.7.0 installed from pip.
Using onnxruntime 1.11.0, the results from my previous comment still hold for both uint8 and int8 static quantization:
Unquantized
18.59ms
17.52ms
17.81ms
17.42ms
17.41ms
Avg: 17.41ms
Quantized
29.10ms
28.64ms
28.68ms
29.25ms
28.64ms
Avg: 28.64ms
Quantized int8
29.02ms
28.79ms
29.39ms
29.04ms
29.01ms
Avg: 29.01ms
The quantize_static model is still slower than the unquantized version.
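One way to see where the extra time goes is onnxruntime's built-in profiler, which writes a JSON trace per session. The sketch below sums kernel time per operator type so the quantized and unquantized models can be compared op by op. It assumes the trace is a list of events in which node events have `"cat": "Node"`, a name ending in `_kernel_time`, a `dur` in microseconds, and `args.op_name`; treat those field names as assumptions and inspect a trace from your own version first. `time_per_op` and `profile_model` are hypothetical helper names.

```python
import json
from collections import defaultdict


def time_per_op(events):
    """Sum kernel durations per operator type, largest total first."""
    totals = defaultdict(int)
    for event in events:
        op_name = event.get("args", {}).get("op_name")
        is_kernel = event.get("name", "").endswith("_kernel_time")
        if event.get("cat") == "Node" and is_kernel and op_name:
            totals[op_name] += event.get("dur", 0)
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))


def profile_model(model_path, feed):
    """Run the model once with profiling enabled and return time per op."""
    # Imported lazily so time_per_op() can be used on saved traces alone.
    import onnxruntime

    options = onnxruntime.SessionOptions()
    options.enable_profiling = True
    session = onnxruntime.InferenceSession(
        model_path, sess_options=options, providers=["CPUExecutionProvider"]
    )
    session.run(None, feed)
    profile_path = session.end_profiling()  # path of the JSON trace file
    with open(profile_path) as f:
        return time_per_op(json.load(f))
```

If the quantized model's totals are dominated by operators that do not appear in the unquantized profile, that would suggest conversion overhead rather than slow integer kernels, though this is only a starting point for diagnosis.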
@yufenglee Is there a GPU version of ort-nightly? For the stable release, the GPU build can be installed with "pip install onnxruntime-gpu".
Describe the bug
The statically quantized ResNet model is slower than the original (unquantized) one.
I compared the statically quantized ResNet model with the original one from the E2E example code. Here are some benchmark results from my computer:
resnet50_v1.onnx / 3.03125
resnet50_v1.quant.onnx / 6.484375
Here is the code:
Urgency
Quantization performance
System information
Windows Server
onnx==1.7.0
onnxruntime==1.4.0
Python version: 3.6
To Reproduce
See bug description