microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

Static quantized resnet is slower than the raw one #5319

Open rosrad opened 4 years ago

rosrad commented 4 years ago

Describe the bug
The statically quantized ResNet is slower than the raw one.

I compared the statically quantized ResNet model with the raw one from the E2E example code. Here are some benchmark numbers from my machine (total process CPU time in seconds, as returned by perf_test below):

resnet50_v1.onnx / 3.03125
resnet50_v1.quant.onnx / 6.484375

Here is the code:


import time

import numpy as np
import onnxruntime


def perf_test(onnx_path, num=10):
    sess = onnxruntime.InferenceSession(onnx_path)
    name = sess.get_inputs()[0].name
    data = np.random.random((1, 224, 224, 3)).astype(np.float32)
    ort_in = {name: data}
    # warm up
    sess.run(None, ort_in)

    latency = 0
    for i in range(num):
        start_32 = time.process_time()
        ort_out = sess.run(None, ort_in)
        latency = latency + time.process_time() - start_32
    return latency

def latency_benchmark(num=10):
    models = [
        "resnet50_v1.onnx",
        "resnet50_v1.quant.onnx",
    ]
    latency_dict = {model: perf_test(model, num) for model in models}

    for m, l in latency_dict.items():
        print(f"{m} / {l}")

Urgency
Quantization performance.

System information
OS: Windows Server
onnx==1.7.0
onnxruntime==1.4.0
Python version: 3.6

To Reproduce
See bug description.
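
For reference, a minimal sketch (not the exact E2E script) of how the quantized model benchmarked above could have been produced with onnxruntime's static quantization API; the RandomDataReader below feeds random tensors and is only a placeholder for real calibration images:

import numpy as np
import onnxruntime
from onnxruntime.quantization import CalibrationDataReader, quantize_static


class RandomDataReader(CalibrationDataReader):
    # Placeholder calibration reader: yields a few random NHWC tensors.
    def __init__(self, model_path, count=8):
        input_name = onnxruntime.InferenceSession(model_path).get_inputs()[0].name
        self._data = iter(
            {input_name: np.random.random((1, 224, 224, 3)).astype(np.float32)}
            for _ in range(count)
        )

    def get_next(self):
        return next(self._data, None)


# Produce the statically quantized model that perf_test compares against the raw one.
quantize_static("resnet50_v1.onnx", "resnet50_v1.quant.onnx",
                RandomDataReader("resnet50_v1.onnx"))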

emilianavt commented 3 years ago

I also get lower performance for the quantized version, running single threaded on CPU (2950X). This is with onnxruntime 1.5.2 installed with pip3, running on Debian Linux.

import os
os.environ["OMP_NUM_THREADS"] = "1"

import numpy as np
import onnxruntime
import time

def benchmark(model_path):
    options = onnxruntime.SessionOptions()
    options.inter_op_num_threads = 1
    options.intra_op_num_threads = 1
    options.execution_mode = onnxruntime.ExecutionMode.ORT_SEQUENTIAL
    options.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL
    options.optimized_model_filepath = ""
    session = onnxruntime.InferenceSession(model_path, sess_options=options)
    input_name = session.get_inputs()[0].name

    outputs = None
    total = 0.0
    runs = 5
    input_data = np.zeros((1,224,224,3), np.float32)
    for i in range(runs):
        start = time.perf_counter()
        outputs = session.run([], {input_name: input_data})
        end = (time.perf_counter() - start) * 1000
        total += end
        print(f"{end:.2f}ms")
    total /= runs
    print(f"Avg: {total:.2f}ms")

print("Unquantized")
benchmark("resnet50_v1.onnx")
print("Quantized")
benchmark("resnet50_v1_quant.onnx")

Results:

Unquantized
117.04ms
118.80ms
118.28ms
118.15ms
122.48ms
Avg: 122.48ms
Quantized
140.96ms
140.86ms
139.76ms
139.81ms
139.58ms
Avg: 139.58ms
emilianavt commented 3 years ago

This is still the case with 1.6.0 as well.

hdmjdp commented 3 years ago

Has anyone solved this?

kanthagirish-rit commented 3 years ago

Running the calibration and quantization script from the E2E example here for both the mobilenet and resnet models results in a ValueError.

Output for Resnet:

python run.py --input_model resnet50-v1-9.onnx --output_model resnet50-v1-9.quant.onnx --calibrate_dataset ./test_images/
Calibrated,quantized parameters calculated and returned.
Warning: The original model opset version is 9, which does not support quantization. Please update the model to opset >= 11. Updating the model automatically to opset 11. Please verify the quantized model.
Traceback (most recent call last):
  File "run.py", line 112, in <module>
    main()
  File "run.py", line 101, in main
    quantize_static(input_model_path, output_model_path, dr)
  File "/home/ubuntu/.python3_pytorch16/lib/python3.8/site-packages/onnxruntime/quantization/quantize.py", line 186, in quantize_static
    quantizer.quantize_model()
  File "/home/ubuntu/.python3_pytorch16/lib/python3.8/site-packages/onnxruntime/quantization/onnx_quantizer.py", line 290, in quantize_model
    op_quantizer.quantize()
  File "/home/ubuntu/.python3_pytorch16/lib/python3.8/site-packages/onnxruntime/quantization/operators/matmul.py", line 69, in quantize
    self.quantizer.quantize_inputs(node, [0, 1])
  File "/home/ubuntu/.python3_pytorch16/lib/python3.8/site-packages/onnxruntime/quantization/onnx_quantizer.py", line 822, in quantize_inputs
    quantize_input_nodes = self._get_quantize_input_nodes(node, input_index, self.input_qType)
  File "/home/ubuntu/.python3_pytorch16/lib/python3.8/site-packages/onnxruntime/quantization/onnx_quantizer.py", line 585, in _get_quantize_input_nodes
    raise ValueError(
ValueError: Quantization parameters are not specified for param flatten_473.In static mode quantization params for inputs and outputs of nodes to be quantized are required.
(.python3_pytorch16)

Output for mobilenet:

python run.py --input_model mobilenetv2-7.onnx --output_model mobilenetv2-7.quant.onnx --calibrate_dataset ./test_images/
Calibrated,quantized parameters calculated and returned.
Warning: The original model opset version is 10, which does not support node fusions. Please update the model to opset >= 11 for better performance.
Traceback (most recent call last):
  File "run.py", line 112, in <module>
    main()
  File "run.py", line 101, in main
    quantize_static(input_model_path, output_model_path, dr)
  File "/home/ubuntu/.python3_pytorch16/lib/python3.8/site-packages/onnxruntime/quantization/quantize.py", line 186, in quantize_static
    quantizer.quantize_model()
  File "/home/ubuntu/.python3_pytorch16/lib/python3.8/site-packages/onnxruntime/quantization/onnx_quantizer.py", line 290, in quantize_model
    op_quantizer.quantize()
  File "/home/ubuntu/.python3_pytorch16/lib/python3.8/site-packages/onnxruntime/quantization/operators/matmul.py", line 69, in quantize
    self.quantizer.quantize_inputs(node, [0, 1])
  File "/home/ubuntu/.python3_pytorch16/lib/python3.8/site-packages/onnxruntime/quantization/onnx_quantizer.py", line 822, in quantize_inputs
    quantize_input_nodes = self._get_quantize_input_nodes(node, input_index, self.input_qType)
  File "/home/ubuntu/.python3_pytorch16/lib/python3.8/site-packages/onnxruntime/quantization/onnx_quantizer.py", line 585, in _get_quantize_input_nodes
    raise ValueError(
ValueError: Quantization parameters are not specified for param 472.In static mode quantization params for inputs and outputs of nodes to be quantized are required.

Are these versions of the ONNX models not prepared for quantization yet?
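
Not from this thread, but one possible workaround while waiting for a fix is to bump the model to opset 11 (as the quantizer's warning asks) with onnx's version converter before running the E2E script; this is only a sketch and assumes the converter handles these models cleanly:

import onnx
from onnx import version_converter

# Convert the opset-9 model to opset 11, then point run.py / quantize_static
# at the converted file instead of the original.
model = onnx.load("resnet50-v1-9.onnx")
converted = version_converter.convert_version(model, 11)
onnx.save(converted, "resnet50-v1-11.onnx")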

yufenglee commented 3 years ago

@kanthagirish-rit, a change was made in the master branch to support those models. Could you please try our nightly build ("pip install -i https://test.pypi.org/simple/ ort-nightly") or build from source?

kanthagirish-rit commented 3 years ago

@yufenglee Thanks for the quick update. The fix works for both mobilenetv2 and resnet50! I am no longer seeing slower inference times for the quantized versions with this fix. Is this final, or are more changes to come with respect to the slowness issue?

@rosrad and @emilianavt, this fix gives me better inference timings for the quantized version of mobilenetv2 (around 60 ms) than for the unquantized version (100 ms) on a Raspberry Pi 4 device. Do try this fix.

emilianavt commented 3 years ago

With ort-nightly I get matching inference times for the quantized and unquantized resnet50. A dynamically quantized version of the model from #5586 still experiences a significant slowdown (18 ms to 37 ms), but since it is dynamically quantized, it may be best left for a separate issue.

yufenglee commented 3 years ago

@kanthagirish-rit, yes, we are still working on improving quantization performance. @emilianavt, dynamic quantization is not recommended for CNN models.
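
For context, a minimal sketch of the dynamic quantization call being discussed (file names are placeholders): dynamic quantization only quantizes the weights offline and computes activation quantization parameters at run time, which is why it tends to suit MatMul-heavy models (RNNs, transformers) better than CNNs:

from onnxruntime.quantization import QuantType, quantize_dynamic

# Weights are stored as int8; activation scales/zero points are computed on each run.
quantize_dynamic("model.onnx", "model.dynamic_quant.onnx",
                 weight_type=QuantType.QInt8)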

emilianavt commented 3 years ago

Using a build of e9d03983fc63828f84a9dcc8538392e1b22b46bb including the fix for #5586 (thank you!), I have been able to run the model from that issue with static quantization instead of dynamic quantization, still getting a noticeable slowdown:

Unquantized
18.19ms
18.15ms
17.83ms
17.76ms
18.15ms
Avg: 18.15ms
Quantized
32.85ms
31.37ms
31.11ms
31.37ms
31.38ms
Avg: 31.38ms
emilianavt commented 3 years ago

The slowdown still exists in 1.7.0 installed from pip.

emilianavt commented 2 years ago

Using onnxruntime 1.11.0, the results from my previous comment still hold for both uint8 and int8 static quantization:

Unquantized
18.59ms
17.52ms
17.81ms
17.42ms
17.41ms
Avg: 17.41ms
Quantized
29.10ms
28.64ms
28.68ms
29.25ms
28.64ms
Avg: 28.64ms
Quantized int8
29.02ms
28.79ms
29.39ms
29.04ms
29.01ms
Avg: 29.01ms
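
For reference, a hedged sketch of how the uint8 and int8 static variants above can be selected through quantize_static's activation_type/weight_type arguments; make_reader is a hypothetical factory that returns a fresh CalibrationDataReader on each call (readers are single-pass):

from onnxruntime.quantization import QuantType, quantize_static


def quantize_both(model_path, make_reader):
    # uint8 activations / uint8 weights.
    quantize_static(model_path, model_path.replace(".onnx", ".quant_u8.onnx"),
                    make_reader(),
                    activation_type=QuantType.QUInt8, weight_type=QuantType.QUInt8)
    # int8 activations / int8 weights.
    quantize_static(model_path, model_path.replace(".onnx", ".quant_s8.onnx"),
                    make_reader(),
                    activation_type=QuantType.QInt8, weight_type=QuantType.QInt8)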
MrRace commented 2 years ago

The quantize_static model is still slower than the unquantized version.

MrRace commented 2 years ago

> @kanthagirish-rit, a change was made in the master branch to support those models. Could you please try our nightly build ("pip install -i https://test.pypi.org/simple/ ort-nightly") or build from source?

@yufenglee Does ort-nightly have a GPU version? As we know, when we want to install the GPU version of onnxruntime, we can pip install onnxruntime-gpu.
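
Not an answer from the thread, but a quick way to check which build is installed and whether a GPU execution provider is available:

import onnxruntime

# A GPU-enabled package lists "CUDAExecutionProvider" next to "CPUExecutionProvider".
print(onnxruntime.get_available_providers())
print(onnxruntime.get_device())  # "GPU" for GPU builds, otherwise "CPU"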