microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

[Performance] pytorch quantize_qat model exported to onnx inserts a transpose layer before and after the conv layers #21702

Open wangyunxiaa opened 3 months ago

wangyunxiaa commented 3 months ago

Describe the issue

I ran QAT quantization on a CNN model. When I export it to an ONNX model, I get slower inference than with the TorchScript QAT model:

- torchscript: 4.798517942428589 ms
- onnxruntime ORT model: 4.9489452838897705 ms

I checked the graphs of all the models and found that a Transpose layer had been inserted into the ONNX model:

(screenshot of the ONNX model graph)

The TorchScript model graph is as follows:

(screenshot of the TorchScript model graph)

I think the Transpose costs extra time during inference.

To reproduce

The model is defined as:

```python
import torch
import torch.nn as nn

m = nn.Sequential(
    nn.Conv2d(2, 64, 8),
    nn.ReLU(),
    nn.Conv2d(64, 128, 8),
    nn.ReLU(),
)

# Fuse
torch.quantization.fuse_modules(m, ['0', '1'], inplace=True)  # fuse first Conv-ReLU pair
torch.quantization.fuse_modules(m, ['2', '3'], inplace=True)  # fuse second Conv-ReLU pair

# step3. Insert stubs
m = nn.Sequential(torch.quantization.QuantStub(), *m, torch.quantization.DeQuantStub())
```

Then I optimize the exported ONNX model and convert it to ORT format:

```python
import subprocess

command = "python -m onnxruntime.quantization.preprocess --input model_int8.onnx --output model_int8_opt.onnx"
process = subprocess.run(command, shell=True, stdout=subprocess.PIPE)

command = "python -m onnxruntime.tools.convert_onnx_models_to_ort model_int8_opt.onnx"
process = subprocess.run(command, shell=True, stdout=subprocess.PIPE)
```

Urgency

No response

Platform

Linux

OS Version

3.10.0-1160.119.1.el7.x86_64

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.18.1

ONNX Runtime API

Python

Architecture

X64

Execution Provider

Default CPU

Execution Provider Library Version

No response

Model File

https://github.com/wangyunxiaa/model/blob/master/model_int8.onnx
https://github.com/wangyunxiaa/model/blob/master/model_int8_opt.onnx
https://github.com/wangyunxiaa/model/blob/master/model_int8_opt.ort

Is this a quantized model?

Yes

yufenglee commented 3 months ago

What hardware did you run on? Can you make your model input and output with NHWC format?

wangyunxiaa commented 3 months ago

> What hardware did you run on? Can you make your model input and output with NHWC format?

Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz.

I cannot feed NHWC-format data, because the original model is defined with torch.nn.Conv2d:

(screenshot of the model definition)

When I export, I can only supply input in the same data format.
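One way to export a graph whose input is NHWC would be to hide the NCHW model behind a single permute at the boundary; this is a hypothetical sketch (the NHWCWrapper name and approach are not from this thread), and it still leaves one layout conversion in the graph:

```python
import torch
import torch.nn as nn

# Hypothetical wrapper (assumption, not from this thread): expose an NHWC
# graph input and permute to NCHW internally, since nn.Conv2d expects NCHW.
class NHWCWrapper(nn.Module):
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, x_nhwc):
        x_nchw = x_nhwc.permute(0, 3, 1, 2)  # NHWC -> NCHW
        return self.model(x_nchw)

# Usage sketch: export the wrapped model with an NHWC dummy input,
# e.g. (1, 32, 32, 2) for a model whose first layer is nn.Conv2d(2, 64, 8).
# wrapped = NHWCWrapper(m_int8)
# torch.onnx.export(wrapped, torch.randn(1, 32, 32, 2), "model_int8_nhwc.onnx")
```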

wangyunxiaa commented 3 months ago

I found that PyTorch supports the NHWC data format via torch.channels_last, using m = m.to(memory_format=torch.channels_last) together with x = x.to(memory_format=torch.channels_last) on the input. But with this setting the input's logical shape is still NCHW, so the model exported to ONNX still has the Transpose layer, because the actual input shape is NCHW. This link says: "All PyTorch operators are written to take NCHW as dimensions order. There is no way to change it (you can only change memory format - aka how tensor laid in memory)." https://discuss.pytorch.org/t/how-to-convert-a-pre-trained-model-from-nchw-to-nhwc-format/97589/11
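A small sketch illustrating that point: channels_last changes only how the tensor is laid out in memory, not its logical NCHW shape (the tensor sizes here are arbitrary):

```python
import torch

x = torch.randn(1, 2, 32, 32)                    # logical shape: NCHW
x_cl = x.to(memory_format=torch.channels_last)

print(x_cl.shape)     # torch.Size([1, 2, 32, 32]) -- logical shape unchanged
print(x.stride())     # (2048, 1024, 32, 1) -- contiguous NCHW strides
print(x_cl.stride())  # (2048, 1, 64, 2)    -- channels innermost in memory
```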

yufenglee commented 3 months ago

I profiled the model, and it shows that the Transpose, DequantizeLinear, and QuantizeLinear nodes take ~34% of the whole model's time. However, the overhead of Transpose + Q/DQ is fixed. On my local box, the model you shared takes only about 0.3 ms. Does your model only have 2 Conv nodes?

```
------ Top CPU Kernel Times ------
name              duration  pct    count  cumulative_pct  cumulative_dur
QLinearConv       575251    66.85  3018   66.85           575251
Transpose         134695    15.65  3018   82.50           709946
DequantizeLinear  97288     11.31  1509   93.81           807234
QuantizeLinear    53269     6.19   1509   100.00          860503
```
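For anyone wanting to reproduce this kind of breakdown, ONNX Runtime's built-in profiler can be enabled from the Python API; a minimal sketch (the model file name is taken from this thread, everything else is generic):

```python
import onnxruntime as ort

opts = ort.SessionOptions()
opts.enable_profiling = True  # emit a JSON trace with per-kernel timings

sess = ort.InferenceSession("model_int8_opt.onnx", sess_options=opts,
                            providers=["CPUExecutionProvider"])

# ... run inference a number of times here ...

trace_file = sess.end_profiling()  # returns the path of the written trace
print(trace_file)
```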

wangyunxiaa commented 3 months ago

Yes, that is the whole model. When I run it, I restrict it to a single CPU core:

```python
from os import environ

environ["OMP_NUM_THREADS"] = '1'
environ["OMP_WAIT_POLICY"] = 'ACTIVE'
environ["MKL_NUM_THREADS"] = "1"
environ["OPENBLAS_NUM_THREADS"] = "1"

import onnxruntime as ort

opts = ort.SessionOptions()
opts.inter_op_num_threads = 1
opts.intra_op_num_threads = 1
```

So the Transpose layer accounts for ~15% of the total time. Is the Transpose layer generated when converting the PyTorch QAT model to ONNX necessary?

github-actions[bot] commented 2 months ago

This issue has been automatically marked as stale due to inactivity and will be closed in 30 days if no further activity occurs. If further support is needed, please provide an update and/or more details.