Open wangyunxiaa opened 2 months ago
What hardware did you run on? Can you make your model input and output with NHWC format?
Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz. I cannot feed NHWC-format data, because the original model is defined with torch.nn.Conv2d; when I export, I can only pass input in the same (NCHW) data format.
I found that PyTorch supports the NHWC data format through `torch.channels_last`:

```python
m = m.to(memory_format=torch.channels_last)
```

but the input can only be converted the same way:

```python
x = x.to(memory_format=torch.channels_last)
```

With this setting the input's shape is still NCHW, so the model exported to ONNX still has a Transpose layer, because the logical input shape remains NCHW.
This link says: "All PyTorch operators are written to take NCHW as dimensions order. There is no way to change it (you can only change memory format - aka how tensor laid in memory)."
https://discuss.pytorch.org/t/how-to-convert-a-pre-trained-model-from-nchw-to-nhwc-format/97589/11
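For illustration, a minimal sketch of what the linked thread describes, using a hypothetical toy `Conv2d` rather than the issue's model: `channels_last` changes only how the tensor is laid out in memory, so the logical shape (and hence the exported ONNX input) stays NCHW.

```python
import torch
import torch.nn as nn

# Hypothetical toy model for illustration; channels_last only changes the
# memory layout (strides), not the logical tensor shape.
m = nn.Conv2d(2, 64, 8).to(memory_format=torch.channels_last)

x = torch.randn(1, 2, 128, 128)               # logical shape is NCHW
x = x.to(memory_format=torch.channels_last)   # strides change, shape does not

print(x.shape)                                             # torch.Size([1, 2, 128, 128]) -- still NCHW
print(x.is_contiguous(memory_format=torch.channels_last))  # True

y = m(x)                                      # output is also NCHW-shaped
print(y.shape)
```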
I profiled the model and it shows that Transpose, DequantizeLinear and QuantizeLinear take ~34% of the whole model. However, the overhead of Transpose + Q/DQ is fixed. On my local box, the model you shared takes only about 0.3 ms. Does your model only have 2 Conv nodes?
------ Top CPU Kernel Times ------

| name | duration | pct | count | cumulative_pct | cumulative_dur |
|---|---|---|---|---|---|
| QLinearConv | 575251 | 66.85 | 3018 | 66.85 | 575251 |
| Transpose | 134695 | 15.65 | 3018 | 82.50 | 709946 |
| DequantizeLinear | 97288 | 11.31 | 1509 | 93.81 | 807234 |
| QuantizeLinear | 53269 | 6.19 | 1509 | 100.00 | 860503 |
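(For reference, one way to obtain a per-kernel breakdown like this is ONNX Runtime's built-in profiler; a minimal sketch follows, where the input name and shape are assumptions, not taken from the issue's model.)

```python
import json
from collections import defaultdict

import numpy as np
import onnxruntime as ort

opts = ort.SessionOptions()
opts.enable_profiling = True  # write a JSON trace of per-node execution times

sess = ort.InferenceSession("model_int8_opt.onnx", opts, providers=["CPUExecutionProvider"])
x = np.random.randn(1, 2, 64, 64).astype(np.float32)  # assumed input shape
for _ in range(100):
    sess.run(None, {"input": x})                        # assumed input name

profile_file = sess.end_profiling()

# Aggregate durations per op type from the chrome-trace JSON
totals = defaultdict(int)
with open(profile_file) as f:
    for ev in json.load(f):
        if ev.get("cat") == "Node":
            totals[ev["args"].get("op_name", "unknown")] += ev["dur"]

for op, dur in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(op, dur)
```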
Yes, the model is the whole thing. When I run the model, I restrict it to a single CPU core:

```python
from os import environ
import onnxruntime as ort

environ["OMP_NUM_THREADS"] = "1"
environ["OMP_WAIT_POLICY"] = "ACTIVE"
environ["MKL_NUM_THREADS"] = "1"
environ["OPENBLAS_NUM_THREADS"] = "1"

opts = ort.SessionOptions()
opts.inter_op_num_threads = 1
opts.intra_op_num_threads = 1
```

So the Transpose layer accounts for about 15% of the total time. Is this layer, which is generated when converting the PyTorch QAT model to ONNX, necessary?
This issue has been automatically marked as stale due to inactivity and will be closed in 30 days if no further activity occurs. If further support is needed, please provide an update and/or more details.
Describe the issue
I did QAT quantization on a CNN model; when I export it to an ONNX model, I get slower inference than with the TorchScript QAT model. The results are:
torchscript: 4.798517942428589 ms
onnxruntime ort mode: 4.9489452838897705 ms
I have checked the graphs of all the models and found that a Transpose layer was inserted in the ONNX model.
The TorchScript model graph is as follows: [graph image in the original issue]
I think the Transpose costs extra time during inference.
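A minimal sketch of how such a side-by-side timing might be collected (the model paths, input name, and input shape below are placeholders, not the exact benchmark script used here):

```python
import time

import numpy as np
import torch
import onnxruntime as ort

x = torch.randn(1, 2, 64, 64)                    # assumed input shape
ts_model = torch.jit.load("model_int8_ts.pt")    # hypothetical TorchScript QAT model path
sess = ort.InferenceSession("model_int8_opt.onnx", providers=["CPUExecutionProvider"])

def bench(fn, iters=1000):
    fn()                                          # warm-up run
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters * 1000  # average ms per run

print("torchscript:", bench(lambda: ts_model(x)), "ms")
print("onnxruntime:", bench(lambda: sess.run(None, {"input": x.numpy()})), "ms")  # assumed input name
```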
To reproduce
The model is defined as:

```python
import torch
import torch.nn as nn

m = nn.Sequential(
    nn.Conv2d(2, 64, 8),
    nn.ReLU(),
    nn.Conv2d(64, 128, 8),
    nn.ReLU(),
)

# Fuse
torch.quantization.fuse_modules(m, ['0', '1'], inplace=True)  # fuse first Conv-ReLU pair
torch.quantization.fuse_modules(m, ['2', '3'], inplace=True)  # fuse second Conv-ReLU pair

# Insert stubs
m = nn.Sequential(torch.quantization.QuantStub(), *m, torch.quantization.DeQuantStub())
```
Then preprocess and convert to ORT format:

```python
import subprocess

command = "python -m onnxruntime.quantization.preprocess --input model_int8.onnx --output model_int8_opt.onnx"
process = subprocess.run(command, shell=True, stdout=subprocess.PIPE)

command = "python -m onnxruntime.tools.convert_onnx_models_to_ort model_int8_opt.onnx"
process = subprocess.run(command, shell=True, stdout=subprocess.PIPE)
```
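For completeness, a minimal sketch of loading the converted `.ort` model with the single-thread session options quoted earlier (the input name and shape are assumptions):

```python
import numpy as np
import onnxruntime as ort

opts = ort.SessionOptions()
opts.inter_op_num_threads = 1
opts.intra_op_num_threads = 1

# ORT-format models load through the same InferenceSession API
sess = ort.InferenceSession("model_int8_opt.ort", opts, providers=["CPUExecutionProvider"])

x = np.random.randn(1, 2, 64, 64).astype(np.float32)  # assumed input shape
outputs = sess.run(None, {"input": x})                 # assumed input name
print(outputs[0].shape)
```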
Urgency
No response
Platform
Linux
OS Version
3.10.0-1160.119.1.el7.x86_64
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
1.18.1
ONNX Runtime API
Python
Architecture
X64
Execution Provider
Default CPU
Execution Provider Library Version
No response
Model File
https://github.com/wangyunxiaa/model/blob/master/model_int8.onnx
https://github.com/wangyunxiaa/model/blob/master/model_int8_opt.onnx
https://github.com/wangyunxiaa/model/blob/master/model_int8_opt.ort
Is this a quantized model?
Yes