ColainCYY opened this issue 1 year ago
In the Conv2d model, the Conv is optimized to an NCHWc Conv. If you click the black Conv operator you can see the domain is something like "com.microsoft", so it is a contrib op in ONNX Runtime rather than a standard ONNX operator. NCHWc is an optimized memory layout that can make the Conv run faster. Look at the ReorderInput / ReorderOutput operators: that is where the memory layout is converted. The conversion to NCHWc has some requirements, and apparently the 1-D Conv layers do not meet them, so they are not converted to NCHWc operators.
Thanks very much!
Can I force the Conv1d to be optimized to NCHWc? In other words, is it possible to run Conv1d as fast as Conv2d?
If Conv1d is slow, just promote it to Conv2d.
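The promotion above can be sketched as follows. This is a minimal illustration (channel sizes and kernel width are made up, not taken from the issue's models): a `Conv1d` is replaced by a `Conv2d` with a `(1, k)` kernel, with the input unsqueezed into NCHW so the 2-D path, and hence the NCHWc optimization, can kick in. The two produce identical outputs.

```python
import torch
import torch.nn as nn

# Illustrative sizes, not from the issue's models.zip
in_ch, out_ch, k = 8, 16, 3

conv1d = nn.Conv1d(in_ch, out_ch, k, padding=1)

# Equivalent Conv2d: kernel (1, k), so the extra H dimension is a no-op.
conv2d = nn.Conv2d(in_ch, out_ch, (1, k), padding=(0, 1))
with torch.no_grad():
    conv2d.weight.copy_(conv1d.weight.unsqueeze(2))  # (O, I, k) -> (O, I, 1, k)
    conv2d.bias.copy_(conv1d.bias)

x = torch.randn(1, in_ch, 10)           # NCW, a 10-token sequence
y1 = conv1d(x)                          # (1, out_ch, 10)
y2 = conv2d(x.unsqueeze(2)).squeeze(2)  # NCW -> NCHW -> conv -> NCW
assert torch.allclose(y1, y2, atol=1e-6)
```

Exporting the `Conv2d` version to ONNX should then let the graph optimizer apply the NCHWc conversion that the 1-D layout misses.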
Describe the issue
I have two CNN models exported by PyTorch 1.13.0+cpu. Their network structures are identical except for the convolution operations (Conv1d vs. Conv2d). Surprisingly, the Conv2d model takes 0.05 ms to process a 10-token sequence, while the Conv1d model takes 0.07 ms. With multi-threading or longer inputs, the gap grows even larger.
Here are the original graphs:![ori-models](https://github.com/microsoft/onnxruntime/assets/30071492/05e7107b-388c-40e6-aa13-1c6dbe0c8a7f)
My graph_optimization_level is the default ORT_ENABLE_ALL. I also set opts.optimized_model_filepath to dump the optimized graphs:![optim-models](https://github.com/microsoft/onnxruntime/assets/30071492/be2f1db9-57dd-4606-9697-b2ab95ca8d1b)
The Conv1d one is optimized into "FusedConv", while the Conv2d one remains "Conv". I'm really confused. Is FusedConv actually slower than Conv?
To reproduce
Here is a demo illustrating the difference between conv1d and conv2d.
Here is my code for the test:
Urgency
No response
Platform
Windows
OS Version
Windows 10
ONNX Runtime Installation
Built from Source
ONNX Runtime Version or Commit ID
1.14.1
ONNX Runtime API
Python
Architecture
X64
Execution Provider
Default CPU
Execution Provider Library Version
No response
Model File
models.zip
Is this a quantized model?
No