microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

[Performance] Huge gap between nn.Conv1d() and nn.Conv2d() - models exported by PyTorch #16047

Open ColainCYY opened 1 year ago

ColainCYY commented 1 year ago

Describe the issue

I have two CNN models exported by PyTorch 1.13.0+cpu. They are identical in network structure except for the convolution operations. Surprisingly, the conv2d model takes 0.05 ms to process a 10-token sequence, while the conv1d model takes 0.07 ms. With multi-threading or longer inputs, the gap grows even larger.

Here are the original graphs: ori-models

My graph_optimization_level is the default, ORT_ENABLE_ALL. I have also set opts.optimized_model_filepath to dump the optimized graphs: optim-models

The conv1d model is optimized into "FusedConv", while the conv2d one remains "Conv". I'm really confused. Is FusedConv actually slower than Conv?

To reproduce

Here is a demo showing the difference between conv1d and conv2d.

# Common part.
import torch
import torch.nn as nn

emb = nn.Embedding(100, 128)
x = torch.tensor([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
e = emb(x)  # torch.Size([1, 10, 128])

# For conv1d: (N, C, L) layout.
e1 = e.transpose(1, 2)  # torch.Size([1, 128, 10])
conv1 = nn.Conv1d(128, 128, 5)
y1 = conv1(e1)  # torch.Size([1, 128, 6])

# For conv2d: add a dummy height dimension, (N, C, 1, L).
e2 = e.transpose(1, 2).unsqueeze(2)  # torch.Size([1, 128, 1, 10])
conv2 = nn.Conv2d(128, 128, (1, 5))
y2 = conv2(e2)  # torch.Size([1, 128, 1, 6])
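
For reference, a minimal export sketch for the conv1d branch; the wrapper module and the output name "probs" are illustrative, not the exact code behind the attached models, which also include the layers that produce the probabilities:

import torch
import torch.nn as nn

class Conv1dDemo(nn.Module):
    # Minimal wrapper around the conv1d branch above, for export only.
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(100, 128)
        self.conv = nn.Conv1d(128, 128, 5)

    def forward(self, input_ids):
        e = self.emb(input_ids).transpose(1, 2)  # (1, 128, 10)
        return self.conv(e)                      # (1, 128, 6)

torch.onnx.export(Conv1dDemo(), torch.tensor([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]]),
                  "conv1.onnx", input_names=["input_ids"], output_names=["probs"])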

Here is my test code:

import time

import numpy
import onnxruntime as ort

opts = ort.SessionOptions()
opts.intra_op_num_threads = 1
opts.inter_op_num_threads = 1
opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

ort_session = ort.InferenceSession("conv1.onnx", sess_options=opts)
# ort_session = ort.InferenceSession("conv2.onnx", sess_options=opts)
input_ids = numpy.array([[2, 548, 470, 478, 474, 559, 548, 453, 421, 3]], dtype=numpy.int32)
print(ort_session.run(["probs"], {"input_ids": input_ids}))

t0 = time.time()
for i in range(1000):
    outputs = ort_session.run(["probs"], {"input_ids": input_ids})
t1 = time.time()
print(t1 - t0)

Urgency

No response

Platform

Windows

OS Version

Windows 10

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

1.14.1

ONNX Runtime API

Python

Architecture

X64

Execution Provider

Default CPU

Execution Provider Library Version

No response

Model File

models.zip

Is this a quantized model?

No

duanqn commented 1 year ago

In the Conv2d model, the Conv is optimized to NCHWc Conv. If you click the black Conv operator you can see the domain is something like "com.microsoft". So it is a contrib op in ONNXRuntime rather than a standard ONNX operator. The NCHWc layout is an optimized memory layout which can make the Conv run faster. Look at the ReorderInput / ReorderOutput operators. You can see the memory layout being converted there. The conversion to NCHWc has some requirements. Apparently the 1d Conv layers do not match the requirements so they are not converted to NCHWc operators.
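
One way to verify this without clicking through the graph viewer is to inspect the saved optimized model with the onnx Python package; the file path below is illustrative:

import onnx

# Walk the optimized graph saved via opts.optimized_model_filepath and print
# each node's op type and domain; contrib ops report e.g. "com.microsoft",
# while standard ONNX operators have an empty domain.
model = onnx.load("conv2_optimized.onnx")  # illustrative path
for node in model.graph.node:
    print(node.op_type, node.domain or "ai.onnx")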

ColainCYY commented 1 year ago

> In the Conv2d model, the Conv is optimized to NCHWc Conv. If you click the black Conv operator you can see the domain is something like "com.microsoft". […]

Thanks very much!

ColainCYY commented 1 year ago

> In the Conv2d model, the Conv is optimized to NCHWc Conv. If you click the black Conv operator you can see the domain is something like "com.microsoft". […]

Can I force the Conv1d to be optimized to NCHWc? In other words, is it possible to run Conv1d as fast as Conv2d?

duanqn commented 1 year ago

If Conv1d is slow, just promote it to Conv2d.
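
A minimal sketch of that promotion, assuming a plain nn.Conv1d(in_ch, out_ch, k) with no stride, padding, or dilation (the class name is illustrative); it mirrors the conv2d branch of the reproduction above by treating the sequence as a one-row image:

import torch.nn as nn

class PromotedConv1d(nn.Module):
    # Stand-in for nn.Conv1d(in_ch, out_ch, k) built on Conv2d, so that the
    # exported graph is eligible for the NCHWc Conv optimization.
    def __init__(self, in_ch, out_ch, k):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, (1, k))

    def forward(self, x):              # x: (N, C, L)
        y = self.conv(x.unsqueeze(2))  # (N, C, 1, L) -> (N, C_out, 1, L-k+1)
        return y.squeeze(2)            # back to (N, C_out, L-k+1)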