microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

[Performance] WebAssembly 1x1 Conv almost 4x slower than native #15483

Open · starsky opened this issue 1 year ago

starsky commented 1 year ago

Describe the issue

I am trying to optimize my models for ONNX Runtime Web (WebAssembly). I ran some tests on the speed difference of the Conv operator between the web and native ONNX Runtime.

I created a model that performs a 1x1 conv and progressively added more 1x1 conv layers, from 1 to 50, measuring inference time for native and WebAssembly at each step. From this I estimated that on my machine the constant overheads (e.g. data loading) are ~0.17 ms for native vs ~0.3 ms for web.

But the time for a single 1x1 conv layer is 0.026 ms for native vs 0.1 ms for web, which is almost 4x slower. Is this expected, or are there ways to improve the speed? The model is very simple, and I used ONNX Simplifier to optimize it. I have struggled to find in the documentation what kind of performance loss is expected.
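(As a sanity check on these estimates: a 50-layer model should then take about 0.17 + 50 × 0.026 ≈ 1.47 ms natively vs 0.3 + 50 × 0.1 ≈ 5.3 ms on web, a ratio of roughly 3.6x, which is consistent with the 3.75x I measure on the attached model below.)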

To reproduce

example.onnx.gz

Here is a 50-layer 1x1 conv model. In my case it is 3.75x slower on web than in the native ONNX Runtime.

I use this code to run the model on web: https://github.com/microsoft/onnxruntime-inference-examples/tree/main/js/quick-start_onnxruntime-web-script-tag
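Below is a minimal sketch of such a timing loop in the browser, modeled on that quick-start. The input name (`input`) and shape (`[1, 3, 144, 256]`, i.e. a 256x144 input with an assumed channel count of 3) are placeholders and must match the actual model:

```html
<script src="https://cdn.jsdelivr.net/npm/onnxruntime-web@1.14.0/dist/ort.min.js"></script>
<script>
  async function benchmark() {
    // load the attached model (served next to this page)
    const session = await ort.InferenceSession.create('example.onnx');
    // assumed input name and shape; adjust to the real model
    const x = new ort.Tensor('float32', new Float32Array(1 * 3 * 144 * 256), [1, 3, 144, 256]);
    // warm-up runs so one-time initialization does not skew the numbers
    for (let i = 0; i < 10; i++) await session.run({ input: x });
    const runs = 100;
    const t0 = performance.now();
    for (let i = 0; i < runs; i++) await session.run({ input: x });
    console.log(`avg: ${((performance.now() - t0) / runs).toFixed(3)} ms/run`);
  }
  benchmark();
</script>
```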

Urgency

No response

Platform

Web Browser

OS Version

Linux

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

ONNX Runtime Web v1.14.0

ONNX Runtime API

JavaScript

Architecture

X64

Execution Provider

Default CPU

Execution Provider Library Version

No response

Model File

No response

Is this a quantized model?

No

wschin commented 1 year ago

It's possible that the underlying code is slower on simple or small shapes. Does your target model actually contain 1x1 convolutions, or was the 1x1 convolution created just for this test? If the latter, I'd suggest focusing on the convolution ops in the target model.

starsky commented 1 year ago

@wschin I did the same experiment with 3x3 convs and got a similar speed difference (4x slower). I tested 1x1 and 3x3 convs because they are common building blocks in most architectures. Also, the size of the input tensor is 256x144, which is a realistic input size.

guschmue commented 1 year ago

WebAssembly is slower than native. It depends a lot on what you do, but 3-5x is common. For Conv the main reasons are that SIMD in wasm is 128-bit and that much more effort has gone into optimizing Conv in native. We will be doing some work to address the latter soon, but I'm not sure how much we can get out of it; I'd expect a factor of 2-3x slower.
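For what it's worth, one web-side knob that can narrow (though not close) the gap is wasm multi-threading. A minimal sketch, assuming the `ort.env.wasm` flags as they exist in onnxruntime-web 1.14; these must be set before the first session is created, and multi-threading only kicks in when the page is cross-origin isolated (SharedArrayBuffer available):

```js
// configure the wasm backend before creating any session (inside an async function)
ort.env.wasm.simd = true;     // use the 128-bit wasm SIMD build; on by default where supported
ort.env.wasm.numThreads = 4;  // requires SharedArrayBuffer, i.e. a cross-origin-isolated page
const session = await ort.InferenceSession.create('example.onnx');
```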