microsoft / onnxruntime

[Performance] Performance degradation while using dynamic axes #14863

Open ml5ah opened 1 year ago

ml5ah commented 1 year ago

Describe the issue

I have an object detection model that was trained with PyTorch and exported to ONNX. By default, the batch size for inference is set to 1, and we see an inference time of 720 ms.

As an experiment to see if increasing the batch size helps performance, we used dynamic axes and set the batch size at runtime. However, the per-inference latency actually gets worse, increasing to 800 ms.

Is this expected? If not, can you share some guidelines for speeding up inference on CPU?

To reproduce

Inference is run with ONNX Runtime 1.9.0 via the Java API.
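
Roughly, the call pattern looks like the sketch below (simplified; the model path, shapes, and output handling are placeholders, not the exact production code):

```java
import java.util.Map;

import ai.onnxruntime.OnnxTensor;
import ai.onnxruntime.OrtEnvironment;
import ai.onnxruntime.OrtSession;

public class ArrayInference {
    public static void main(String[] args) throws Exception {
        OrtEnvironment env = OrtEnvironment.getEnvironment();
        try (OrtSession session = env.createSession("detector.onnx", new OrtSession.SessionOptions())) {
            // Batch size chosen at runtime because the model was exported with a dynamic batch axis.
            int batch = 4;
            float[][][][] input = new float[batch][3][640][640]; // placeholder shape
            // ... fill 'input' with preprocessed image data ...

            String inputName = session.getInputNames().iterator().next();
            try (OnnxTensor tensor = OnnxTensor.createTensor(env, input);
                 OrtSession.Result result = session.run(Map.of(inputName, tensor))) {
                // Outputs are also read back as nested Java arrays.
                Object detections = result.get(0).getValue();
                System.out.println("output type: " + detections.getClass());
            }
        }
    }
}
```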

Urgency

No response

Platform

Windows

OS Version

10

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.9.0

ONNX Runtime API

Java

Architecture

X64

Execution Provider

Default CPU

Execution Provider Library Version

No response

Model File

No response

Is this a quantized model?

No

Craigacp commented 1 year ago

How are you preparing the batches on the Java side? The multidimensional array support is slow, so it's better to use a direct FloatBuffer to get data into ORT, and to use the get-buffer methods on the output tensors on the way back out.

Otherwise there can be differences in how ORT optimizes things for dynamic batch sizes.
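
For reference, a minimal sketch of the buffer-based path (the model path, input shape, and output handling below are placeholders):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;
import java.util.Map;

import ai.onnxruntime.OnnxTensor;
import ai.onnxruntime.OrtEnvironment;
import ai.onnxruntime.OrtSession;

public class BufferInference {
    public static void main(String[] args) throws Exception {
        OrtEnvironment env = OrtEnvironment.getEnvironment();
        try (OrtSession session = env.createSession("detector.onnx", new OrtSession.SessionOptions())) {
            // Placeholder shape: [batch, channels, height, width] with a runtime-chosen batch.
            long[] shape = {4, 3, 640, 640};
            int numElements = 4 * 3 * 640 * 640;

            // A direct, native-order buffer lets ORT read the data without going
            // through the multidimensional-array conversion step.
            FloatBuffer input = ByteBuffer.allocateDirect(numElements * Float.BYTES)
                    .order(ByteOrder.nativeOrder())
                    .asFloatBuffer();
            // ... write the preprocessed, batched image data into 'input' ...
            input.rewind();

            String inputName = session.getInputNames().iterator().next();
            try (OnnxTensor tensor = OnnxTensor.createTensor(env, input, shape);
                 OrtSession.Result result = session.run(Map.of(inputName, tensor))) {
                // Read the output back through the buffer API instead of getValue().
                FloatBuffer output = ((OnnxTensor) result.get(0)).getFloatBuffer();
                System.out.println("output elements: " + output.remaining());
            }
        }
    }
}
```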

ml5ah commented 1 year ago

@Craigacp As of now, I am using a multidimensional array. I see, thanks for the suggestion - I will try using a FloatBuffer as the input/output for OrtSession. This might be a difficult question, but compared to running one inference at a time with a multidimensional array, is there a rough idea of the speedup we should see?

Thanks a lot for your response!

Craigacp commented 1 year ago

That's a function of the model, data size, and hardware, so I'm not sure how to answer it. Bigger batches increase computational intensity, which helps because single-example batches are usually bottlenecked on the memory bandwidth needed to load the model weights; you can compute a lot of floating-point operations in the time it takes to load data from RAM into cache. But how much better it gets is very context dependent.
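
As a toy illustration with made-up numbers: if a layer's weights are 10 MB, a batch of 1 streams that 10 MB from memory for every single image, whereas a batch of 8 streams it once for 8 images, cutting the per-image weight traffic by 8x while the per-image arithmetic stays the same. Whether that translates into a large or negligible wall-clock win depends on where the model was bottlenecked to begin with.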

ml5ah commented 1 year ago

Yep, that makes total sense; I just wanted to check if there was some indication. Thanks again for the help. I'll post the outcome here after giving it a try.

ml5ah commented 1 year ago

Hi @Craigacp, thanks for the valuable suggestion! We saw a marked increase in inference speed with this change. I saw several other threads recommending this change as well.

I was wondering if this could be added to the official onnxruntime documentation as one of the best practices. cc @pranavsharma.

Craigacp commented 1 year ago

I'm working on a Java port of the C# Stable Diffusion tutorial, and once that's released I can write a similar tutorial that makes the best practices clear. My aim is to keep the Stable Diffusion example up to date with ORT Java as we bring in features like output pinning and IOBinding, but the open-source process is taking me a little while and I don't have bandwidth for other ORT work until that's done.