Open ml5ah opened 1 year ago
How are you preparing the batches on the Java side? The multidimensional array support is slow, so it's better to use a direct FloatBuffer to get data into ORT, and to use the getFloatBuffer method on the tensors on the way back out.
Otherwise there can be differences in how ORT optimizes things for dynamic batch sizes.
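The input side of this advice can be sketched as follows. This is a hypothetical helper (the class and method names are mine, not from ORT): it copies a batch into a direct, native-order FloatBuffer, which is the form that `OnnxTensor.createTensor(env, buffer, shape)` can consume without walking a nested Java array. Only the `java.nio` usage is shown; the ORT call itself is indicated in the usage note.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

/**
 * Illustrative sketch: flatten a [batchSize][featureSize] batch into a
 * direct FloatBuffer. Class and method names are hypothetical; only the
 * java.nio mechanics are demonstrated.
 */
public final class BatchFlattener {

    /** Copies the batch, row-major, into a direct native-order buffer. */
    public static FloatBuffer flatten(float[][] batch) {
        int batchSize = batch.length;
        int featureSize = batch[0].length;
        FloatBuffer buf = ByteBuffer
                .allocateDirect(batchSize * featureSize * Float.BYTES)
                .order(ByteOrder.nativeOrder()) // native byte order so ORT can read it directly
                .asFloatBuffer();
        for (float[] example : batch) {
            buf.put(example);                   // bulk copy of one example's floats
        }
        buf.rewind();                           // reset position so the consumer reads from the start
        return buf;
    }
}
```

The resulting buffer would then be handed to ORT along with the shape, e.g. `OnnxTensor.createTensor(env, buf, new long[]{batchSize, featureSize})`, assuming a 2-D input; a model with image inputs would use the same pattern with a 4-D shape.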
@Craigacp As of now, I am using multidimensional arrays. I see, thanks for the suggestion; I will try using a FloatBuffer as the input/output for OrtSession. This might be a difficult question, but compared to running one inference at a time with multidimensional arrays, is there a rough idea of the speedup we should see?
Thanks a lot for your response!
That's a function of the model, the data size, and the hardware, so I'm not sure how to answer it. Bigger batches increase computational intensity, which helps because single-example batches are usually bottlenecked on the memory bandwidth needed to load the model weights, and you can compute a lot of floats while waiting for data to load from RAM into cache. But how much better it gets is very context dependent.
Yep, that makes total sense; I just wanted to check if there was some indication. Thanks again for the help. I'll post the outcome here after giving it a try.
Hi @Craigacp, thanks for the valuable suggestion! We saw a marked increase in inference speed with this change. I saw several other threads as well recommending this change.
I was wondering if this could be added to the official onnxruntime documentation as one of the best practices. @pranavsharma.
I'm working on a Java port of the C# Stable Diffusion tutorial, and once that's released I can do a similar tutorial writeup which makes best practices clear. My aim is to keep the stable diffusion example up to date with ORT Java as we bring in features like output pinning and IOBinding, but the open source process is taking me a little while and I don't have bandwidth for other ORT work until that's done.
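On the way back out, `OnnxTensor.getFloatBuffer()` returns the tensor contents as one flat buffer, so the caller splits it back into per-example results. A minimal sketch of that split, using a hypothetical helper (only the `java.nio` side is shown; the buffer would come from the output tensor in real code):

```java
import java.nio.FloatBuffer;

/**
 * Illustrative sketch: split a flat output buffer (as returned by
 * OnnxTensor.getFloatBuffer()) back into one float[] per batch example.
 * The helper name is hypothetical.
 */
public final class BatchUnflattener {

    public static float[][] split(FloatBuffer flat, int batchSize, int featureSize) {
        float[][] rows = new float[batchSize][featureSize];
        for (int i = 0; i < batchSize; i++) {
            flat.get(rows[i]); // bulk copy of the next featureSize floats
        }
        return rows;
    }
}
```

This avoids the per-element reflection cost of asking ORT to materialize a nested Java array from the output tensor.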
Describe the issue
I have an object detection model that was trained using PyTorch and exported to ONNX. By default the batch size for inference is set to 1, and we see an inference time of 720 ms.
As an experiment to see if increasing the batch size helps performance, we used dynamic axes and set the batch size at runtime. However, the per-inference latency actually increased to 800 ms.
Is this expected? If not, can you share some guidelines for speeding up inference on CPU?
To reproduce
The inference is being done using onnxruntime 1.9.0 in Java.
Urgency
No response
Platform
Windows
OS Version
10
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
1.9.0
ONNX Runtime API
Java
Architecture
X64
Execution Provider
Default CPU
Execution Provider Library Version
No response
Model File
No response
Is this a quantized model?
No