Apologies if this issue has been mentioned before, I've spent a while looking through old performance and multithreading related issues and can't seem to find the answer.
Issue:
I have a LightGBM classifier converted to ONNX and am trying to run inference from multiple threads using the onnxruntime Python API.
These are the options I use to initialize the ONNX Runtime session:
import onnxruntime as ort
opts = ort.SessionOptions()
opts.intra_op_num_threads = 1
opts.inter_op_num_threads = 1
opts.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL
These options correctly limit each session run to a single thread, which is what I want. The machine I'm using has 64 cores.
Here's the code I'm using to test:
from multiprocessing.pool import ThreadPool
import threading
import time

import numpy as np

thread_count = 16
pool = ThreadPool(thread_count)

## Original pred
def f_python_onnx(test_x):
    print("Start Id", threading.get_native_id())
    for i in np.arange(500000):
        models['event'].run(None, {'input': test_x})[1]
    print("End Id", threading.get_native_id())

t0 = time.time()
pool.map(f_python_onnx, [test_x] * thread_count)
t_orig = time.time() - t0
print(t_orig)
Watching the system monitor, the right thing happens when thread_count is 3 or fewer. Beyond 3, the performance gains drop off sharply. I've also noticed that memory usage appears to be capped no matter how many threads I use, which I suspect is the bottleneck: it stays at about 7% of total memory regardless of the thread count, and I'm guessing this is what throttles performance as I add threads.
1 thread gives me 100% cpu usage
2 threads gives me 195% cpu usage
3 threads gives me 290% cpu usage
4 threads gives me 340% cpu usage
And then it gets worse from there.
Does anyone know how to fix this? My guess is that the solution is to somehow allocate more memory to the inference session, but it could be something else.
My current workaround is to use multiprocessing and initialize an InferenceSession each time a new worker is spun up, but this is slower than just using one inference session.
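To sketch that workaround concretely (all names here are my own, nothing from the onnxruntime API): a multiprocessing pool initializer lets each worker process build its model once and reuse it for every task, rather than constructing a new InferenceSession per task. For ONNX Runtime, the factory passed in would be a top-level function returning a callable that wraps session.run:

```python
import multiprocessing as mp

# Set once per worker process by the pool initializer, then reused
# for every task that worker handles.
_worker_model = None

def _init_worker(factory):
    """Build the per-process model; for ONNX Runtime the factory would
    construct an InferenceSession and return a callable around its run()."""
    global _worker_model
    _worker_model = factory()

def _run_task(x):
    return _worker_model(x)

def map_with_per_worker_model(factory, inputs, workers=4):
    """Map inputs over a pool where each worker holds one model instance."""
    with mp.Pool(workers, initializer=_init_worker,
                 initargs=(factory,)) as pool:
        return pool.map(_run_task, inputs)
```

The factory must be a picklable top-level function so it can be shipped to the workers via initargs; the model object it returns lives only inside each worker and is never pickled.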
Thanks.
To reproduce
Some code shared in the description.
Urgency
Fairly urgent for me personally.
Platform
Linux
OS Version
Ubuntu 20.04
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
Latest
ONNX Runtime API
Python
Architecture
X64
Execution Provider
Default CPU
Execution Provider Library Version
No response
Model File
No response
Is this a quantized model?
No