Apologies if this issue has been mentioned before, I've spent a while looking through old performance and multithreading related issues and can't seem to find the answer.
Issue:
I have a LightGBM classifier converted to ONNX and am trying to run inference from multiple threads using the onnxruntime Python API.
These are the options I use to initialize the ONNX Runtime session:
import onnxruntime as ort
opts = ort.SessionOptions()
opts.intra_op_num_threads = 1
opts.inter_op_num_threads = 1
opts.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL
These options correctly limit each session run to a single thread, which is what I want. The machine I'm using has 64 cores.
Here's the code I'm using to test:
from multiprocessing.pool import ThreadPool
import threading
import time

import numpy as np

thread_count = 16
pool = ThreadPool(thread_count)

## Original pred
def f_python_onnx(test_x):
    print("Start Id", threading.get_native_id())
    for i in np.arange(500000):
        models['event'].run(None, {'input': test_x})[1]
    print("End Id", threading.get_native_id())

t0 = time.time()
pool.map(f_python_onnx, [test_x] * thread_count)
t_orig = time.time() - t0
print(t_orig)
Watching the system monitor, the right thing happens when thread_count is 3 or fewer. Beyond 3, the performance gains drop off sharply. I've also noticed that memory usage appears to be capped no matter how many threads I use, which I suspect is the bottleneck: it stays at about 7% of total memory regardless of the thread count, and I'm guessing this is what throttles performance as I add threads.
1 thread gives me 100% cpu usage
2 threads gives me 195% cpu usage
3 threads gives me 290% cpu usage
4 threads gives me 340% cpu usage
And then it gets worse from there.
Does anyone know how to fix this? My guess is that the solution is to somehow allocate more memory to the inference session, but it could be something else.
My current workaround is to use multiprocessing and initialize an InferenceSession each time a new worker is spun up, but this is slower than just using one inference session.
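To sketch that workaround concretely (all names here are my own, nothing from the onnxruntime API): a multiprocessing pool initializer lets each worker process build its model once and reuse it for every task, rather than constructing a new InferenceSession per task. For ONNX Runtime, the factory passed in would be a top-level function returning a callable that wraps session.run:

```python
import multiprocessing as mp

# Set once per worker process by the pool initializer, then reused
# for every task that worker handles.
_worker_model = None

def _init_worker(factory):
    """Build the per-process model; for ONNX Runtime the factory would
    construct an InferenceSession and return a callable around its run()."""
    global _worker_model
    _worker_model = factory()

def _run_task(x):
    return _worker_model(x)

def map_with_per_worker_model(factory, inputs, workers=4):
    """Map inputs over a pool where each worker holds one model instance."""
    with mp.Pool(workers, initializer=_init_worker,
                 initargs=(factory,)) as pool:
        return pool.map(_run_task, inputs)
```

The factory must be a picklable top-level function so it can be shipped to the workers via initargs; the model object it returns lives only inside each worker and is never pickled.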
Thanks.
To reproduce
Some code shared in the description.
Urgency
Fairly urgent for me personally.
Platform
Linux
OS Version
Ubuntu 20.04
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
Latest
ONNX Runtime API
Python
Architecture
X64
Execution Provider
Default CPU
Execution Provider Library Version
No response
Model File
No response
Is this a quantized model?
No