microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

[Performance] In ONNX Runtime, the CPU consumption does not scale linearly with the number of threads #19384

Open bluishwhite opened 8 months ago

bluishwhite commented 8 months ago

Hello, I have run into a problem with the C++ ONNX Runtime API.

The program loads a single ONNX model. As concurrency goes up, each new request is handled by its own thread, which calls session->Run(). I found that when 4 threads handle 4 requests, the program uses 1 CPU with an RTF of 1.0. When I limit the program to 4 CPU cores and use 16 threads to handle 16 requests, the RTF ranges from 2.19 to 3.7, with an average around 3.2. The session options are: session_options_.SetIntraOpNumThreads(1);

Following the issue "OnnxRuntime multithreading efficiency is poor", I changed the session options to
session_options_.SetIntraOpNumThreads(1); session_options_.SetInterOpNumThreads(1); session_options_.DisableMemPattern(); session_options_.SetExecutionMode(ORT_SEQUENTIAL); With these options the average RTF is around 2.4.
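
For context, this is roughly how the session is configured and shared across the request threads; the model path, tensor names, and shapes below are placeholders rather than my real model:

```cpp
#include <onnxruntime_cxx_api.h>
#include <array>
#include <thread>
#include <vector>

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "demo");

  // Session options as described above: single-threaded operators,
  // sequential execution, memory pattern disabled.
  Ort::SessionOptions session_options;
  session_options.SetIntraOpNumThreads(1);
  session_options.SetInterOpNumThreads(1);
  session_options.DisableMemPattern();
  session_options.SetExecutionMode(ORT_SEQUENTIAL);

  // One session, shared by all request threads.
  Ort::Session session(env, "model.onnx", session_options);  // placeholder path

  auto handle_request = [&session]() {
    Ort::MemoryInfo mem_info =
        Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
    std::vector<float> input(1 * 80 * 100, 0.f);   // placeholder input data
    std::array<int64_t, 3> shape{1, 80, 100};      // placeholder shape
    Ort::Value input_tensor = Ort::Value::CreateTensor<float>(
        mem_info, input.data(), input.size(), shape.data(), shape.size());

    const char* input_names[] = {"input"};          // placeholder tensor names
    const char* output_names[] = {"output"};
    auto outputs = session.Run(Ort::RunOptions{nullptr},
                               input_names, &input_tensor, 1,
                               output_names, 1);
    // ... post-process outputs ...
  };

  // 16 request threads calling Run() on the single shared session.
  std::vector<std::thread> workers;
  for (int i = 0; i < 16; ++i) workers.emplace_back(handle_request);
  for (auto& t : workers) t.join();
}
```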

The deadline is looming and time is running out for me 😢 How can I further optimize to achieve a more linear relationship between CPU consumption and concurrency? The ideal RTF would be around 1.0 (16 threads handling 16 requests on 4 CPUs).

Platform

Linux

OS Version

Ubuntu 22.04

ONNX Runtime Version or Commit ID

1.12.0

ONNX Runtime API

C++

Architecture

X64

Execution Provider

Default CPU

yufenglee commented 8 months ago

As you only have 4 cores, why do you create 16 threads?

pranavsharma commented 8 months ago

First, you're using a version of ORT that is 4 releases old. Second, as Yufeng said above, it's not clear why you have 16 threads on a 4-core machine. What is RTF?

bluishwhite commented 8 months ago

@yufenglee Thanks for your reply. Constrained by resources, I aim to use as few CPUs as possible while supporting as many concurrent threads as feasible. In my tests, one CPU can accommodate 4 threads; if the relationship between CPUs and threads were proportional, 4 CPUs could sustain 16 threads.

Besides, I found that when I use Docker to create several containers running my onnxruntime program, each pinned to different CPU core IDs, the containers' CPU loads influence each other as the number of containers grows. With two containers, the CPU usage of each container is around 80%; with three containers it is around 90%; and with four it is around 100%.

@pranavsharma RTF (Real Time Factor) = total_time_taken / total_audio_duration, which serves as a performance evaluation metric: the lower the RTF, the better the performance. Yes, the version of onnxruntime is too old; I will switch to a newer version. Thank you.
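For example, under this definition, processing 10 seconds of audio in 32 seconds gives an RTF of 32 / 10 = 3.2.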

poor1017 commented 8 months ago

@yufenglee Hi, we encountered a similar problem. We bound a container A running an onnxruntime program to one CPU, and another container B running the same onnxruntime program to a different CPU. If container A or container B runs alone, its CPU load stays at about 50%, but if they run at the same time, the load of each rises to 80%.

We measured the CPU cycles of session.Run() and found that it was the main cause of increased CPU load.

Are there any Ort configuration options that can eliminate this impact between containers?

github-actions[bot] commented 6 months ago

This issue has been automatically marked as stale due to inactivity and will be closed in 30 days if no further activity occurs. If further support is needed, please provide an update and/or more details.

radikalliberal commented 6 months ago

Hi, before this issue gets closed: I also have the same problem. When running many threads at the same time, session.Run is slow. I thought it might have something to do with memory allocation for the input tensors, but I was able to rule that out. There appears to be some kind of synchronization inside session.Run. Can somebody from the dev team tell us why this is necessary?

poor1017 commented 6 months ago

> Hi, before this issue gets closed: I also have the same problem. When running many threads at the same time, session.Run is slow. I thought it might have something to do with memory allocation for the input tensors, but I was able to rule that out. There appears to be some kind of synchronization inside session.Run. Can somebody from the dev team tell us why this is necessary?

In my situation, it was due to the NUMA architecture. A session option may help, such as enable_spinning_lock.
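
What I have in mind is the intra-op spinning control. Assuming a recent ORT build that exposes it as the session config entry `session.intra_op.allow_spinning` (check the config keys available in your version), it would look roughly like:

```cpp
Ort::SessionOptions session_options;
// Ask the intra-op thread pool not to busy-spin while waiting for work,
// so idle worker threads yield the core instead of burning cycles.
session_options.AddConfigEntry("session.intra_op.allow_spinning", "0");
```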

radikalliberal commented 6 months ago

Thanks @poor1017, that was a great hint. I think you are right that this is NUMA. I'm running ORT from C++, and when I create the session in the same thread that later executes it, the forward times drop significantly and are almost on par with single-threaded performance. My suggestion is to create each session in its own dedicated thread and only run it in that same thread.
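
Roughly what I mean, as a minimal sketch (the model path and the elided input/output handling are placeholders, not from a real setup):

```cpp
#include <onnxruntime_cxx_api.h>
#include <thread>
#include <vector>

// Each worker creates its own session inside the thread that will also run it,
// so the session's allocations stay local to that thread (and its NUMA node).
void worker(const Ort::Env& env) {
  Ort::SessionOptions opts;
  opts.SetIntraOpNumThreads(1);
  Ort::Session session(env, "model.onnx", opts);  // placeholder model path

  // ... build input tensors and call session.Run(...) from this same thread ...
}

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "per-thread-session");
  std::vector<std::thread> threads;
  for (int i = 0; i < 4; ++i) threads.emplace_back(worker, std::cref(env));
  for (auto& t : threads) t.join();
}
```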

radikalliberal commented 6 months ago

Hi, I have to revise my answer. Measuring multithreaded performance was not as straightforward as I had expected. The higher forward times may have occurred because the CPUs were being throttled after sitting idle beforehand; by the time we instantiate the session inside the thread, that throttling has already stopped. So this does not seem to be a memory issue. I was not able to measure significant differences between allocating in the main thread and then running the forward pass in another.