ccccjunkang opened this issue 5 months ago (status: Open)
Each session creates a ThreadPool that is sized to run on all cores. That pool is used for intra-op parallelization, meaning many CPU kernels attempt to distribute their work across all the cores for a given session. Multiple sessions therefore create a lot of contention and context switching.
You may want to experiment with your model and set a different number of threads for the intra-op thread pool, or make all the sessions share a global thread pool, and see whether that makes things faster.
For CUDA, though, effectively disabling the thread pool by setting the number of intra-op threads to 1 seems to be the best option for each session, since CUDA kernels do not use CPU thread pools.
Hi, which kernels do you have in your model? For implementation reasons, ORT's CUDA EP has some kernels that currently don't support being launched in parallel, for example the Conv kernel. Just double-check whether that is the case.
Thank you for your reply, @souptc. In this case there is no Conv kernel in the model; only MatMul and elementwise kernels are used. The reason for using multiple sessions is to create multiple streams on the GPU so it can launch kernels concurrently. Is this lock caused by the CUDA runtime? I set a CUDA context for each thread to avoid the lock, but it did not take effect.
Hi, I now use multiple ORT sessions in one process (invoked from different threads), but it does not improve throughput. I had set a CUDA context for each thread that invokes an ORT session.
Here is an Nsight profile timeline; kernel launches appear to be blocked by a rwlock.
onnxruntime version: 1.12.0