ccccjunkang opened this issue 5 months ago (status: Open)
Each session creates a ThreadPool that is sized to run on all cores. That pool is used for intra-op parallelization, meaning many CPU kernels attempt to distribute their work across all the cores for a given session. Multiple sessions therefore create a lot of contention and context switching.
You may want to experiment with your model and set a different number of threads for the intra-op thread pool, or make all the sessions share a global thread pool, and see whether that makes things faster.
For CUDA, though, effectively disabling the thread pool by setting the number of intra-op threads to 1 seems to be the best option for each session, since CUDA kernels do not use CPU thread pools.
Hi, which kernels do you have in your model? For implementation reasons, ORT's CUDA EP has some kernels that currently don't support being launched in parallel, for example the Conv kernel. Just double-check whether that is the case.
Thank you for your reply, @souptc. In this case there is no Conv kernel in the model; only MatMul and elementwise kernels are used. The reason for using multiple sessions is to create multiple streams on the GPU so it can launch kernels concurrently. Is this lock caused by the CUDA runtime? I set a CUDA context for each thread to avoid the lock, but it did not take effect.
Hi, I now use multiple ORT sessions in one process (invoked from different threads), but it does not improve throughput. I had set a CUDA context for each thread that invokes an ORT session.
Here is an Nsight profile timeline; kernel launches appear to be blocked by a rwlock.
onnxruntime version: 1.12.0