insightless opened this issue 2 years ago
You want to run inference on multiple models in parallel across multiple GPUs. To achieve this you need a session per model, with each session tied to a separate GPU device id. How many sessions are you creating? Have you tried limiting the number of threads to 1 per session by setting intra_op_num_threads? Assuming the GPU executes the full model or a large part of it, setting this to 1 shouldn't matter for perf.
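For reference, the per-session setup described above looks roughly like this through the C API (a sketch only: error handling is omitted, the model path and device id are placeholders, and the legacy OrtSessionOptionsAppendExecutionProvider_CUDA helper is used for brevity):

```c
#include <onnxruntime_c_api.h>

// Sketch: one session per model, each pinned to its own GPU.
// Error checks omitted; model_path and device_id are illustrative.
OrtSession* create_session_on_gpu(const OrtApi* ort, OrtEnv* env,
                                  const ORTCHAR_T* model_path, int device_id) {
    OrtSessionOptions* so = NULL;
    ort->CreateSessionOptions(&so);

    // The GPU runs all (or most) of the model, so a single CPU
    // intra-op thread per session should not cost throughput.
    ort->SetIntraOpNumThreads(so, 1);

    // Tie this session to one GPU via the CUDA execution provider.
    OrtSessionOptionsAppendExecutionProvider_CUDA(so, device_id);

    OrtSession* session = NULL;
    ort->CreateSession(env, model_path, so, &session);
    ort->ReleaseSessionOptions(so);
    return session;
}
```

Calling this once per model, with device_id cycling over the available GPUs, gives the "session per model, session per device" layout described above.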
I'm creating anywhere from 1 to 5 sessions per GPU, depending on the model and available VRAM. I've tried limiting the number of threads per session to 1, but there still seems to be a performance hit. The models are used for batch semantic inference, so each session receives a batched tensor of multiple images.
With 5 sessions, only 5 threads will be created in the session thread pools. As far as allocators go, the largest memory consumption comes from the arena, and you can disable this for CPU by calling DisableCpuMemArena on each of the sessions. The GPU-based arenas are not shared. If this doesn't resolve the issue, let us know what specifically you're observing.
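Concretely, disabling the CPU arena is one extra call on each session's options (sketch; assumes the same OrtApi* handle as in the earlier snippet):

```c
// Disable the CPU memory arena for a session so host allocations go
// through the plain CPU allocator instead of growing an arena.
// Note: this affects CPU memory only; each session's GPU arena is
// separate and is not shared between sessions.
OrtSessionOptions* so = NULL;
ort->CreateSessionOptions(&so);
ort->SetIntraOpNumThreads(so, 1);
ort->DisableCpuMemArena(so);
```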
Is your feature request related to a problem? Please describe.
We have a .NET application that needs to run multiple models in parallel across multiple (or the same) GPUs within one process. Currently this causes thread and resource contention and can lead to crashes if too many sessions are started in the same process.
System information
ONNX Runtime v1.12.1 on Windows
Describe the solution you'd like
The C API has the ability to use a global/shared thread pool and shared allocators across multiple InferenceSessions within the same process. The C# API references this IntPtr in the OrtApi struct, but there is no function or other way to use it. If these two things could be exposed through a function, property, or class, it would greatly increase performance and stability in scenarios like the one described above.
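For comparison, this is roughly what the requested functionality looks like on the C API side today (a sketch with error handling omitted; the thread counts and log id are placeholders):

```c
#include <onnxruntime_c_api.h>

const OrtApi* ort = OrtGetApiBase()->GetApi(ORT_API_VERSION);

// 1. Create the environment with global (process-wide) thread pools.
OrtThreadingOptions* tp = NULL;
ort->CreateThreadingOptions(&tp);
ort->SetGlobalIntraOpNumThreads(tp, 4);   // placeholder counts
ort->SetGlobalInterOpNumThreads(tp, 1);

OrtEnv* env = NULL;
ort->CreateEnvWithGlobalThreadPools(ORT_LOGGING_LEVEL_WARNING, "shared", tp, &env);
ort->ReleaseThreadingOptions(tp);

// 2. Register a shared CPU allocator on the environment.
OrtMemoryInfo* mem_info = NULL;
ort->CreateCpuMemoryInfo(OrtArenaAllocator, OrtMemTypeDefault, &mem_info);
ort->CreateAndRegisterAllocator(env, mem_info, NULL);  // NULL = default arena config
ort->ReleaseMemoryInfo(mem_info);

// 3. Each session opts in to the global pools and the shared allocator.
OrtSessionOptions* so = NULL;
ort->CreateSessionOptions(&so);
ort->DisablePerSessionThreads(so);  // use the env's global thread pools
ort->AddSessionConfigEntry(so, "session.use_env_allocators", "1");
```

The feature request is essentially for the C# SessionOptions/OrtEnv wrappers to surface these same calls.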
Describe alternatives you've considered
Writing a custom wrapper for that function, running multiple processes of the application (not ideal), or moving to a different machine learning package.