microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai

Enable Global Shared Threadpool and Memory Allocator For C# #12654

insightless commented 2 years ago

**Is your feature request related to a problem? Please describe.**
We have a .NET application that needs to run multiple models in parallel across multiple (or the same) GPUs within one process. Currently this causes thread and resource contention and can lead to crashes if too many sessions are started in the same process.

**System information**
ONNX Runtime v1.12.1 on Windows

**Describe the solution you'd like**
The C API can share a global threadpool and a shared allocator across multiple InferenceSessions within the same process. The C# API references this IntPtr in the OrtApi struct, but there is no function or other way to use it. If these two capabilities were exposed through a function, property, or class, it would greatly improve performance and stability in scenarios like the one described above.
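
For illustration only, a sketch of what such a C# surface might look like, mirroring the C API's CreateEnvWithGlobalThreadPools and CreateAndRegisterAllocator. Every type and member marked hypothetical below does not exist in v1.12.1; this is the shape of the request, not a working implementation:

```csharp
using Microsoft.ML.OnnxRuntime;

// HYPOTHETICAL sketch: these members do not exist in v1.12.1. They mirror
// the C API's CreateEnvWithGlobalThreadPools and CreateAndRegisterAllocator.
var threading = new OrtThreadingOptions          // hypothetical type
{
    GlobalIntraOpNumThreads = 8,
    GlobalInterOpNumThreads = 1,
};

// One process-wide threadpool and CPU allocator shared by all sessions.
OrtEnv env = OrtEnv.CreateWithGlobalThreadPools(threading);  // hypothetical
env.CreateAndRegisterAllocator();                            // hypothetical

using var options = new SessionOptions();
options.DisablePerSessionThreads();  // hypothetical: use the env threadpool instead
using var session = new InferenceSession("model.onnx", options);
```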

**Describe alternatives you've considered**
Writing a custom wrapper for that function, running multiple processes of the application (not ideal), or moving to a different machine learning package.

pranavsharma commented 2 years ago

You want to run inference on multiple models in parallel across multiple GPUs. To achieve this, you need a session per model, with each session tied to a separate GPU device id. How many sessions are you creating? Have you tried limiting the number of threads to 1 per session by setting intra_op_num_threads? Assuming the GPU executes the full model or a large part of it, setting this to 1 shouldn't matter for perf.
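
A minimal C# sketch of that setup, assuming the Microsoft.ML.OnnxRuntime.Gpu package and placeholder model paths/device ids:

```csharp
using Microsoft.ML.OnnxRuntime;

// One InferenceSession per model, each tied to its own GPU device id,
// with the per-session CPU threadpool capped at one thread.
static InferenceSession CreateSession(string modelPath, int gpuDeviceId)
{
    var options = new SessionOptions();
    options.IntraOpNumThreads = 1;                      // 1 CPU thread per session
    options.AppendExecutionProvider_CUDA(gpuDeviceId);  // pin the session to one GPU
    return new InferenceSession(modelPath, options);
}

// Placeholder paths and device ids for illustration.
var sessionA = CreateSession("modelA.onnx", 0);
var sessionB = CreateSession("modelB.onnx", 1);
```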

insightless commented 2 years ago

I'm creating anywhere from 1 to 5 sessions per GPU, depending on the model and available VRAM. I've tried limiting the number of threads per session to 1, but there still seems to be a performance hit. The models are used for batch semantic inference, so each session runs on a tensor of multiple images.

pranavsharma commented 2 years ago

With 5 sessions (each with intra_op_num_threads set to 1), only 5 threads will be created across the session threadpools. As far as allocators go, the bulk of the memory consumption comes from the arena, and you can disable it for CPU by setting DisableCpuMemArena on each of the sessions. The GPU-based arenas are not shared. If this doesn't resolve the issue, let us know what specifically you're observing.
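
A minimal sketch of those per-session options in C#, with a placeholder model path and device id; EnableCpuMemArena is the C# counterpart of the C API's DisableCpuMemArena:

```csharp
using Microsoft.ML.OnnxRuntime;

// Disable the CPU memory arena on each session to cap host-side memory growth.
using var options = new SessionOptions();
options.EnableCpuMemArena = false;       // counterpart of DisableCpuMemArena
options.IntraOpNumThreads = 1;           // keep the 1-thread-per-session advice
options.AppendExecutionProvider_CUDA(0); // placeholder device id
using var session = new InferenceSession("model.onnx", options);
```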