Joeyzhouqihui opened this issue 2 years ago
For onnxruntime, multiple inference sessions do not share cached GPU memory unless you provide your own allocator to them. We use an arena to cache memory for each inference session.
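For reference, the documented way to let sessions share an environment-level arena is to register an allocator on the Ort::Env and opt each session in via the session.use_env_allocators config key. A minimal sketch (the model paths are hypothetical; this registers a CPU arena, and sharing the CUDA EP's GPU arena across sessions may need additional, version-dependent registration):

```cpp
#include <onnxruntime_cxx_api.h>

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "shared-allocator");

  // Register an environment-level CPU arena that sessions can opt into.
  Ort::MemoryInfo mem_info =
      Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
  Ort::ArenaCfg arena_cfg(0 /*max_mem: 0 = default*/,
                          -1 /*arena_extend_strategy: -1 = default*/,
                          -1 /*initial_chunk_size_bytes*/,
                          -1 /*max_dead_bytes_per_chunk*/);
  env.CreateAndRegisterAllocator(mem_info, arena_cfg);

  // Each session must explicitly opt into the env allocators,
  // otherwise it creates (and caches into) its own private arena.
  Ort::SessionOptions opts_a, opts_b;
  opts_a.AddConfigEntry("session.use_env_allocators", "1");
  opts_b.AddConfigEntry("session.use_env_allocators", "1");

  Ort::Session session_a(env, "model_a.onnx", opts_a);  // hypothetical paths
  Ort::Session session_b(env, "model_b.onnx", opts_b);
  return 0;
}
```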
In ORT, memory allocation is not attached to a CUDA stream. We use cudaMalloc, which is not stream-ordered. CUDA supports stream-ordered allocation via cudaMallocAsync, but ORT does not use it yet.
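For illustration, the difference between the two allocation paths looks like this (plain CUDA runtime calls, independent of ORT; cudaMallocAsync requires CUDA 11.2 or newer and a matching driver):

```cpp
#include <cuda_runtime.h>

int main() {
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  // cudaMalloc / cudaFree: not stream-ordered; the allocation is device-wide
  // and cudaFree may synchronize, handing the block straight back to the driver.
  void* p1 = nullptr;
  cudaMalloc(&p1, 1 << 20);
  cudaFree(p1);

  // cudaMallocAsync / cudaFreeAsync (CUDA 11.2+): the allocation and free are
  // ordered with the other work queued on `stream`, and freed blocks are kept
  // in a driver-managed memory pool for later reuse.
  void* p2 = nullptr;
  cudaMallocAsync(&p2, 1 << 20, stream);
  cudaFreeAsync(p2, stream);

  cudaStreamSynchronize(stream);
  cudaStreamDestroy(stream);
  return 0;
}
```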
Describe the issue
Hi, sorry to bother you!
I am trying to deploy one of our company's models, which has dynamic connections, to a production environment. Since the model is dynamically activated, batching requests for inference is not a good idea. Instead, I want to use multiple CUDA streams to handle several requests on one GPU concurrently (one stream per request).
I have tried libtorch, since it supports multiple streams. However, I found that with libtorch the memory allocated on each stream is cached by that stream and cannot be reused by other streams. (Suppose a GPU has 2 GB of memory and stream A caches 1 GB after handling request 1. When stream B wants to handle request 2, stream A first has to return its memory to the OS, and stream B has to call cudaMalloc again, which is very slow.) A rough sketch of this setup is below.
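For context, the one-stream-per-request pattern I am describing with libtorch looks roughly like this (a minimal sketch; the model path, input shape, and request handling are made up):

```cpp
#include <torch/script.h>
#include <torch/torch.h>
#include <c10/cuda/CUDAStream.h>
#include <c10/cuda/CUDAGuard.h>
#include <thread>

// Handle one request on its own CUDA stream taken from the stream pool.
void HandleRequest(torch::jit::Module& model, torch::Tensor input) {
  c10::cuda::CUDAStream stream = c10::cuda::getStreamFromPool();
  c10::cuda::CUDAStreamGuard guard(stream);  // CUDA work below runs on `stream`
  auto output = model.forward({input.to(torch::kCUDA)}).toTensor();
  stream.synchronize();
}

int main() {
  torch::jit::Module model = torch::jit::load("model.pt");  // hypothetical path
  model.to(torch::kCUDA);
  model.eval();

  // Two concurrent requests, one stream each. Blocks freed on a stream stay
  // in the caching allocator's per-stream free lists, which is the reuse
  // limitation described above.
  torch::Tensor request1 = torch::randn({1, 3, 224, 224});  // hypothetical shape
  torch::Tensor request2 = torch::randn({1, 3, 224, 224});
  std::thread t1(HandleRequest, std::ref(model), request1);
  std::thread t2(HandleRequest, std::ref(model), request2);
  t1.join();
  t2.join();
  return 0;
}
```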
I am wondering whether the same thing happens with onnxruntime. Can different streams in onnxruntime reuse cached GPU memory?
I am looking forward to your reply! Thank you so much!
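In case it helps, this is roughly how I would wire one session per stream with the CUDA EP (a sketch only, using the has_user_compute_stream / user_compute_stream fields of OrtCUDAProviderOptions; the model path is hypothetical, and whether the arenas behind these sessions can share memory is exactly my question):

```cpp
#include <onnxruntime_cxx_api.h>
#include <cuda_runtime.h>
#include <cstdint>

// Build session options whose CUDA EP computes on a caller-provided stream.
static Ort::SessionOptions MakeOptions(cudaStream_t stream) {
  OrtCUDAProviderOptions cuda_options{};
  cuda_options.device_id = 0;
  cuda_options.gpu_mem_limit = SIZE_MAX;     // no explicit arena cap
  cuda_options.arena_extend_strategy = 0;    // extend by powers of two
  cuda_options.do_copy_in_default_stream = 1;
  cuda_options.has_user_compute_stream = 1;  // run compute on our stream
  cuda_options.user_compute_stream = stream;

  Ort::SessionOptions opts;
  opts.AppendExecutionProvider_CUDA(cuda_options);
  return opts;
}

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "per-stream-sessions");

  cudaStream_t stream_a, stream_b;
  cudaStreamCreate(&stream_a);
  cudaStreamCreate(&stream_b);

  // One session per stream; by default each session keeps its own CUDA arena,
  // so memory cached by session A is not visible to session B.
  Ort::SessionOptions opts_a = MakeOptions(stream_a);
  Ort::SessionOptions opts_b = MakeOptions(stream_b);
  Ort::Session session_a(env, "model.onnx", opts_a);  // hypothetical path
  Ort::Session session_b(env, "model.onnx", opts_b);

  // ... run session_a and session_b from two request-handling threads ...

  cudaStreamDestroy(stream_a);
  cudaStreamDestroy(stream_b);
  return 0;
}
```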
To reproduce
Nope
Urgency
No response
Platform
Linux
OS Version
Ubuntu 18.04.6 LTS
ONNX Runtime Installation
Built from Source
ONNX Runtime Version or Commit ID
Latest
ONNX Runtime API
C++
Architecture
X86
Execution Provider
CUDA
Execution Provider Library Version
CUDA 11.3