Joeyzhouqihui opened this issue 2 years ago
For onnxruntime, multiple inference sessions do not share cached GPU memory unless you provide your own allocator to them. We use an arena to cache memory for each inference session.
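For reference, the documented way to let sessions share an environment-level arena is to register an allocator on the Ort::Env and opt each session in via the session.use_env_allocators config key. A minimal sketch (the model paths are hypothetical; this registers a CPU arena, and sharing the CUDA EP's GPU arena across sessions may need additional, version-dependent registration):

```cpp
#include <onnxruntime_cxx_api.h>

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "shared-allocator");

  // Register an environment-level CPU arena that sessions can opt into.
  Ort::MemoryInfo mem_info =
      Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
  Ort::ArenaCfg arena_cfg(0 /*max_mem: 0 = default*/,
                          -1 /*arena_extend_strategy: -1 = default*/,
                          -1 /*initial_chunk_size_bytes*/,
                          -1 /*max_dead_bytes_per_chunk*/);
  env.CreateAndRegisterAllocator(mem_info, arena_cfg);

  // Each session must explicitly opt into the env allocators,
  // otherwise it creates (and caches into) its own private arena.
  Ort::SessionOptions opts_a, opts_b;
  opts_a.AddConfigEntry("session.use_env_allocators", "1");
  opts_b.AddConfigEntry("session.use_env_allocators", "1");

  Ort::Session session_a(env, "model_a.onnx", opts_a);  // hypothetical paths
  Ort::Session session_b(env, "model_b.onnx", opts_b);
  return 0;
}
```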
In ORT, memory allocation is not attached to a CUDA stream. We use cudaMalloc, which is not stream-ordered. CUDA supports stream-ordered allocation via cudaMallocAsync, but ORT does not use it yet.
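For illustration, the difference between the two allocation paths looks like this (plain CUDA runtime calls, independent of ORT; cudaMallocAsync requires CUDA 11.2 or newer and a matching driver):

```cpp
#include <cuda_runtime.h>

int main() {
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  // cudaMalloc / cudaFree: not stream-ordered; the allocation is device-wide
  // and cudaFree may synchronize, handing the block straight back to the driver.
  void* p1 = nullptr;
  cudaMalloc(&p1, 1 << 20);
  cudaFree(p1);

  // cudaMallocAsync / cudaFreeAsync (CUDA 11.2+): the allocation and free are
  // ordered with the other work queued on `stream`, and freed blocks are kept
  // in a driver-managed memory pool for later reuse.
  void* p2 = nullptr;
  cudaMallocAsync(&p2, 1 << 20, stream);
  cudaFreeAsync(p2, stream);

  cudaStreamSynchronize(stream);
  cudaStreamDestroy(stream);
  return 0;
}
```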
Describe the issue
Hi, sorry to bother you!
I am trying to deploy one of our company's models, which has dynamic connections, to a production environment. Since the model is dynamically activated, batching requests for inference is not a good idea. Instead, I want to use multiple CUDA streams to handle several requests on one GPU concurrently (one stream per request).
I have tried libtorch, since it supports multiple streams. However, I found that with libtorch the memory allocated on each stream is cached by that stream and cannot be reused by other streams. (Suppose a GPU has 2 GB of memory and stream A caches 1 GB after handling request 1. When stream B wants to handle request 2, stream A first has to return its memory to the OS, and stream B has to call cudaMalloc again, which is very slow.) A rough sketch of this setup is below.
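For context, the one-stream-per-request pattern I am describing with libtorch looks roughly like this (a minimal sketch; the model path, input shape, and request handling are made up):

```cpp
#include <torch/script.h>
#include <torch/torch.h>
#include <c10/cuda/CUDAStream.h>
#include <c10/cuda/CUDAGuard.h>
#include <thread>

// Handle one request on its own CUDA stream taken from the stream pool.
void HandleRequest(torch::jit::Module& model, torch::Tensor input) {
  c10::cuda::CUDAStream stream = c10::cuda::getStreamFromPool();
  c10::cuda::CUDAStreamGuard guard(stream);  // CUDA work below runs on `stream`
  auto output = model.forward({input.to(torch::kCUDA)}).toTensor();
  stream.synchronize();
}

int main() {
  torch::jit::Module model = torch::jit::load("model.pt");  // hypothetical path
  model.to(torch::kCUDA);
  model.eval();

  // Two concurrent requests, one stream each. Blocks freed on a stream stay
  // in the caching allocator's per-stream free lists, which is the reuse
  // limitation described above.
  torch::Tensor request1 = torch::randn({1, 3, 224, 224});  // hypothetical shape
  torch::Tensor request2 = torch::randn({1, 3, 224, 224});
  std::thread t1(HandleRequest, std::ref(model), request1);
  std::thread t2(HandleRequest, std::ref(model), request2);
  t1.join();
  t2.join();
  return 0;
}
```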
I am wondering whether the same thing happens with onnxruntime. Can different streams in onnxruntime reuse cached GPU memory?
I am looking forward to your reply! Thank you so much!
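In case it helps, this is roughly how I would wire one session per stream with the CUDA EP (a sketch only, using the has_user_compute_stream / user_compute_stream fields of OrtCUDAProviderOptions; the model path is hypothetical, and whether the arenas behind these sessions can share memory is exactly my question):

```cpp
#include <onnxruntime_cxx_api.h>
#include <cuda_runtime.h>
#include <cstdint>

// Build session options whose CUDA EP computes on a caller-provided stream.
static Ort::SessionOptions MakeOptions(cudaStream_t stream) {
  OrtCUDAProviderOptions cuda_options{};
  cuda_options.device_id = 0;
  cuda_options.gpu_mem_limit = SIZE_MAX;     // no explicit arena cap
  cuda_options.arena_extend_strategy = 0;    // extend by powers of two
  cuda_options.do_copy_in_default_stream = 1;
  cuda_options.has_user_compute_stream = 1;  // run compute on our stream
  cuda_options.user_compute_stream = stream;

  Ort::SessionOptions opts;
  opts.AppendExecutionProvider_CUDA(cuda_options);
  return opts;
}

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "per-stream-sessions");

  cudaStream_t stream_a, stream_b;
  cudaStreamCreate(&stream_a);
  cudaStreamCreate(&stream_b);

  // One session per stream; by default each session keeps its own CUDA arena,
  // so memory cached by session A is not visible to session B.
  Ort::SessionOptions opts_a = MakeOptions(stream_a);
  Ort::SessionOptions opts_b = MakeOptions(stream_b);
  Ort::Session session_a(env, "model.onnx", opts_a);  // hypothetical path
  Ort::Session session_b(env, "model.onnx", opts_b);

  // ... run session_a and session_b from two request-handling threads ...

  cudaStreamDestroy(stream_a);
  cudaStreamDestroy(stream_b);
  return 0;
}
```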
To reproduce
Nope
Urgency
No response
Platform
Linux
OS Version
Ubuntu 18.04.6 LTS
ONNX Runtime Installation
Built from Source
ONNX Runtime Version or Commit ID
Latest
ONNX Runtime API
C++
Architecture
X86
Execution Provider
CUDA
Execution Provider Library Version
CUDA 11.3