microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

GPU Memory allocation with multiple cuda stream #12920

Open Joeyzhouqihui opened 2 years ago

Joeyzhouqihui commented 2 years ago

Describe the issue

Hi, sorry for bothering!

I am trying to deploy one of our company's models, which has dynamic connections, to a production environment. Since the model is dynamically activated, batching requests for inference is not a good idea. Instead, I want to use multiple CUDA streams to handle several requests on one GPU concurrently (one stream per request).

I have tried libtorch, since it supports multiple streams. However, I found that with libtorch the memory allocated by each stream is cached by that stream and cannot be reused by other streams. (Suppose a GPU has 2 GB of memory, and stream A caches 1 GB after handling request 1. When stream B wants to handle request 2, stream A first has to return its memory to the OS, and stream B then has to call cudaMalloc, which is very slow.)

I am wondering whether the same thing will happen with onnxruntime. Can different streams in onnxruntime reuse cached GPU memory?

I am looking forward to your reply! Thank you so much!

To reproduce

Nope

Urgency

No response

Platform

Linux

OS Version

Ubuntu 18.04.6 LTS

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

Latest

ONNX Runtime API

C++

Architecture

X86

Execution Provider

CUDA

Execution Provider Library Version

CUDA 11.3

tianleiwu commented 2 years ago

For onnxruntime, multiple inference sessions do not share cached GPU memory unless you provide your own allocator to them. We use an arena to cache memory for each inference session.
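
A minimal sketch of what "providing your own allocator" could look like with the C++ API, assuming an environment-registered arena shared by two sessions. The CUDA `Ort::MemoryInfo` and the `model.onnx` path are placeholders, and depending on the ORT version env-level allocator sharing may only cover CPU memory, in which case a custom allocator registered with the CUDA EP would be needed instead:

```cpp
// Sketch: share one arena-backed allocator across two inference sessions via the
// environment, so each session does not grow its own private arena.
#include <onnxruntime_cxx_api.h>

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "shared-arena");

  // Arena configuration: max_mem=0 lets ORT pick the limit, strategy 0 = kNextPowerOfTwo,
  // -1 keeps the default chunk sizes.
  Ort::ArenaCfg arena_cfg(/*max_mem=*/0, /*arena_extend_strategy=*/0,
                          /*initial_chunk_size_bytes=*/-1, /*max_dead_bytes_per_chunk=*/-1);

  // Register the shared allocator with the environment.
  // Assumption: a CUDA memory info is accepted here; older releases may restrict
  // env-registered allocators to CPU memory.
  Ort::MemoryInfo cuda_mem_info("Cuda", OrtArenaAllocator, /*device_id=*/0, OrtMemTypeDefault);
  env.CreateAndRegisterAllocator(cuda_mem_info, arena_cfg);

  // Each session opts in to the environment's allocators instead of creating its own arena.
  OrtCUDAProviderOptions cuda_options;  // defaults: device 0
  Ort::SessionOptions so1, so2;
  so1.AddConfigEntry("session.use_env_allocators", "1");
  so1.AppendExecutionProvider_CUDA(cuda_options);
  so2.AddConfigEntry("session.use_env_allocators", "1");
  so2.AppendExecutionProvider_CUDA(cuda_options);

  Ort::Session session1(env, "model.onnx", so1);  // "model.onnx" is a placeholder path
  Ort::Session session2(env, "model.onnx", so2);
  return 0;
}
```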

In ORT, memory allocation is not tied to a CUDA stream. We use cudaMalloc, which is not stream-ordered. CUDA supports stream-ordered allocation via cudaMallocAsync, which is not used in ORT yet.
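
For reference, a minimal sketch (plain CUDA runtime code, not an ONNX Runtime API) of what stream-ordered allocation with cudaMallocAsync looks like; the sizes and stream setup are illustrative only, and this requires CUDA 11.2+:

```cpp
// Sketch: cudaMallocAsync/cudaFreeAsync enqueue allocation and free on a stream and
// draw from a device memory pool, unlike plain cudaMalloc which synchronizes the device.
#include <cuda_runtime.h>

int main() {
  cudaStream_t stream_a, stream_b;
  cudaStreamCreate(&stream_a);
  cudaStreamCreate(&stream_b);

  // Allocation is ordered on stream_a; the bytes come from the device's default pool.
  void* buf_a = nullptr;
  cudaMallocAsync(&buf_a, 1 << 20, stream_a);
  // ... launch kernels on stream_a that use buf_a ...
  cudaFreeAsync(buf_a, stream_a);  // freed bytes go back to the pool, not to the OS

  // A later allocation on stream_b can reuse bytes released by stream_a once the
  // pool's reuse rules allow it, avoiding a fresh, slow cudaMalloc.
  void* buf_b = nullptr;
  cudaMallocAsync(&buf_b, 1 << 20, stream_b);
  cudaFreeAsync(buf_b, stream_b);

  cudaStreamSynchronize(stream_a);
  cudaStreamSynchronize(stream_b);
  cudaStreamDestroy(stream_a);
  cudaStreamDestroy(stream_b);
  return 0;
}
```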