microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

[Performance] High memory use by CUDAExecutionProvider on Jetson Xavier NX (JetPack 4.4) #14038

Open Windsoldier76 opened 1 year ago

Windsoldier76 commented 1 year ago

Describe the issue

Hi there, I'm a beginner, so please forgive me if I say something silly.

I'm trying to deploy multiple models on a Jetson Xavier NX for real-time detection. When I load three or more models, all of the memory is used and my program stops responding.

Through jtop and mprof I found that with the CPUExecutionProvider each model uses only about 300 MiB, but with the CUDAExecutionProvider each model needs nearly 1750 MiB. In that case, if I load more than two models, there is no memory left.

Is this normal? For real-time detection I have to use multiprocessing to decouple video capture from video processing, so I can't let the CPU usage get too high. Is there any way to reduce the memory footprint with the CUDAExecutionProvider?

By the way, swap space is already enabled through jtop, but it doesn't help. The model is YOLOv5 exported to ONNX and simplified; the model file is about 27.1 MB.

To reproduce

# Imports needed to run this snippet standalone
import cv2
import numpy as np
import onnxruntime

# This is my class for loading a model
class YOLOV5:
    def __init__(self, train_model_path):
        self.onnx_session = onnxruntime.InferenceSession(train_model_path, providers=['CPUExecutionProvider'])
        # self.onnx_session = onnxruntime.InferenceSession(train_model_path, providers=['CUDAExecutionProvider'])
        self.input_name = self.get_input_name()
        self.output_name = self.get_output_name()

    def get_input_name(self):
        input_name = []
        for node in self.onnx_session.get_inputs():
            input_name.append(node.name)
        return input_name

    def get_output_name(self):
        output_name = []
        for node in self.onnx_session.get_outputs():
            output_name.append(node.name)
        return output_name

    def get_input_feed(self, img_tensor):
        input_feed = {}
        for name in self.input_name:
            input_feed[name] = img_tensor
        return input_feed

    def inference(self, input_img):
        # img = cv2.imread(img_path)
        resize_img = cv2.resize(input_img, (640, 640))
        input_img = resize_img[:, :, ::-1].transpose(2, 0, 1)  # BGR2RGB and HWC2CHW
        input_img = input_img.astype(dtype=np.float32)
        input_img /= 255.0
        input_img = np.expand_dims(input_img, axis=0)
        input_feed = self.get_input_feed(input_img)
        pred = self.onnx_session.run(None, input_feed)[0]
        return pred, resize_img
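
A minimal driver for this class might look like the following (the model path and test image are placeholders, not files from this issue; in practice the input is a captured video frame):

model = YOLOV5('yolov5s_simplified.onnx')   # placeholder path to the exported ONNX model
frame = cv2.imread('test.jpg')              # placeholder image standing in for a video frame
pred, resized_img = model.inference(frame)  # pred is the raw YOLOv5 output tensor
print(pred.shape)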

Urgency

No response

Platform

Linux

OS Version

Ubuntu 18.04 LTS

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

onnxruntime_gpu-1.10.0-cp36-cp36m-linux_aarch64.whl

ONNX Runtime API

Python

Architecture

Other / Unknown

Execution Provider

CUDA

Execution Provider Library Version

CUDA 10.2

Model File

test_simplied.zip

Is this a quantized model?

Unknown

tianleiwu commented 1 year ago

There are a few session options that might help: https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html

gpu_mem_limit: set some limit.
arena_extend_strategy: kSameAsRequested (1)
cudnn_conv_algo_search: HEURISTIC (1) or DEFAULT (2)
cudnn_conv_use_max_workspace: 0

For example, setting arena_extend_strategy to kSameAsRequested avoids allocating more memory than is needed. Use a few images to warm up the service; note that the first image might trigger the cuDNN conv algorithm search, which can need a lot of workspace memory. Changing cudnn_conv_algo_search and cudnn_conv_use_max_workspace can reduce memory usage (it could also impact speed, since only a subset of algorithms is searched). I think gpu_mem_limit might also limit the cuDNN workspace.
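
As a sketch (the 2 GiB limit is an illustrative value, the model path is a placeholder, and some keys may not be available in older ONNX Runtime builds), these settings can be passed as CUDAExecutionProvider provider options when creating the session:

import onnxruntime

cuda_options = {
    'gpu_mem_limit': 2 * 1024 * 1024 * 1024,      # illustrative cap on the arena, in bytes
    'arena_extend_strategy': 'kSameAsRequested',  # grow the arena only by what is actually requested
    'cudnn_conv_algo_search': 'HEURISTIC',        # lighter algo search, less workspace memory
    'cudnn_conv_use_max_workspace': '0',          # do not reserve the maximum cuDNN workspace
}

session = onnxruntime.InferenceSession(
    'model.onnx',  # placeholder path
    providers=[('CUDAExecutionProvider', cuda_options), 'CPUExecutionProvider'],
)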

The run option memory.enable_memory_arena_shrinkage can be used to shrink the arena memory; see example.
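
A minimal sketch of that run option (assuming the session and input_feed above, and "gpu:0" as the device whose arena should be shrunk):

run_options = onnxruntime.RunOptions()
# Request that the memory arena be shrunk back after this particular run.
run_options.add_run_config_entry('memory.enable_memory_arena_shrinkage', 'gpu:0')
outputs = session.run(None, input_feed, run_options)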

Try the following sequence to see whether it helps.