microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

How to release GPU memory while keeping the onnxruntime session around #9509

Open Z-XQ opened 3 years ago

Z-XQ commented 3 years ago

I want to release GPU memory in time and keep the session running. Thank you!

hariharans29 commented 3 years ago

Can you please elaborate about your scenario ? What exactly do you mean by "release GPU memory in time and keep the session running" ? Do you mean you want to shrink any GPU memory arena associated with the session periodically while still keeping the session alive ?

Z-XQ commented 3 years ago

> Can you please elaborate about your scenario ? What exactly do you mean by "release GPU memory in time and keep the session running" ? Do you mean you want to shrink any GPU memory arena associated with the session periodically while still keeping the session alive ?

Thank you for your reply! The following is my scenario.

My GPU is a 3090. 708 MB of GPU memory is in use before I open an onnxruntime session. Then I open a session with `ort_session = onnxruntime.InferenceSession(model_path)`, and GPU memory usage grows to about 1.7 GB.

When I run inference on one image with `seg_raw_output = tmp.run([], {self.seg_input_name: seg_input_data})[0]`, GPU memory usage grows to about 2.0 GB, and it does not decline when the inference operation is over.

Therefore, I want to release some of that GPU memory so usage returns to about 1.7 GB while still keeping the session alive.

Thank you! I am looking forward to your reply!

hariharans29 commented 3 years ago

Thanks @Z-XQ for the explanation.

The GPU memory is backed by a memory pool (arena) and we have a config knob to shrink the arena (de-allocates unused memory chunks).

Not sure if we have enough tools to accomplish this in Python just yet. The best way to use this feature in C++ is to:

1) Not allocate weights memory through the arena: See here

2) Configure the arena to have high enough initial chunk to support most Run() calls. See "initial_chunk_size_bytes" here

3) Finally, configure the arena to shrink on every Run(). See here. This will keep the initial chunk allocated but de-allocate any unused chunk remaining after the Run() call ends.

For example, if the initial chunk size is set as 500MB, the first Run() will allocate 500MB + any additional chunks required to service the Run() call. The additional chunks will get de-allocated after Run() and only keep 500MB of memory allocated. It is important to not allocate weights (initializers) memory through the arena as that complicates the shrinkage. Hence, step (1).
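For readers arriving at this thread later: more recent ONNX Runtime releases expose these knobs to Python as string config entries. A minimal sketch of the three steps above, assuming the documented config keys `session.use_device_allocator_for_initializers` and `memory.enable_memory_arena_shrinkage`, and with `model.onnx` and the input name as placeholders:

```python
import onnxruntime as ort

# Step 1: keep initializer (weight) memory out of the arena so that
# shrinking never has to work around long-lived weight chunks.
so = ort.SessionOptions()
so.add_session_config_entry("session.use_device_allocator_for_initializers", "1")

# Step 2: keep arena growth conservative; kSameAsRequested extends the
# arena only by what each allocation actually asks for.
cuda_opts = {"arena_extend_strategy": "kSameAsRequested"}
sess = ort.InferenceSession(
    "model.onnx",  # placeholder path
    sess_options=so,
    providers=[("CUDAExecutionProvider", cuda_opts), "CPUExecutionProvider"],
)

# Step 3: request an arena shrink on the GPU 0 arena after this Run();
# unused chunks beyond the initial one are de-allocated when Run() ends.
ro = ort.RunOptions()
ro.add_run_config_entry("memory.enable_memory_arena_shrinkage", "gpu:0")
outputs = sess.run(None, {"input": input_data}, run_options=ro)
```

This is a sketch, not the exact code the maintainer had in mind; the initial chunk size from step (2) is configured through the arena configuration (`OrtArenaCfg`) rather than a provider option, so check the docs for your release.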

Z-XQ commented 3 years ago

> Thanks @Z-XQ for the explanation.
>
> The GPU memory is backed by a memory pool (arena) and we have a config knob to shrink the arena (de-allocates unused memory chunks).
>
> Not sure if we have enough tools to accomplish this in Python just yet. The best way to use this feature in C++ is to:
>
> 1. Not allocate weights memory through the arena: See here
> 2. Configure the arena to have high enough initial chunk to support most Run() calls. See "initial_chunk_size_bytes" here
> 3. Finally, configure the arena to shrink on every Run(). See here. This will keep the initial chunk allocated but de-allocate any unused chunk remaining after the Run() call ends.
>
> For example, if the initial chunk size is set as 500MB, the first Run() will allocate 500MB + any additional chunks required to service the Run() call. The additional chunks will get de-allocated after Run() and only keep 500MB of memory allocated. It is important to not allocate weights (initializers) memory through the arena as that complicates the shrinkage. Hence, step (1).

Thanks a lot! Converting to a C++ deployment is too hard on short notice, so I'll look for other ways to do this from Python.

hariharans29 commented 3 years ago

Thanks. We will need to support configuring the arena in Python, so I will mark this as an enhancement.

zlbdzhh commented 2 years ago

When I followed your steps on a multi-GPU machine, I got this error: "Did not find an arena based allocator registered for device-id combination in the memory arena shrink list: gpu:1". How can I fix it?
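One likely cause of this error is a mismatch between the device id named in the shrink list and the device the session's arena allocator is actually registered on. A hedged sketch, assuming a session pinned to GPU 1 via the CUDA provider's `device_id` option and the documented `memory.enable_memory_arena_shrinkage` run config key:

```python
import onnxruntime as ort

# Pin the session to GPU 1; the arena-based allocator is then registered
# for that device, so the shrink-list entry must also say "gpu:1".
cuda_opts = {"device_id": 1}
sess = ort.InferenceSession(
    "model.onnx",  # placeholder path
    providers=[("CUDAExecutionProvider", cuda_opts), "CPUExecutionProvider"],
)

# The device id in the shrink key must match the provider's device_id;
# using "gpu:0" here would reproduce the "Did not find an arena based
# allocator" error, because no arena is registered for GPU 0.
ro = ort.RunOptions()
ro.add_run_config_entry("memory.enable_memory_arena_shrinkage", "gpu:1")
```

If the session really is on `gpu:1` and the error persists, the allocator may not be arena-based at all (e.g. the arena was disabled in session options), in which case there is nothing for the shrink list to act on.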

codender commented 2 years ago

> Thanks. We will need to support configuring the arena in Python. So, I will mark it an enhancement.

Hi, is it possible to release GPU memory from Python now?

chinmayjog13 commented 1 year ago

Hi, any update on this issue?

yingfenging commented 1 year ago

@hariharans29 Hi, is it possible to release GPU memory from Python now?

bluddy commented 1 year ago

I'd like to know if there is a way to limit the growth of memory usage, especially on the GPU. We make heavy use of onnxruntime and need to know whether we can rely on it in a Python-based system.
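For capping growth (as opposed to shrinking after the fact), the CUDA execution provider accepts a `gpu_mem_limit` option that bounds the arena's total size. A minimal sketch, with the 2 GiB limit and `model.onnx` as placeholder values:

```python
import onnxruntime as ort

# Cap the CUDA arena at 2 GiB. Allocation requests that would push the
# arena past the cap fail instead of growing GPU usage without bound,
# and kSameAsRequested avoids speculative over-allocation on extension.
cuda_opts = {
    "device_id": 0,
    "gpu_mem_limit": 2 * 1024 * 1024 * 1024,
    "arena_extend_strategy": "kSameAsRequested",
}
sess = ort.InferenceSession(
    "model.onnx",  # placeholder path
    providers=[("CUDAExecutionProvider", cuda_opts), "CPUExecutionProvider"],
)
```

Note the limit applies to the arena managed by ONNX Runtime; memory cuDNN/cuBLAS allocate internally is not counted against it, so observed usage can still exceed the cap somewhat.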

dawenxi-only commented 2 months ago

Is there any update? I am also facing the issue of freeing GPU memory from onnxruntime in Python.

yiluzhuimeng commented 1 month ago

Is there any update? I am also facing the issue of freeing GPU memory from onnxruntime in Python.