Z-XQ opened this issue 3 years ago

I want to release GPU memory in time and keep the session running. Thank you!
Can you please elaborate about your scenario? What exactly do you mean by "release GPU memory in time and keep the session running"? Do you mean you want to shrink any GPU memory arena associated with the session periodically while still keeping the session alive?
Thank you for your reply! The following is my scenario.
My GPU is a 3090. 708 MB of GPU memory is in use before opening an onnxruntime session. Then I use the following to open a session: `ort_session = onnxruntime.InferenceSession(model_path)`. After that, GPU memory usage grows to about 1.7 GB.
When it infers one image as follows, GPU memory usage grows to about 2.0 GB, and the number does not decline after the inference operation is over: `seg_raw_output = tmp.run([], {self.seg_input_name: seg_input_data})[0]`
Therefore, I want to release some of the GPU memory so that usage returns to about 1.7 GB while still keeping the session alive.
Thank you! I am looking forward to your reply!
Thanks @Z-XQ for the explanation.
The GPU memory is backed by a memory pool (arena) and we have a config knob to shrink the arena (de-allocating unused memory chunks).
Not sure if we have enough tools to accomplish this in Python just yet. The best way to use this feature in C++ is to:
1) Not allocate weights memory through the arena: See here
2) Configure the arena to have high enough initial chunk to support most Run() calls. See "initial_chunk_size_bytes" here
3) Finally, configure the arena to shrink on every Run(). See here. This will keep the initial chunk allocated but de-allocate any unused chunk remaining after the Run() call ends.
For example, if the initial chunk size is set as 500MB, the first Run() will allocate 500MB + any additional chunks required to service the Run() call. The additional chunks will get de-allocated after Run() and only keep 500MB of memory allocated. It is important to not allocate weights (initializers) memory through the arena as that complicates the shrinkage. Hence, step (1).
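To make the flow concrete, here is a rough Python sketch of how the same steps look through the session/run config entries that later onnxruntime-gpu builds expose. This is a hedged sketch, not the officially supported path described above: the config keys `session.use_device_allocator_for_initializers` and `memory.enable_memory_arena_shrinkage`, and the `arena_extend_strategy` provider option, should be checked against your installed version, and the model path and input name are placeholders.

```python
import onnxruntime as ort

# Step 1: keep weights (initializers) out of the arena so shrinkage stays simple.
so = ort.SessionOptions()
so.add_session_config_entry("session.use_device_allocator_for_initializers", "1")

# Related to step 2: "kSameAsRequested" grows the arena only by what a Run()
# actually asks for, so the extra chunks can be handed back by the shrinkage
# below. (Setting initial_chunk_size_bytes itself may not be exposed here.)
provider_options = [{"device_id": 0, "arena_extend_strategy": "kSameAsRequested"}]

sess = ort.InferenceSession(
    "model.onnx",                      # placeholder path
    sess_options=so,
    providers=["CUDAExecutionProvider"],
    provider_options=provider_options,
)

# Step 3: shrink the GPU arena at the end of every Run().
ro = ort.RunOptions()
ro.add_run_config_entry("memory.enable_memory_arena_shrinkage", "gpu:0")

outputs = sess.run(None, {"input": input_data}, run_options=ro)  # input_data: your tensor
```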
Thanks a lot! It's too hard to convert to a C++ deployment in a short time. I'll figure out other ways using Python code.
Thanks. We will need to support configuring the arena in Python, so I will mark this as an enhancement.
When I followed what you said on a multi-GPU setup, I got this error: "Did not find an arena based allocator registered for device-id combination in the memory arena shrink list: gpu:1". How can I fix it?
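Not a maintainer, but that message usually means the session never registered an arena-based allocator for `gpu:1`, so the shrink list names a device the session doesn't own. A guess at the fix, assuming the session in question runs on device 1 via the CUDA execution provider (path and feeds are placeholders):

```python
import onnxruntime as ort

# The device in the shrink list must be one this session registered an
# arena-based allocator for. If the session runs on GPU 1, keep both in sync:
provider_options = [{"device_id": 1, "arena_extend_strategy": "kSameAsRequested"}]
sess = ort.InferenceSession("model.onnx",  # placeholder path
                            providers=["CUDAExecutionProvider"],
                            provider_options=provider_options)

ro = ort.RunOptions()
ro.add_run_config_entry("memory.enable_memory_arena_shrinkage", "gpu:1")
outputs = sess.run(None, feeds, run_options=ro)  # feeds: your input dict
```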
Hi, can I release GPU memory in Python now?
Hi, any update on this issue?
@hariharans29 Hi, can I release GPU memory in Python now?
I'd like to know if there is a way to limit the growth of memory usage, especially on the GPU. We're making heavy use of onnxruntime and need to know whether we can rely on it in a Python-based system.
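For what it's worth, the CUDA execution provider takes a `gpu_mem_limit` provider option that caps how far the arena can grow (allocations beyond the cap fail rather than extending the pool). A minimal sketch; the 2 GiB cap is just an example value and the model path is a placeholder:

```python
import onnxruntime as ort

provider_options = [{
    "device_id": 0,
    "gpu_mem_limit": 2 * 1024 * 1024 * 1024,      # hard cap on the arena (~2 GiB)
    "arena_extend_strategy": "kSameAsRequested",  # grow only by what is requested
}]
sess = ort.InferenceSession("model.onnx",  # placeholder path
                            providers=["CUDAExecutionProvider"],
                            provider_options=provider_options)
```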
Is there any update? I am also facing the issue of freeing memory from onnxruntime in Python
Is there any update? I am facing the issue of freeing memory from onnxruntime in Python, too.