Gaggi72 opened 9 months ago
This should help. Try setting this run option. https://github.com/microsoft/onnxruntime/blob/e9ab56fa64c0644a2dc5287d1cd3b945bf7d7981/include/onnxruntime/core/session/onnxruntime_run_options_config_keys.h#L19-L27
@pranavsharma, thank you very much. It works for me.
The example python code is here. https://github.com/microsoft/onnxruntime/blob/4bfa69def85476b33ccfaf68cf070f3fb65d39f7/onnxruntime/test/python/onnxruntime_test_python.py#L1586
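For reference, a minimal sketch of setting that run option from Python, assuming a session on CUDA device 0 (the model path and input names here are placeholders, not from the original thread):

```python
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession(
    "model.onnx",  # placeholder model path
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# Ask ORT to shrink the CUDA memory arena at the end of this Run().
ro = ort.RunOptions()
ro.add_run_config_entry("memory.enable_memory_arena_shrinkage", "gpu:0")

# "input_ids" / "attention_mask" are placeholder input names.
outputs = sess.run(
    None,
    {
        "input_ids": np.zeros((1, 8), dtype=np.int64),
        "attention_mask": np.ones((1, 8), dtype=np.int64),
    },
    run_options=ro,
)
```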
By the way, inference with this option enabled is slower than I expected.
Are there any other options that avoid the decrease in inference speed?
It shouldn't have impacted the inference speed. Can you tell where the extra time is being spent? Probably in the memory cleanup after the inferencing is complete?
Some decrease in speed is expected, as the memory cleanup involves invoking multiple (in most cases) `cudaFree()` calls, and that cost is cooked into the `Run()` call. To best use this feature, it is important not to allocate weights through the memory pool (arena) and to set a high enough "initial" chunk size for the arena such that "most" `Run()` calls can be serviced by that initial chunk and do not invoke shrinkage. If there are outlier `Run()` calls that allocate more memory than the initial chunk (a model that processes large sequences, for example), that extra memory will be freed at the end of the `Run()`. The "initial" chunk is never freed until the end of the session.
Please see the detailed comment here - https://github.com/microsoft/onnxruntime/issues/9509#issuecomment-951546580
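As a hedged sketch of the kind of configuration described above: the CUDA execution provider accepts provider options that cap the arena and control how it grows (the values below are illustrative, not recommendations; the arena's initial chunk size itself is configured through `OrtArenaCfg` in the C API):

```python
import onnxruntime as ort

# Illustrative values only; tune for your model and GPU.
cuda_options = {
    "device_id": 0,
    # Grow the arena only by what is requested instead of power-of-two jumps,
    # which keeps the arena closer to the real working set.
    "arena_extend_strategy": "kSameAsRequested",
    # Upper bound on the arena size in bytes (2 GiB here).
    "gpu_mem_limit": 2 * 1024 * 1024 * 1024,
}

sess = ort.InferenceSession(
    "model.onnx",  # placeholder path
    providers=[("CUDAExecutionProvider", cuda_options), "CPUExecutionProvider"],
)
```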
@pranavsharma yes, it is. @hariharans29's answer is valuable to me. I understand the context now. I appreciate it.
Describe the issue
I am using a sentence-transformers model with ONNX Runtime to compute embeddings. I have created a FastAPI app that initialises the ONNX Runtime inference session on app startup. Whenever new tokens are given for embedding creation, GPU memory is occupied and is not released after successful execution. I have changed gpu_mem_limit, but memory usage still exceeds it after k iterations. IO bindings are used for inferencing and are cleared on every API call. How can I free the GPU memory?
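In case it helps later readers, a hedged sketch of combining IO binding with the arena-shrinkage run option suggested in the comments above (the tensor names and output handling here are hypothetical, not the reporter's actual code):

```python
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])

def embed(input_ids: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    io_binding = sess.io_binding()
    # "input_ids" / "attention_mask" / "embeddings" are hypothetical names.
    io_binding.bind_cpu_input("input_ids", input_ids)
    io_binding.bind_cpu_input("attention_mask", attention_mask)
    io_binding.bind_output("embeddings")

    # Shrink the CUDA arena after this call so transient allocations are freed.
    ro = ort.RunOptions()
    ro.add_run_config_entry("memory.enable_memory_arena_shrinkage", "gpu:0")

    sess.run_with_iobinding(io_binding, run_options=ro)
    return io_binding.copy_outputs_to_cpu()[0]
```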
To reproduce
Urgency
No response
Platform
Linux
OS Version
20.04
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
1.16.3
ONNX Runtime API
Python
Architecture
X64
Execution Provider
CUDA
Execution Provider Library Version
11.8
Model File
No response
Is this a quantized model?
No