**nistarlwc** opened this issue 2 years ago (Open)
Are you creating different InferenceSession instances in each thread? That would cause a lot of excess memory usage as each session has its own memory pool. An InferenceSession is stateless and can be called concurrently from multiple threads.
I create 2 InferenceSession instances as global values:

```python
import cv2
import onnxruntime as rt
from flask import Flask, request

app = Flask(__name__)
sess1 = rt.InferenceSession(model1)
sess2 = rt.InferenceSession(model2)

@app.route('/algorithm', methods=['POST'])
def parser():
    img = cv2.imread(...)
    prediction1 = sess1.run(...)
    prediction2 = sess2.run(...)

if __name__ == '__main__':
    app.run(host='127.0.0.1', port='12345', threaded=True)
```
And when GPU memory usage grows beyond 8G, I get this error:

```
onnxruntime::CudaCall CUBLAS failure 3: CUBLAS_STATUS_ALLOC_FAILED ; GPU=0 ; hostname=LAPTOP-D5571KG6 ; expr=cublasCreate(&cublas_handle_);
onnxruntime::BFCArena::AllocateRawInternal Failed to allocate memory for requested buffer of size 18874368
```

How should it be modified?
I tried to create one InferenceSession shared across threads, but it fails:

```python
import threading

import onnxruntime as rt
from flask import Flask, request

app = Flask(__name__)

class Singleton(object):
    _instance_lock = threading.Lock()

    def __init__(self):
        self.sess = rt.InferenceSession(model1, providers=['CUDAExecutionProvider'])

    def __new__(cls, *args, **kwargs):
        if not hasattr(Singleton, "_instance"):
            with Singleton._instance_lock:
                if not hasattr(Singleton, "_instance"):
                    Singleton._instance = object.__new__(cls, *args, **kwargs)
        return Singleton._instance

sess1 = Singleton()

@app.route('/algorithm', methods=['POST'])
def parser():
    prediction1 = sess1.sess.run(...)

if __name__ == '__main__':
    app.run(host='127.0.0.1', port='12345', threaded=True)
```
You could try limiting the memory size used by CUDA for each session by setting gpu_mem_limit: https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html#python
> onnxruntime::BFCArena::AllocateRawInternal Failed to allocate memory for requested buffer of size 18874368
However, this is only an 18MB allocation. The way the arena works is to grow in chunks at least 2x the size of the last chunk, so if it's only up to 18MB it's not being used much. Is something besides ORT also using GPU memory?
Another thing to try would be to set the logging severity to INFO level and see what sort of memory requirements each model has. I would do this separately for each model so the output is clear. Look for output from bfc_arena.cc:
```python
import onnxruntime as ort

so = ort.SessionOptions()
so.log_severity_level = 1  # INFO level
s = ort.InferenceSession('model.onnx', so, ort.get_available_providers())
```
When running the models, does the input to each model have the same size each time? If not, disabling the memory pattern planner might help: set `enable_mem_pattern` in SessionOptions to False.
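A minimal sketch combining the two suggestions above (INFO logging plus disabling memory patterns); `'model.onnx'` is a placeholder path, not a file from this thread:

```python
import onnxruntime as ort

# Session-level configuration suggested above: INFO logging so the BFCArena
# messages are visible, and memory patterns disabled in case the models are
# run with variable-sized inputs.
so = ort.SessionOptions()
so.log_severity_level = 1      # INFO level; look for BFCArena / bfc_arena.cc output
so.enable_mem_pattern = False  # skip the memory pattern planner

# 'model.onnx' is a placeholder for illustration only.
sess = ort.InferenceSession('model.onnx', so, providers=ort.get_available_providers())
```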
I use the Python API, and `gpu_mem_limit` does not work. I also use Flask to build the service. During the stress test, GPU usage also increases with concurrency.
```python
self.sess_options = onnxruntime.SessionOptions()
self.sess_options.intra_op_num_threads = 10
self.sess_options.execution_mode = onnxruntime.ExecutionMode.ORT_PARALLEL
self.sess_options.inter_op_num_threads = 10
self.sess_options.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL
providers = [
    ('CUDAExecutionProvider', {
        'device_id': 0,
        'gpu_mem_limit': 3 * 1024 * 1024 * 1024,
        'arena_extend_strategy': "kNextPowerOfTwo",
        'cudnn_conv_algo_search': 'EXHAUSTIVE',
        'do_copy_in_default_stream': True,
    }),
    'CPUExecutionProvider',
]
self.sess = onnxruntime.InferenceSession(model_path, sess_options=self.sess_options, providers=providers)
```
@nistarlwc Have you solved the problem?
@nistarlwc Even though you're returning an existing instance of the Singleton class, won't the `__init__` call create a new session after `__new__` returns that instance? i.e. should `Singleton.sess` be set once in `__new__` and never in `__init__`?
@xiaolang564321 FWIW the CUDA EP doesn't support parallel execution, so setting the execution mode and `inter_op_num_threads` is meaningless.
Can you clarify what you mean by `gpu_mem_limit` not working? That's the size used to limit the arena ORT creates for the CUDA EP. There will still be some other per-thread structures that use memory, and the CUDA library itself will use memory. If you turn on INFO level logging you should see output from the arena implementation saying how large it is and what allocations it has made.
Sorry, I don't understand your reply. Can you modify my code to explain it?
My understanding is that the Flask HTTP server may create a session for each call, so although `gpu_mem_limit` has been set, GPU memory still increases when there are multiple calls.
When I set `log_severity_level = 1`, the output contains:

```
Creating and using per session threadpools since use_per_session_threads_ is true
```

Is that the problem? How do I change it?
I use Flask to encapsulate an ONNX model as a service, and set different `gpu_mem_limit` values to stress test the service. The results are as follows.

`gpu_mem_limit`: 3 * 1024 * 1024 * 1024, `intra_op_num_threads`: 10

| id | concurrency | gpu |
|----|-------------|-----|
| 1  | 1  | 3.5g |
| 2  | 2  | 4.5g |
| 3  | 3  | 5.6g |
| 4  | 5  | 5.6g |
| 5  | 7  | 5.6g |
| 6  | 10 | 5.6g |

`gpu_mem_limit`: 6 * 1024 * 1024 * 1024, `intra_op_num_threads`: 10

| id | concurrency | gpu |
|----|-------------|-----|
| 1  | 1  | 3.5g |
| 2  | 2  | 4.5g |
| 3  | 3  | 5.6g |
| 4  | 5  | 5.6g |
| 5  | 7  | 5.6g |
| 6  | 10 | 5.6g |

I set different `gpu_mem_limit` values, but the GPU memory consumed during the stress test is the same, so `gpu_mem_limit` is not working.
Something like this:
```python
class Singleton(object):
    _instance_lock = threading.Lock()

    def __new__(cls, *args, **kwargs):
        if not hasattr(Singleton, "_instance"):
            with Singleton._instance_lock:
                if not hasattr(Singleton, "_instance"):
                    Singleton._instance = object.__new__(cls, *args, **kwargs)
                    Singleton._instance.sess = rt.InferenceSession(model1, providers=['CUDAExecutionProvider'])
        return Singleton._instance
```
That will at least create the InferenceSession only once instead of `__init__` recreating it each time.
I'm not familiar with Flask so I don't know if additional steps are required to ensure it only creates one instance, e.g. this mentions a few other potential options like `multiprocessing.Manager`.
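A quick pure-Python check of the pattern above, with a stub factory standing in for `rt.InferenceSession` (an assumption here, so no model is needed) to confirm the session is created exactly once no matter how many times the singleton is requested:

```python
import threading

creation_count = 0

def fake_session_factory():
    # Stand-in for rt.InferenceSession(...); counts how many "sessions" get created.
    global creation_count
    creation_count += 1
    return object()

class Singleton(object):
    _instance_lock = threading.Lock()

    def __new__(cls):
        if not hasattr(Singleton, "_instance"):
            with Singleton._instance_lock:
                # Double-checked locking: only the first caller creates the instance.
                if not hasattr(Singleton, "_instance"):
                    Singleton._instance = object.__new__(cls)
                    # Session is created once, inside the locked check.
                    Singleton._instance.sess = fake_session_factory()
        return Singleton._instance

a = Singleton()
b = Singleton()
assert a is b          # same instance every time
assert a.sess is b.sess
assert creation_count == 1  # the "session" was built exactly once
```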
> output information: Creating and using per session threadpools since use_per_session_threads_ is true
> Is it the problem? How to change it?
That is normal. The memory usage is largely per-session, not per-thread.
> I set different gpu_mem_limit, but consume the same GPU during pressure measurement. The gpu_mem_limit is not working.
Have you checked the output to ensure multiple InferenceSession instances are not being created? The memory limit setting is per-session. The log messages should show how large the arena in a session is. Look for BFCArena in the output.
@skottmckay I have a question: should I use multiprocessing or multithreading?
I would have thought multi-threading was the only way to share the inference session unless you have something external providing it.
@skottmckay @xiaolang564321 I know this problem is caused by thread isolation. Every POST creates a new object, so all the session settings fail. But I don't know how to share the session without thread isolation.
@xiaolang564321 Are you Chinese?
@skottmckay I tried using `id(sess)` to check the memory address of `sess` in each thread. All the addresses are the same, so these threads create only one session.
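The `id()` check described above can be sketched with a stand-in object (an assumption; any object works for verifying identity, no real InferenceSession required):

```python
import threading

# Stand-in for a module-level shared InferenceSession.
shared_session = object()
seen_ids = []
lock = threading.Lock()

def worker():
    # Record id(shared_session) as observed from this thread.
    with lock:
        seen_ids.append(id(shared_session))

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every thread saw the exact same object, i.e. only one "session" exists.
assert len(set(seen_ids)) == 1
```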
Is my Chinglish that obvious?
@nistarlwc I use Flask to encapsulate 2 ONNX models. Can the two models share the GPU? In other words, if one model needs 3G and the other needs 4G, does the whole service need a 4G GPU or a 7G GPU?
Your username looks like pinyin. Add me on QQ so we can chat: 286409171
@skottmckay @xiaolang564321 onnxruntime is not thread safe. I tried using multiple threads without Flask: `gpu_mem_limit` fails, and `intra_op_num_threads` fails.
Have you solved this problem? I found a similar problem when deploying an ONNX model with Flask.
Such a serious problem still hasn't been fixed; how can this project be used?
**Describe the bug**
I created a server that can run 2 sessions with multiple threads using Flask:

```python
app = Flask(__name__)
app.run(host='127.0.0.1', port='12345', threaded=True)
```

When running 3 threads, GPU memory stays under 8G and the program runs. But when running 4 threads, GPU memory grows beyond 8G and the program fails with: `onnxruntime::CudaCall CUBLAS failure 3: CUBLAS_STATUS_ALLOC_FAILED`.
I know the problem is a GPU memory leak, but I hope the program doesn't crash. So I tried to limit the number of threads by setting `intra_op_num_threads = 2`, or `inter_op_num_threads = 2`, or `os.environ["OMP_NUM_THREADS"] = "2"`, but none of them work. Is there any way to limit the number of threads used? Or to block and only run the next prediction when the GPU has free memory?
**System information**