microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

How to use Flask with onnxruntime #11156

Open nistarlwc opened 2 years ago

nistarlwc commented 2 years ago

Describe the bug
I created a server that runs 2 sessions with multiple threads using Flask:

app = Flask(__name__)
app.run(host='127.0.0.1', port='12345', threaded=True)

When 3 threads are running and the GPU's memory stays below 8G, the program works.
But when 4 threads are running and the GPU's memory exceeds 8G, the program fails with: onnxruntime::CudaCall CUBLAS failure 3: CUBLAS_STATUS_ALLOC_FAILED.

I know the problem is that GPU memory runs out, but I would like the program not to crash. So I tried to limit the number of threads by setting intra_op_num_threads = 2, inter_op_num_threads = 2, or os.environ["OMP_NUM_THREADS"] = "2", but none of them work.

Is there any way to limit the number of threads used? Or to block and only run the next prediction when the GPU has free memory?

System information

skottmckay commented 2 years ago

Are you creating different InferenceSession instances in each thread? That would cause a lot of excess memory usage as each session has its own memory pool. An InferenceSession is stateless and can be called concurrently from multiple threads.
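For illustration, a minimal sketch of that pattern (not from this thread; the model path, input shape, and worker count are placeholder assumptions):

import numpy as np
import onnxruntime as rt
from concurrent.futures import ThreadPoolExecutor

# One session, created once and shared; run() can be called concurrently from many threads.
sess = rt.InferenceSession('model.onnx', providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])
input_name = sess.get_inputs()[0].name

def predict(batch):
    # No per-thread session (and no lock) is needed around run().
    return sess.run(None, {input_name: batch})

with ThreadPoolExecutor(max_workers=4) as pool:
    dummy = np.zeros((1, 3, 224, 224), dtype=np.float32)
    results = list(pool.map(predict, [dummy] * 4))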

nistarlwc commented 2 years ago

Are you creating different InferenceSession instances in each thread? That would cause a lot of excess memory usage as each session has its own memory pool. An InferenceSession is stateless and can be called concurrently from multiple threads.

I create 2 InferenceSessions as global values:

import cv2
import onnxruntime as rt
from flask import Flask, request

app = Flask(__name__)

# The two sessions are created once as globals and shared by all request threads.
sess1 = rt.InferenceSession(model1)
sess2 = rt.InferenceSession(model2)

@app.route('/algorithm', methods=['POST'])
def parser():
    img = cv2.imread(...)
    prediction1 = sess1.run(...)
    prediction2 = sess2.run(...)

if __name__ == '__main__':
    app.run(host='127.0.0.1', port='12345', threaded=True)

And when the GPU memory exceeds 8G, I get this error:

onnxruntime::CudaCall CUBLAS failure 3: CUBLAS_STATUS_ALLOC_FAILED ; GPU=0 ; hostname=LAPTOP-D5571KG6 ; expr=cublasCreate(&cublas_handle_); 
onnxruntime::BFCArena::AllocateRawInternal Failed to allocate memory for requested buffer of size 18874368

How should it be modified?

nistarlwc commented 2 years ago

Are you creating different InferenceSession instances in each thread? That would cause a lot of excess memory usage as each session has its own memory pool. An InferenceSession is stateless and can be called concurrently from multiple threads.

I tried to create a single InferenceSession shared across threads, but it still fails:

import onnxruntime as rt
from flask import Flask, request
import threading
app = Flask(__name__)

class Singleton(object):
    _instance_lock = threading.Lock()

    def __init__(self):
        self.sess = rt.InferenceSession(model1, providers=['CUDAExecutionProvider'])

    def __new__(cls, *args, **kwargs):
        if not hasattr(Singleton, "_instance"):
            with Singleton._instance_lock:
                if not hasattr(Singleton, "_instance"):
                    Singleton._instance = object.__new__(cls, *args, **kwargs)
        return Singleton._instance

sess1 = Singleton()

@app.route('/algorithm', methods=['POST'])
def parser():
    prediction1 = sess1.sess.run(...)

if __name__ == '__main__':
    app.run(host='127.0.0.1', port='12345', threaded=True)

skottmckay commented 2 years ago

You could try limiting the memory size used by CUDA for each session by setting gpu_mem_limit: https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html#python

onnxruntime::BFCArena::AllocateRawInternal Failed to allocate memory for requested buffer of size 18874368

However this is only an 18MB allocation. The way the arena works is to grow in chunks at least 2x the size of the last chunk. If it's only up to 18MB it's not being used much. Is something besides ORT also using GPU memory?

Another thing to try would be to set the logging severity to INFO level and see what sort of memory requirements each model has. I would do this separately so the output is clear. Look for output from bfc_arena.cc.

import onnxruntime as ort
so = ort.SessionOptions()
so.log_severity_level = 1  # INFO level
s = ort.InferenceSession('model.onnx', so, ort.get_available_providers())

When running the models does the input to each model have the same size each time? If not, disabling the memory pattern planner might help. Set enable_mem_pattern in SessionOptions to False to do that.
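For reference, a hedged sketch that combines these suggestions in one place (the 2 GB limit and model path are arbitrary placeholders, not values recommended in this thread):

import onnxruntime as ort

so = ort.SessionOptions()
so.log_severity_level = 1      # INFO level; BFCArena allocations show up in the log
so.enable_mem_pattern = False  # worth trying if input sizes differ between calls

providers = [
    ('CUDAExecutionProvider', {
        'device_id': 0,
        'gpu_mem_limit': 2 * 1024 * 1024 * 1024,  # cap the CUDA arena for this session
    }),
    'CPUExecutionProvider',
]

sess = ort.InferenceSession('model.onnx', sess_options=so, providers=providers)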

xiaolang564321 commented 2 years ago

I use the Python API, and "gpu_mem_limit" does not work. I also use Flask to build the service. During the stress test, GPU usage also grows with concurrency.

self.sess_options = onnxruntime.SessionOptions()
self.sess_options.intra_op_num_threads = 10
self.sess_options.execution_mode = onnxruntime.ExecutionMode.ORT_PARALLEL
self.sess_options.inter_op_num_threads = 10
self.sess_options.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL

providers = [
    ('CUDAExecutionProvider', {
            'device_id': 0,
            'gpu_mem_limit': 3 * 1024 * 1024 * 1024,
            'arena_extend_strategy': "kNextPowerOfTwo",
            'cudnn_conv_algo_search': 'EXHAUSTIVE',
            'do_copy_in_default_stream': True
            }),
    'CPUExecutionProvider',
]
self.sess = onnxruntime.InferenceSession(model_path, sess_options=self.sess_options, providers=providers)

@nistarlwc Have you solved the problem?

skottmckay commented 2 years ago

@nistarlwc Even though you're returning an existing instance of the Singleton class, won't the __init__ call create a new session after __new__ returns that instance? i.e. should Singleton.sess be set once in __new__ and never in __init__?

@xiaolang564321 fwiw the CUDA EP doesn't support parallel execution, so setting the execution mode and inter_op_num_threads is meaningless.

Can you clarify what you mean by the gpu_mem_limit not working? That's the size used to limit the arena ORT creates for the CUDA EP. There will still be some other per-thread structures that use memory, and the CUDA library itself will use memory. If you turn on INFO level logging you should see output from the arena implementation saying how large it is and what allocations it has made.

nistarlwc commented 2 years ago

@nistarlwc Even though you're returning an existing instance of the Singleton class, won't the __init__ call create a new session after __new__ returns that instance? i.e. should Singleton.sess be set once in __new__ and never in __init__?

@xiaolang564321 fwiw the CUDA EP doesn't support parallel execution, so setting the execution mode and inter_op_num_threads is meaningless.

Can you clarify what you mean by the gpu_mem_limit not working? That's the size used to limit the arena ORT creates for the CUDA EP. There will still be some other per-thread structures that use memory, and the CUDA library itself will use memory. If you turn on INFO level logging you should see output from the arena implementation saying how large it is and what allocations it has made.

Sorry, I don't understand your reply. Can you modify my code to show what you mean?
My understanding is that the Flask HTTP server may create a new session for each call, so even though gpu_mem_limit is set, GPU memory keeps increasing when there are multiple calls.

nistarlwc commented 2 years ago

You could try limiting the memory size used by CUDA for each session by setting gpu_mem_limit: https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html#python

onnxruntime::BFCArena::AllocateRawInternal Failed to allocate memory for requested buffer of size 18874368

However this is only an 18MB allocation. The way the arena works is to grow in chunks at least 2x the size of the last chunk. If it's only up to 18MB it's not being used much. Is something besides ORT also using GPU memory?

Another thing to try would be to set the logging severity to INFO level and see what sort of memory requirements each model has. I would do this separately so the output is clear. Look for output from bfc_arena.cc.

import onnxruntime as ort
so = ort.SessionOptions()
so.log_severity_level = 1  # INFO level
s = ort.InferenceSession('model.onnx', so, ort.get_available_providers())

When running the models does the input to each model have the same size each time? If not, disabling the memory pattern planner might help. Set enable_mem_pattern in SessionOptions to False to do that.

When I set log_severity_level = 1, the output includes:
Creating and using per session threadpools since use_per_session_threads_ is true
Is that the problem? How do I change it?

xiaolang564321 commented 2 years ago

@nistarlwc Even though you're returning an existing instance of the Singleton class, won't the __init__ call create a new session after __new__ returns that instance? i.e. should Singleton.sess be set once in __new__ and never in __init__?

@xiaolang564321 fwiw the CUDA EP doesn't support parallel execution, so setting the execution mode and inter_op_num_threads is meaningless.

Can you clarify what you mean by the gpu_mem_limit not working? That's the size used to limit the arena ORT creates for the CUDA EP. There will still be some other per-thread structures that use memory, and the CUDA library itself will use memory. If you turn on INFO level logging you should see output from the arena implementation saying how large it is and what allocations it has made.

I use Flask to encapsulate an ONNX model as a service and set different gpu_mem_limit values to stress test the service. The results are as follows.

gpu_mem_limit: 3 * 1024 * 1024 * 1024, intra_op_num_threads: 10

id | concurrency | gpu
1 | 1 | 3.5g
2 | 2 | 4.5g
3 | 3 | 5.6g
4 | 5 | 5.6g
5 | 7 | 5.6g
6 | 10 | 5.6g

gpu_mem_limit: 6 * 1024 * 1024 * 1024, intra_op_num_threads: 10

id | concurrency | gpu
1 | 1 | 3.5g
2 | 2 | 4.5g
3 | 3 | 5.6g
4 | 5 | 5.6g
5 | 7 | 5.6g
6 | 10 | 5.6g

I set different gpu_mem_limit values, but GPU memory consumption is the same during the stress test, so gpu_mem_limit is not working.

skottmckay commented 2 years ago

Something like this:

class Singleton(object):
    _instance_lock = threading.Lock()

    def __new__(cls, *args, **kwargs):
        if not hasattr(Singleton, "_instance"):
            with Singleton._instance_lock:
                if not hasattr(Singleton, "_instance"):
                    Singleton._instance = object.__new__(cls, *args, **kwargs)
                    # Create the InferenceSession only once, when the singleton is first constructed.
                    Singleton._instance.sess = rt.InferenceSession(model1, providers=['CUDAExecutionProvider'])
        return Singleton._instance

That will at least only create the InferenceSession once instead of __init__ recreating it each time.

I'm not familiar with Flask so I don't know if additional stuff is required to ensure it only creates one instance. e.g. this mentions a few other potential things that could be used like multiprocessing.Manager.

skottmckay commented 2 years ago

output information: Creating and using per session threadpools since use_per_session_threads_ is true. Is that the problem? How do I change it?

This is normal. The memory usage is largely per-session, not per-thread.

skottmckay commented 2 years ago

I set different gpu_mem_limit values, but GPU memory consumption is the same during the stress test, so gpu_mem_limit is not working.

Have you checked the output to ensure multiple InferenceSession instances are not being created? The memory limit setting is per-session. The log messages should show how large the arena in a session is. Look for BFCArena in the output.

nistarlwc commented 2 years ago

@skottmckay I have a question: should I use multiprocessing or multithreading?

skottmckay commented 2 years ago

I would have thought multi-threading was the only way to share the inference session unless you have something external providing it.

nistarlwc commented 2 years ago

@skottmckay @xiaolang564321 I know that this problem is caused by thread isolation.
Every POST request creates a new object, so all of the session settings fail. But I don't know how to share the session without thread isolation.

@xiaolang564321 Are you Chinese?

nistarlwc commented 2 years ago

@skottmckay I tried using id(sess) to check the memory address of sess in each thread.
All addresses are the same, so the threads really do share just one session.
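For example, a small sketch of that check, assuming sess1 is the Singleton from the earlier snippet and adding a hypothetical /check route purely for verification:

import threading

@app.route('/check', methods=['GET'])
def check_session():
    # If only one InferenceSession exists, every request thread reports the same session id.
    return {
        'session_id': id(sess1.sess),
        'thread_id': threading.get_ident(),
    }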

xiaolang564321 commented 2 years ago

@skottmckay @xiaolang564321 I know that this problem is caused by thread isolation. Every POST request creates a new object, so all of the session settings fail. But I don't know how to share the session without thread isolation.

@xiaolang564321 Are you Chinese?

Is my Chinglish that obvious?

xiaolang564321 commented 2 years ago

@nistarlwc I use Flask to encapsulate 2 ONNX models. Can the two models share the GPU? In other words, if one model needs 3G and the other needs 4G, does the whole service need a 4G GPU or a 7G GPU?

nistarlwc commented 2 years ago

@skottmckay @xiaolang564321 I know that this problem is caused by thread isolation. Every POST request creates a new object, so all of the session settings fail. But I don't know how to share the session without thread isolation. @xiaolang564321 Are you Chinese?

Is my Chinglish that obvious?

Your name looks like pinyin. Let's chat on QQ: 286409171.

nistarlwc commented 2 years ago

@skottmckay @xiaolang564321 onnxruntime is not thread safe. I tried using multiple threads without Flask:
gpu_mem_limit failed,
intra_op_num_threads failed.

sddygaizhihao commented 1 year ago

Have you solved this problem? I ran into a similar problem when deploying an ONNX model with Flask.

jindameias commented 8 months ago

Such a serious problem still has not been fixed; how can this project be used?