Open JiaYK opened 1 year ago
The results only become incorrect when many requests are computed at the same time; simply starting multiple processes does not by itself cause calculation errors. Judging from the results, it feels as if CUDA is executing out of order.
Can you describe the issue in more detail and steps to repro this? You've 2 processes each creating a session for a different model and each targeting a different device id and this doesn't work?
Hello! Thank you for your reply.
I have two processes, each of which creates one session, so there are two sessions in total, and the models running in them are exactly the same. Both are placed on the same device (an A40 GPU) and run behind a Flask web service started with Gunicorn. At low request rates there is no problem. However, when I run a load test (using Siege) that hits the service concurrently, the model's results become incorrect (sometimes correct, sometimes slightly off, and sometimes completely wrong). I have used fp16 quantization to speed up inference. According to nvidia-smi, GPU utilization is at 100% when the errors occur.
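For reference, a minimal sketch of one common way to convert an ONNX model to fp16 with onnxconverter-common; this is an assumption about how the quantization was done, and the file names are placeholders:

```python
import onnx
from onnxconverter_common import float16

# Load the fp32 model and convert weights/ops to fp16
# (illustrative only; paths and the conversion tool are assumptions).
model_fp32 = onnx.load('model_fp32.onnx')
model_fp16 = float16.convert_float_to_float16(model_fp32)
onnx.save(model_fp16, 'model.onnx')
```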
Below, I will briefly describe using pseudocode
```python
import onnxruntime as ort
from flask import Flask, jsonify, render_template, redirect

app = Flask(__name__)

class Model:
    def __init__(self):
        options = ort.SessionOptions()
        providers = [
            ('CUDAExecutionProvider', {
                'cudnn_conv_algo_search': 'DEFAULT',
                'do_copy_in_default_stream': True,
                'cudnn_conv_use_max_workspace': 1,
                'cudnn_conv1d_pad_to_nc1d': 1,
            }),
            'CPUExecutionProvider',
        ]
        self.sess = ort.InferenceSession(
            f'{model_dir}/model.onnx', sess_options=options, providers=providers)
        self.io_binding = self.sess.io_binding()

    def __call__(self, args):
        # ... bind inputs/outputs to self.io_binding ...
        self.sess.run_with_iobinding(self.io_binding)
        return result

model = Model()

@app.route('/')
def index():
    result = model(args)
    return jsonify(result=result)
```
```shell
gunicorn -b 0.0.0.0:8000 -w 5 --timeout 0 flask:app
siege -c 10 -r 300 'http://127.0.0.1/someargs'
```
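For completeness, a minimal sketch of how the inputs and outputs might be bound per request; the tensor names "input" and "output" and the use of CPU-side binding are assumptions, not taken from the actual model:

```python
import numpy as np

def run_once(sess, input_array):
    # Create a fresh IOBinding for this request rather than reusing one
    # long-lived binding (illustrative sketch; tensor names are hypothetical).
    io_binding = sess.io_binding()
    io_binding.bind_cpu_input('input', np.ascontiguousarray(input_array))
    io_binding.bind_output('output')  # let ORT allocate the output buffer
    sess.run_with_iobinding(io_binding)
    return io_binding.copy_outputs_to_cpu()[0]
```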
Since you don't have a device id in your config, the default device id 0 is used. This means you're sharing the same device between the 2 processes, which may introduce context switching, and you may not see concurrent execution. You mentioned that this doesn't occur with 1.14.0, which is a bit surprising since this is general GPU behavior, not related to the ORT version. Are you able to use 2 different devices to keep things separate?
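For example, a minimal sketch of how each process could target its own GPU via the CUDA EP's `device_id` provider option (the device index shown here is arbitrary):

```python
providers = [
    ('CUDAExecutionProvider', {
        'device_id': 1,  # target the second GPU; choose a different id per worker process
        'cudnn_conv_algo_search': 'DEFAULT',
        'do_copy_in_default_stream': True,
    }),
    'CPUExecutionProvider',
]
sess = ort.InferenceSession(f'{model_dir}/model.onnx', providers=providers)
```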
Yes, the two processes share a single GPU, but each process has its own independent memory space. This should be quite common, for example, training two models simultaneously on a single GPU. I put multiple processes on the same device because the models are relatively small; if I only use one process, GPU utilization may be as low as 15%, whereas multiple processes can serve more requests per second.
Describe the issue
Hello!
When I use Gunicorn to start multiple processes on a single graphics card, onnxruntime-gpu 1.14.1 produces incorrect results under high load without reporting any errors! This is very strange. If I only use one process, there is no problem, and with onnxruntime-gpu 1.14.0 the problem does not occur. I also tried the following session options:
```python
options.enable_mem_pattern = False
options.enable_mem_reuse = False
options.enable_cpu_mem_arena = False
options.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL
options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_DISABLE_ALL
```
but these settings do not guarantee correct calculation results either.
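For reference, a minimal sketch of how these options would be constructed and passed to the session (the path and provider list are placeholders):

```python
options = ort.SessionOptions()
options.enable_mem_pattern = False
options.enable_mem_reuse = False
options.enable_cpu_mem_arena = False
options.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL
options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_DISABLE_ALL

sess = ort.InferenceSession('model.onnx', sess_options=options,
                            providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])
```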
To reproduce
```python
@torch.no_grad()
def __call__(self, batch, cfg):
    # self.model is an fp16 Torch model; the body is omitted here
    ...
```
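As a point of reference only, a minimal sketch of how such an fp16 Torch model might be exported to ONNX; the model class, input shape, and tensor names are assumptions, not the actual export code:

```python
import torch

# Hypothetical export of an fp16 model (MyModel, shapes, and names are placeholders).
model = MyModel().half().eval().cuda()
dummy = torch.randn(1, 3, 224, 224, dtype=torch.float16, device='cuda')
torch.onnx.export(
    model, dummy, 'model.onnx',
    input_names=['input'], output_names=['output'],
    dynamic_axes={'input': {0: 'batch'}, 'output': {0: 'batch'}},
    opset_version=17,
)
```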
Urgency
No response
Platform
Linux
OS Version
Ubuntu 22.04
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
onnxruntime-gpu 1.14.1
ONNX Runtime API
Python
Architecture
X64
Execution Provider
CUDA
Execution Provider Library Version
CUDA11.7
Model File
No response
Is this a quantized model?
Yes