microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

[Performance] #15265

Open JiaYK opened 1 year ago

JiaYK commented 1 year ago

Describe the issue

Hello!

When I use Gunicorn to start multiple worker processes on a single graphics card, onnxruntime-gpu 1.14.1 produces incorrect results under heavy load without reporting any error. This is very strange. If I use only one process there is no problem, and when I install onnxruntime-gpu 1.14.0 the problem does not occur either. Furthermore, I tried the following session options:

```python
options.enable_mem_pattern = False
options.enable_mem_reuse = False
options.enable_cpu_mem_arena = False
options.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL
options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_DISABLE_ALL
```

These settings also do not guarantee correct calculation results.
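
For reference, a minimal sketch of how such options would be created and passed to the session; the model path and provider list here are placeholders rather than the exact configuration from this issue:

```python
import onnxruntime as ort

# Build the SessionOptions object referenced as `options` in the snippets below.
options = ort.SessionOptions()
options.enable_mem_pattern = False
options.enable_mem_reuse = False
options.enable_cpu_mem_arena = False
options.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL
options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_DISABLE_ALL

# Placeholder model path and provider list; the real setup uses the CUDA EP
# configuration shown in the pseudocode later in this thread.
sess = ort.InferenceSession('model.onnx', sess_options=options,
                            providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])
```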

To reproduce

```python
# Excerpt from the model wrapper class; `float16` below is presumably numpy.float16.
@torch.no_grad()
def __call__(self, batch, cfg):
    # self.model is an fp16 Torch model
    output_dict = self.model.inference(**batch, **cfg)
    if self.bind_device == 'cpu':
        output_dict['g'] = output_dict['g'].half()
        output_dict['z'] = output_dict['z'].half()
    # z [B, D, T]
    z = output_dict['z']
    g = output_dict['g']

    z = z.contiguous()
    g = g.contiguous()

    real_frame = z.shape[2]
    z = torch.nn.functional.pad(
        z, (0, self.inf_once_frame - real_frame % self.inf_once_frame), mode='constant', value=0)
    z = z.transpose(1,2).reshape([-1, self.inf_once_frame, self.inner_dim]).transpose(1,2)
    z = z.contiguous()
    batch_size = z.shape[0]

    g = g.repeat(batch_size, 1, 1)
    g = g.contiguous()

    self.io_binding_decoder.bind_input(
        name='c',
        device_type=self.bind_device,
        device_id=self.bind_device_id,
        element_type=float16,
        shape=tuple(z.shape),
        buffer_ptr=z.data_ptr(),
    )
    self.io_binding_decoder.bind_input(
        name='g',
        device_type=self.bind_device,
        device_id=self.bind_device_id,
        element_type=float16,
        shape=tuple(g.shape),
        buffer_ptr=g.data_ptr(),
    )

    out_shape = [batch_size, 1, z.shape[2]*self.upsample_scale]
    wavs_tensor = torch.empty(
        out_shape, dtype=torch.float16, device=self.device).contiguous()

    self.io_binding_decoder.bind_output(
        name='wav',
        device_type=self.bind_device,
        device_id=self.bind_device_id,
        element_type=float16,
        shape=tuple(wavs_tensor.shape),
        buffer_ptr=wavs_tensor.data_ptr(),
    )

    self.sess_decoder.run_with_iobinding(self.io_binding_decoder)
    out_wav = wavs_tensor.flatten()[:real_frame*self.upsample_scale]
    output_dict.update(wav=out_wav)
    return output_dict
```
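
The snippet above references `self.sess_decoder` and `self.io_binding_decoder` without showing how they are created. A rough sketch, assuming the same construction pattern as the Flask pseudocode later in this thread (the class name, model filename, and `device_id` argument are placeholders):

```python
import onnxruntime as ort

class Decoder:
    def __init__(self, model_dir, options, device_id=0):
        # Assumed construction of the session and IO binding used in __call__ above;
        # the model filename and provider options are placeholders.
        providers = [('CUDAExecutionProvider', {'device_id': device_id}),
                     'CPUExecutionProvider']
        self.sess_decoder = ort.InferenceSession(f'{model_dir}/decoder.onnx',
                                                 sess_options=options,
                                                 providers=providers)
        self.io_binding_decoder = self.sess_decoder.io_binding()
        # Device identifiers used when binding inputs and outputs.
        self.bind_device = 'cuda'
        self.bind_device_id = device_id
```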

Urgency

No response

Platform

Linux

OS Version

Ubuntu 22.04

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

onnxruntime-gpu 1.14.1

ONNX Runtime API

Python

Architecture

X64

Execution Provider

CUDA

Execution Provider Library Version

CUDA 11.7

Model File

No response

Is this a quantized model?

Yes

JiaYK commented 1 year ago

The results are only incorrect when many requests are computed at the same time; merely starting multiple processes does not by itself cause calculation errors. Judging from the output, it feels as if CUDA is executing out of order.

pranavsharma commented 1 year ago

Can you describe the issue in more detail and provide steps to repro it? You have 2 processes, each creating a session for a different model and each targeting a different device id, and this doesn't work?

JiaYK commented 1 year ago

Hello! Thank you for your reply.

I have two processes, each of which creates one session, so there are two sessions in total, and the models running in the two sessions are exactly the same. Both are placed on the same device (an A40 graphics card) and run inside a Flask web service started with Gunicorn. When the web service is accessed at a low rate there is no problem. However, when I run a load test (using Siege) so that many requests hit the service at the same time, the model's results are incorrect: sometimes correct, sometimes slightly off, and sometimes completely wrong. I use fp16 quantization to optimize speed. According to nvidia-smi, GPU utilization is at 100% when the errors occur.

JiaYK commented 1 year ago

Below, I will briefly describe the setup using pseudocode.

flask.py

```python
import onnxruntime as ort
from flask import Flask, jsonify, render_template, redirect

app = Flask(__name__)

class Model:
    def __init__(self):
        providers = [('CUDAExecutionProvider', {
                         "cudnn_conv_algo_search": 'DEFAULT',
                         'do_copy_in_default_stream': True,
                         'cudnn_conv_use_max_workspace': 1,
                         'cudnn_conv1d_pad_to_nc1d': 1,
                     }),
                     'CPUExecutionProvider']
        self.sess = ort.InferenceSession(f'{model_dir}/model.onnx',
                                         sess_options=options,
                                         providers=providers)
        self.io_binding = self.sess.io_binding()

    def __call__(self, args):
        # binding first (data comes from PyTorch CUDA tensors)
        self.sess.run_with_iobinding(self.io_binding)
        return result

model = Model()

@app.route('/')
def index():
    # get some args
    result = model(args)
    return jsonify(result=result)
```

run command (5 workers)

gunicorn -b 0.0.0.0:8000 -w 5 --timeout 0 flask:app

stress testing

siege -c 10 -r 300 'http://127.0.0.1/someargs'

Accessing the service in a browser during the stress test also returns wrong results.

pranavsharma commented 1 year ago

Since you don't have a device id in your config, the default device id is 0. This means you're sharing the same device between 2 processes, which may introduce context switching, and you may not see truly concurrent execution. You mentioned that this doesn't occur with 1.14.0, which is a bit surprising since this is general GPU behavior, not related to the ORT version. Are you able to use 2 different devices to keep things separate?
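
For illustration, a minimal sketch of how a specific device id could be assigned per process through the CUDA EP provider options (the device id and model path here are placeholders, and the remaining provider options from the pseudocode above would be kept unchanged):

```python
import onnxruntime as ort

# Hypothetical: pin this process's session to GPU 1 rather than the default GPU 0.
providers = [('CUDAExecutionProvider', {'device_id': 1}),
             'CPUExecutionProvider']
sess = ort.InferenceSession('model.onnx', providers=providers)
```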

JiaYK commented 1 year ago

Yes, the two processes share a single GPU, but each process has its own independent memory space. This should be quite common; for example, training two models simultaneously on a single GPU. I put multiple processes on the same device because the models are relatively small: with only one process, GPU utilization can be as low as 15%, whereas multiple processes can complete more requests per second.