microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

[Performance] Unexpected prediction for OCR model in Flask multithreading #21288

Closed KhanhDinhDuy closed 2 months ago

KhanhDinhDuy commented 3 months ago

Describe the issue

I have an OCR model with the following architecture: ResNet-BiLSTM-CTC. Session/provider setup:

```python
cuda_provider_options = {'gpu_mem_limit': 2 * 1024 * 1024 * 1024}
providers = [("CUDAExecutionProvider", cuda_provider_options), "CPUExecutionProvider"]
```
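For context, a minimal sketch of how the session might be created and run with these options (the model path `ocr_model.onnx` and the input shape are placeholders, not from the original report):

```python
import numpy as np
import onnxruntime as ort

# Same provider configuration as above: cap the CUDA EP at 2 GB, keep CPU as fallback.
cuda_provider_options = {'gpu_mem_limit': 2 * 1024 * 1024 * 1024}
providers = [("CUDAExecutionProvider", cuda_provider_options), "CPUExecutionProvider"]

# "ocr_model.onnx" is a placeholder for the exported ResNet-BiLSTM-CTC model.
session = ort.InferenceSession("ocr_model.onnx", providers=providers)

input_name = session.get_inputs()[0].name
# Placeholder input; the real shape depends on how the model was exported.
dummy = np.zeros((1, 1, 32, 128), dtype=np.float32)
logits = session.run(None, {input_name: dummy})[0]
```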

When I test the model normally, with a single main process and dynamic batch sizes during inference, it runs as expected.

But when I serve it with Flask and multithreading (2 threads), I get unexpected outputs. Most of the time the outputs match my expectations, but sometimes I get something strange like "", "c Dc D A ct D c t I m ​​I N i o cI n c", ...

Note: I use the same input samples in both tests.

To reproduce


Step 1: Normal inference after training
When I test the model with a single main process and dynamic batch sizes during inference, it runs as expected.

Step 2: Serve the service with Flask and multithreading
When I serve it with Flask and multithreading (2 threads), I get unexpected outputs. Most of the time the outputs match my expectations, but sometimes I get something strange like "", "c Dc D A ct D c t I m ​​I N i o cI n c", ...
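A minimal sketch of this serving setup, assuming the client sends an already-preprocessed tensor as JSON (the endpoint name and tensor layout are placeholders; the essential point is that Flask's threaded mode calls `session.run` from multiple threads):

```python
import numpy as np
import onnxruntime as ort
from flask import Flask, jsonify, request

app = Flask(__name__)

# One session shared by all request-handling threads.
session = ort.InferenceSession(
    "ocr_model.onnx",  # placeholder model path
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
input_name = session.get_inputs()[0].name

@app.route("/ocr", methods=["POST"])  # placeholder endpoint
def ocr():
    # Placeholder: the real service decodes an image and builds a dynamic batch;
    # here the preprocessed tensor is assumed to arrive as a nested JSON list.
    batch = np.asarray(request.get_json()["input"], dtype=np.float32)
    logits = session.run(None, {input_name: batch})[0]
    # Placeholder: the real service applies CTC decoding to produce text.
    return jsonify({"logits_shape": list(logits.shape)})

if __name__ == "__main__":
    # threaded=True handles requests on multiple threads, so session.run()
    # is invoked concurrently (two concurrent requests = two threads).
    app.run(host="0.0.0.0", port=5000, threaded=True)
```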

When the error occurs:

Urgency

No response

Platform

Linux

OS Version

Ubuntu 20.04

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

onnxruntime-gpu==1.14.1

ONNX Runtime API

Python

Architecture

X64

Execution Provider

CUDA

Execution Provider Library Version

CUDA 11.6.2

Model File

No response

Is this a quantized model?

No

tianleiwu commented 2 months ago

It's likely some operator is not thread safe. To identify which operator has the issue in CUDA, we can force the op to fall back to CPU (this requires commenting out the operator in RegisterCudaContribKernels or RegisterCudaKernels and building from source). If you need assistance, please share your ONNX model and an example input that can reproduce the issue.
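For example, a minimal two-thread reproduction outside Flask might look like the sketch below (the model path and input shape are placeholders; it compares concurrent outputs against a single-threaded baseline):

```python
import threading
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession(
    "ocr_model.onnx",  # placeholder model path
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
input_name = session.get_inputs()[0].name

# Placeholder sample; a real repro would use one of the failing inputs.
sample = np.random.rand(1, 1, 32, 128).astype(np.float32)
baseline = session.run(None, {input_name: sample})[0]

mismatches = []

def worker():
    # Run the same input repeatedly and record any output that differs
    # from the single-threaded result.
    for _ in range(100):
        out = session.run(None, {input_name: sample})[0]
        if out.shape != baseline.shape or not np.allclose(out, baseline):
            mismatches.append(out)

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(f"{len(mismatches)} concurrent runs differed from the single-threaded baseline")
```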

KhanhDinhDuy commented 2 months ago

> It's likely some operator is not thread safe. To identify which operator has the issue in CUDA, we can force the op to fall back to CPU (this requires commenting out the operator in RegisterCudaContribKernels or RegisterCudaKernels and building from source). If you need assistance, please share your ONNX model and an example input that can reproduce the issue.

Thank you for your recommendation. I found a solution here: https://github.com/microsoft/onnxruntime/issues/15154. The option `options.enable_mem_pattern = False` resolved my problem. Thank you!
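For anyone hitting the same problem, a minimal sketch of applying that workaround when creating the session (the model path is a placeholder):

```python
import onnxruntime as ort

options = ort.SessionOptions()
# Workaround from issue #15154: disable memory-pattern optimization,
# which resolved the corrupted outputs under multithreaded inference here.
options.enable_mem_pattern = False

session = ort.InferenceSession(
    "ocr_model.onnx",  # placeholder model path
    sess_options=options,
    providers=[
        ("CUDAExecutionProvider", {"gpu_mem_limit": 2 * 1024 * 1024 * 1024}),
        "CPUExecutionProvider",
    ],
)
```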