microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

[Performance] Unexpected prediction for OCR model in Flask multithreading #21288

Closed KhanhDinhDuy closed 2 months ago

KhanhDinhDuy commented 3 months ago

Describe the issue

I have an OCR model with the following architecture: ResNet-BiLSTM-CTC. Session/provider setup:

```python
cuda_provider_options = {'gpu_mem_limit': 2 * 1024 * 1024 * 1024}
providers = [("CUDAExecutionProvider", cuda_provider_options), "CPUExecutionProvider"]
```
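For context, a minimal sketch of how the session might be created and run with these options (the model path `ocr_model.onnx` and the input shape are placeholders, not from the original report):

```python
import numpy as np
import onnxruntime as ort

# Same provider configuration as above: cap the CUDA EP at 2 GB, keep CPU as fallback.
cuda_provider_options = {'gpu_mem_limit': 2 * 1024 * 1024 * 1024}
providers = [("CUDAExecutionProvider", cuda_provider_options), "CPUExecutionProvider"]

# "ocr_model.onnx" is a placeholder for the exported ResNet-BiLSTM-CTC model.
session = ort.InferenceSession("ocr_model.onnx", providers=providers)

input_name = session.get_inputs()[0].name
# Placeholder input; the real shape depends on how the model was exported.
dummy = np.zeros((1, 1, 32, 128), dtype=np.float32)
logits = session.run(None, {input_name: dummy})[0]
```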

When I test the model normally, with a single main process and dynamic batch sizes during inference, it runs as expected.

But when I serve it with Flask and multithreading (2 threads), I get unexpected outputs. Most of the time the outputs match my expectations, but sometimes I get something strange like "", "c Dc D A ct D c t I m ​​I N i o cI n c", ...

Note: I use the same input samples in both tests.

To reproduce


Step 1: Normal inference after training
When I test the model with a single main process and dynamic batch sizes during inference, it runs as expected.

Step 2: Serve the service with Flask and multithreading
When I serve it with Flask and multithreading (2 threads), I get unexpected outputs. Most of the time the outputs match my expectations, but sometimes I get something strange like "", "c Dc D A ct D c t I m ​​I N i o cI n c", ...
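A minimal sketch of this serving setup, assuming the client sends an already-preprocessed tensor as JSON (the endpoint name and tensor layout are placeholders; the essential point is that Flask's threaded mode calls `session.run` from multiple threads):

```python
import numpy as np
import onnxruntime as ort
from flask import Flask, jsonify, request

app = Flask(__name__)

# One session shared by all request-handling threads.
session = ort.InferenceSession(
    "ocr_model.onnx",  # placeholder model path
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
input_name = session.get_inputs()[0].name

@app.route("/ocr", methods=["POST"])  # placeholder endpoint
def ocr():
    # Placeholder: the real service decodes an image and builds a dynamic batch;
    # here the preprocessed tensor is assumed to arrive as a nested JSON list.
    batch = np.asarray(request.get_json()["input"], dtype=np.float32)
    logits = session.run(None, {input_name: batch})[0]
    # Placeholder: the real service applies CTC decoding to produce text.
    return jsonify({"logits_shape": list(logits.shape)})

if __name__ == "__main__":
    # threaded=True handles requests on multiple threads, so session.run()
    # is invoked concurrently (two concurrent requests = two threads).
    app.run(host="0.0.0.0", port=5000, threaded=True)
```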

When the error occurs:

Urgency

No response

Platform

Linux

OS Version

Ubuntu 20.04

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

onnxruntime-gpu==1.14.1

ONNX Runtime API

Python

Architecture

X64

Execution Provider

CUDA

Execution Provider Library Version

CUDA 11.6.2

Model File

No response

Is this a quantized model?

No

tianleiwu commented 2 months ago

It's likely some operator is not thread safe. To identify which operator has the issue in CUDA, we can force the op to fall back to CPU (this requires commenting out the operator in RegisterCudaContribKernels or RegisterCudaKernels and building from source). If you need assistance, please share your ONNX model and an example input that can reproduce the issue.
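For example, a minimal two-thread reproduction outside Flask might look like the sketch below (the model path and input shape are placeholders; it compares concurrent outputs against a single-threaded baseline):

```python
import threading
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession(
    "ocr_model.onnx",  # placeholder model path
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
input_name = session.get_inputs()[0].name

# Placeholder sample; a real repro would use one of the failing inputs.
sample = np.random.rand(1, 1, 32, 128).astype(np.float32)
baseline = session.run(None, {input_name: sample})[0]

mismatches = []

def worker():
    # Run the same input repeatedly and record any output that differs
    # from the single-threaded result.
    for _ in range(100):
        out = session.run(None, {input_name: sample})[0]
        if out.shape != baseline.shape or not np.allclose(out, baseline):
            mismatches.append(out)

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(f"{len(mismatches)} concurrent runs differed from the single-threaded baseline")
```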

KhanhDinhDuy commented 2 months ago

> It's likely some operator is not thread safe. To identify which operator has the issue in CUDA, we can force the op to fall back to CPU (this requires commenting out the operator in RegisterCudaContribKernels or RegisterCudaKernels and building from source). If you need assistance, please share your ONNX model and an example input that can reproduce the issue.

Thank you for your recommendation. I found a solution here: https://github.com/microsoft/onnxruntime/issues/15154. The option `options.enable_mem_pattern = False` resolved my problem. Thank you!
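For anyone hitting the same problem, a minimal sketch of applying that workaround when creating the session (the model path is a placeholder):

```python
import onnxruntime as ort

options = ort.SessionOptions()
# Workaround from issue #15154: disable memory-pattern optimization,
# which resolved the corrupted outputs under multithreaded inference here.
options.enable_mem_pattern = False

session = ort.InferenceSession(
    "ocr_model.onnx",  # placeholder model path
    sess_options=options,
    providers=[
        ("CUDAExecutionProvider", {"gpu_mem_limit": 2 * 1024 * 1024 * 1024}),
        "CPUExecutionProvider",
    ],
)
```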