open-mmlab / mmdeploy

OpenMMLab Model Deployment Framework
https://mmdeploy.readthedocs.io/en/latest/
Apache License 2.0

SATRN gives only NaN with onnxruntime #735

Closed Antoine-Prieur closed 2 years ago

Antoine-Prieur commented 2 years ago

Hello, first, thanks for your work!

Describe the problem

I wanted to ask whether SATRN is actually compatible with ONNX, as the documentation says. I've tried to convert different versions of SATRN to ONNX; the conversion seems to work (with a few warnings), but when I run the model, it always gives me a tensor containing only NaN values. I've tried the full version of the model, trained with a custom config, and also the small version with the weights given in the MMOCR documentation, using the default config.

Reproduction

I used the following command to convert the model:

python -m tools.deploy configs/mmocr/text-recognition/text-recognition_onnxruntime_dynamic.py \ 
models/textrecog/satrn_small.py \
models/textrecog/satrn_small_20211009-2cf13355.pth \
models/textrecog/demo.jpg \
--work-dir models/textrecog/satrn_small/ \
--dump-info

which gave me the following output:

2022-07-11 14:34:21,218 - mmdeploy - INFO - Start pipeline mmdeploy.apis.pytorch2onnx.torch2onnx in subprocess
load checkpoint from local path: models/textrecog/satrn_small_20211009-2cf13355.pth
2022-07-11 14:34:49,445 - mmdeploy - WARNING - DeprecationWarning: get_onnx_config will be deprecated in the future. 
2022-07-11 14:34:49,446 - mmdeploy - INFO - Export PyTorch model to ONNX: models/textrecog/satrn_small/end2end.onnx.
/home/.../mmdeploy/.venv/lib64/python3.7/site-packages/torch/tensor.py:590: RuntimeWarning: Iterating over a tensor might cause the trace to be incorrect. Passing a tensor of different shape won't change the number of iterations executed (and might lead to errors or silently give incorrect results).
  'incorrect results).', category=RuntimeWarning)
/home/.../mmdeploy/mmdeploy/codebase/mmocr/models/text_recognition/base.py:51: TracerWarning: Converting a tensor to a Python integer might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  img_shape = [int(val) for val in img_shape]
Backend TkAgg is interactive backend. Turning interactive mode on.
/home/.../mmdeploy/.venv/lib64/python3.7/site-packages/mmocr/models/textrecog/encoders/satrn_encoder.py:75: TracerWarning: Converting a tensor to a Python float might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  valid_width = min(w, math.ceil(w * valid_ratio))
/home/.../mmdeploy/.venv/lib64/python3.7/site-packages/mmocr/models/textrecog/encoders/satrn_encoder.py:75: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  valid_width = min(w, math.ceil(w * valid_ratio))
/home/.../mmdeploy/.venv/lib64/python3.7/site-packages/mmocr/models/textrecog/decoders/nrtr_decoder.py:126: TracerWarning: Converting a tensor to a Python float might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  valid_width = min(T, math.ceil(T * valid_ratio))
/home/.../mmdeploy/.venv/lib64/python3.7/site-packages/mmocr/models/textrecog/decoders/nrtr_decoder.py:126: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  valid_width = min(T, math.ceil(T * valid_ratio))
2022-07-11 14:55:32,616 - mmdeploy - INFO - Execute onnx optimize passes.
2022-07-11 14:55:32,619 - mmdeploy - WARNING - Can not optimize model, please build torchscipt extension.
More details: https://github.com/open-mmlab/mmdeploy/blob/master/docs/en/experimental/onnx_optimizer.md
2022-07-11 14:55:37,522 - mmdeploy - INFO - Finish pipeline mmdeploy.apis.pytorch2onnx.torch2onnx
2022-07-11 14:55:39,435 - mmdeploy - INFO - visualize onnxruntime model start.
2022-07-11 14:56:09,272 - mmdeploy - INFO - Successfully loaded onnxruntime custom ops from             /home/.../mmdeploy/mmdeploy/lib/libmmdeploy_onnxruntime_ops.so
2022-07-11:14:56:09 - mmdeploy - INFO - Successfully loaded onnxruntime custom ops from             /home/.../mmdeploy/mmdeploy/lib/libmmdeploy_onnxruntime_ops.so
2022-07-11 14:56:21,617 - mmdeploy - INFO - visualize onnxruntime model success.
2022-07-11 14:56:21,617 - mmdeploy - INFO - visualize pytorch model start.
load checkpoint from local path: models/textrecog/satrn_small_20211009-2cf13355.pth
2022-07-11 14:56:54,364 - mmdeploy - INFO - visualize pytorch model success.
2022-07-11 14:56:54,364 - mmdeploy - INFO - All process success.
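
For reference, the same conversion can also be driven from Python (a sketch; the argument names are assumed from mmdeploy 0.x, whose mmdeploy.apis.pytorch2onnx.torch2onnx pipeline appears in the log above):

from mmdeploy.apis import torch2onnx

# Hypothetical programmatic equivalent of the tools.deploy call above;
# the argument names are assumptions, check your mmdeploy version.
torch2onnx(
    img='models/textrecog/demo.jpg',
    work_dir='models/textrecog/satrn_small/',
    save_file='end2end.onnx',
    deploy_cfg='configs/mmocr/text-recognition/text-recognition_onnxruntime_dynamic.py',
    model_cfg='models/textrecog/satrn_small.py',
    model_checkpoint='models/textrecog/satrn_small_20211009-2cf13355.pth',
    device='cpu')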

And I used the following code to test the model:

from onnxruntime import InferenceSession
import numpy as np

onnx_model = InferenceSession("models/textrecog/satrn_small/end2end.onnx")
# Run one random input through the exported recognizer
test = onnx_model.run(output_names=["output"],
                      input_feed={"input": np.random.randn(1, 3, 32, 100).astype(np.float32)})

The variable test then contains only NaN values (screenshot from 2022-07-11 omitted).
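
To confirm this programmatically, a minimal sketch (the custom-ops library path is taken from the conversion log above and shortened; registering it only matters if the exported graph uses mmdeploy custom ops):

from onnxruntime import InferenceSession, SessionOptions
import numpy as np

# Optionally register mmdeploy's custom ops before creating the session
# (path shortened from the log above; adjust it to your install).
opts = SessionOptions()
opts.register_custom_ops_library("mmdeploy/lib/libmmdeploy_onnxruntime_ops.so")

sess = InferenceSession("models/textrecog/satrn_small/end2end.onnx", opts)
out = sess.run(["output"], {"input": np.random.randn(1, 3, 32, 100).astype(np.float32)})[0]
print(np.isnan(out).all())  # prints True here: every element is NaN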

Environment

❯ python tools/check_env.py
2022-07-11 15:10:23,354 - mmdeploy - INFO - 

2022-07-11 15:10:23,354 - mmdeploy - INFO - **********Environmental information**********
2022-07-11 15:10:24,698 - mmdeploy - INFO - sys.platform: linux
2022-07-11 15:10:24,699 - mmdeploy - INFO - Python: 3.7.13 (default, Jun 10 2022, 20:09:34) [GCC 11.3.1 20220421 (Red Hat 11.3.1-2)]
2022-07-11 15:10:24,699 - mmdeploy - INFO - CUDA available: False
2022-07-11 15:10:24,699 - mmdeploy - INFO - GCC: gcc (GCC) 11.3.1 20220421 (Red Hat 11.3.1-2)
2022-07-11 15:10:24,699 - mmdeploy - INFO - PyTorch: 1.8.0
2022-07-11 15:10:24,699 - mmdeploy - INFO - PyTorch compiling details: PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v1.7.0 (Git Hash 7aed236906b1f7a05c0917e5257a1af05e9ff683)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=10.2, CUDNN_VERSION=7.6.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.8.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, 

2022-07-11 15:10:24,699 - mmdeploy - INFO - TorchVision: 0.9.0
2022-07-11 15:10:24,699 - mmdeploy - INFO - OpenCV: 4.5.4
2022-07-11 15:10:24,699 - mmdeploy - INFO - MMCV: 1.4.0
2022-07-11 15:10:24,699 - mmdeploy - INFO - MMCV Compiler: GCC 7.3
2022-07-11 15:10:24,699 - mmdeploy - INFO - MMCV CUDA Compiler: not available
2022-07-11 15:10:24,699 - mmdeploy - INFO - MMDeploy: 0.6.0+69c38b9
2022-07-11 15:10:24,699 - mmdeploy - INFO - 

2022-07-11 15:10:24,699 - mmdeploy - INFO - **********Backend information**********
2022-07-11 15:10:25,233 - mmdeploy - INFO - onnxruntime: 1.8.1  ops_is_avaliable : True
2022-07-11 15:10:25,235 - mmdeploy - INFO - tensorrt: None      ops_is_avaliable : False
2022-07-11 15:10:25,257 - mmdeploy - INFO - ncnn: None  ops_is_avaliable : False
2022-07-11 15:10:25,258 - mmdeploy - INFO - pplnn_is_avaliable: False
2022-07-11 15:10:25,260 - mmdeploy - INFO - openvino_is_avaliable: False
2022-07-11 15:10:25,260 - mmdeploy - INFO - 

2022-07-11 15:10:25,260 - mmdeploy - INFO - **********Codebase information**********
2022-07-11 15:10:25,262 - mmdeploy - INFO - mmdet:      2.20.0
2022-07-11 15:10:25,262 - mmdeploy - INFO - mmseg:      None
2022-07-11 15:10:25,262 - mmdeploy - INFO - mmcls:      None
2022-07-11 15:10:25,262 - mmdeploy - INFO - mmocr:      0.4.1
2022-07-11 15:10:25,262 - mmdeploy - INFO - mmedit:     None
2022-07-11 15:10:25,262 - mmdeploy - INFO - mmdet3d:    None
2022-07-11 15:10:25,262 - mmdeploy - INFO - mmpose:     None
2022-07-11 15:10:25,262 - mmdeploy - INFO - mmrotate:   None

Thanks a lot in advance !

AllentDan commented 2 years ago

Hi, @Antoine-Prieur. Was the visualization okay? In my testing, the result is not NaN.

Antoine-Prieur commented 2 years ago

Hello, thanks for your response,

I just tried, and the visualization is not okay either: it just shows some zeros with onnxruntime, but the correct prediction with PyTorch.

Could it be that I don't have a GPU on the machine where I'm converting the model? I ran a lot of tests on my side, CPU only, with different opsets, torch versions, etc., and they always gave NaN values with the exact same messages during conversion. Maybe some of the operations in the graph are not supported on CPU.
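
A quick way to check which execution providers this onnxruntime build offers and which ones the session actually uses (a sketch; both calls are standard onnxruntime APIs):

import onnxruntime as ort

# Providers compiled into this onnxruntime build, e.g. ['CPUExecutionProvider']
print(ort.get_available_providers())

sess = ort.InferenceSession("models/textrecog/satrn_small/end2end.onnx")
# Providers this particular session runs with
print(sess.get_providers())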

AllentDan commented 2 years ago

> Hello, thanks for your response,
>
> I just tried, and the visualization is not okay either: it just shows some zeros with onnxruntime, but the correct prediction with PyTorch.
>
> Could it be that I don't have a GPU on the machine where I'm converting the model? I ran a lot of tests on my side, CPU only, with different opsets, torch versions, etc., and they always gave NaN values with the exact same messages during conversion. Maybe some of the operations in the graph are not supported on CPU.

Did the error show up only for SATRN, or for other models as well? There is no GPU-only operator in SATRN. In fact, I ran the script you gave above successfully, and it used only the CPU.

Antoine-Prieur commented 2 years ago

I tried to convert CRNN, and it worked well. I also converted a few detection models, which worked fine too.

I've just done a fresh install, following the exact versions in the installation guide, and I still have the same NaN issue when I inspect the output tensor; the visualization gives me zeros (it probably replaces NaN with zeros).

My guess is that I'm missing a CUDA/cuDNN dependency somewhere. I'm going to try the same thing on a GPU cluster to see if it works.

Antoine-Prieur commented 2 years ago

I tried on two other machines (with GPUs this time), and I always get the exact same problem. I also installed the dependencies for the ONNX optimizer (to remove the Can not optimize model, please build torchscipt extension. warning I had earlier). I made sure to have the exact same config file as the official SATRN config, and used the satrn-small weights from the MMOCR text recognition models page.

Maybe I'm missing something in the setup of the project. I first created the Python venv (using conda), following exactly the installation guide for Linux. I installed onnxruntime, downloaded the Linux prebuilt binary package, and exported the paths to ONNXRUNTIME_DIR and LD_LIBRARY_PATH. To build with support for ort and ort optimization, I used:

export Torch_DIR=$(python -c "import torch;print(torch.utils.cmake_prefix_path + '/Torch')")
cmake -DCMAKE_CXX_COMPILER=g++-7 -DTorch_DIR=${Torch_DIR} -DMMDEPLOY_TARGET_BACKENDS="ort;torchscript" -DONNXRUNTIME_DIR=${ONNXRUNTIME_DIR}

Then I installed the project with pip install -e ., and MMOCR and its dependencies with:

pip install mmocr==0.4.1
pip install mmdet==2.20.0

I finally executed the script I sent earlier.

AllentDan commented 2 years ago

Okay, I will follow your steps later and check whether there is any possible bug.

AllentDan commented 2 years ago

Hi, @Antoine-Prieur. The bug got fixed in the latest mmdeploy.

Antoine-Prieur commented 2 years ago

Hello, thank you very much for your time. I tried the fix and it works well. Have a good day!

Phelan164 commented 2 years ago

@AllentDan @Antoine-Prieur did you try running inference with a batch? I tried, and the result is not correct: every result from the 2nd input onward decodes to the same word ("the").

from onnxruntime import InferenceSession
import numpy as np

onnx_model = InferenceSession("models/textrecog/satrn_small/end2end.onnx")

inp = np.random.randn(3, 32, 100)
inps = np.array([inp for i in range(5)])
test = onnx_model.run(input_feed={"input": inps.astype(np.float32)}, output_names=["output"])
ocr = test[0]
dictionary = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!\"#$%&'()*+,-./:;<=>?@[\\]_`~"
for i in range(ocr.shape[0]):
    max_indices = []
    for outer in range(ocr.shape[1]):
        character_index = -1
        character_value = 0
        for inner in range(ocr.shape[2]):
            value = ocr[i][outer][inner]
            if value > character_value:
                character_value = value
                character_index = inner
        max_indices.append(character_index)
    recognized = ""

    for max_index in max_indices:
        if max_index == len(dictionary):
            continue #unk
        if max_index == len(dictionary) + 1:
            break #eos
        recognized += dictionary[max_index]
    print("--->>>>>> recognized", recognized)

The result I got:

--->>>>>> recognized newsgroups:
--->>>>>> recognized the
--->>>>>> recognized the
--->>>>>> recognized the
--->>>>>> recognized the
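
For reference, the manual argmax loop above can be written more compactly with numpy (a sketch; it keeps the same convention that index len(dictionary) is <UKN> and len(dictionary) + 1 is <EOS>, but unlike the loop it still picks a character when all scores are <= 0):

import numpy as np

def decode(ocr, dictionary):
    results = []
    for probs in ocr:                      # probs: (seq_len, num_classes)
        chars = []
        for idx in probs.argmax(axis=-1):  # best class at each time step
            if idx == len(dictionary):     # <UKN>: skip this step
                continue
            if idx == len(dictionary) + 1: # <EOS>: stop decoding
                break
            chars.append(dictionary[idx])
        results.append("".join(chars))
    return results
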
Antoine-Prieur commented 2 years ago

Hello @Phelan164, have you seen this issue: https://github.com/open-mmlab/mmdeploy/issues/791 ? I had a similar problem before: there was an issue with the triu function, and SATRN uses it. It's now fixed on master, but the release is not out yet.
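
A quick sanity check for this batching bug (a sketch reusing the model path from earlier in the thread): run the same input alone and inside a batch, and compare the outputs.

from onnxruntime import InferenceSession
import numpy as np

sess = InferenceSession("models/textrecog/satrn_small/end2end.onnx")
x = np.random.randn(1, 3, 32, 100).astype(np.float32)
batch = np.repeat(x, 5, axis=0)  # the same sample, five times

single = sess.run(["output"], {"input": x})[0]
batched = sess.run(["output"], {"input": batch})[0]

# With correct batching, every row of `batched` matches `single[0]`;
# with the triu bug, rows after the first diverge.
for i in range(batched.shape[0]):
    print(i, np.allclose(single[0], batched[i], atol=1e-4))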

Phelan164 commented 2 years ago

@Antoine-Prieur thanks for your answer. So to use the fixed triu function, do I need to build a new SDK (CPU + ONNXRuntime) from source, following this, for now?

hanhan1990 commented 2 years ago

> @AllentDan @Antoine-Prieur did you try to run inference with batch? […] (full comment and code quoted verbatim from Phelan164 above)

Have you solved this problem? I have the same problem.

AllentDan commented 2 years ago

@hanhan1990 @Phelan164 check this out please.