open-mmlab / mmdeploy

OpenMMLab Model Deployment Framework
https://mmdeploy.readthedocs.io/en/latest/
Apache License 2.0

[Bug] Error when breaking point in symbolic function #1358

Open vansin opened 1 year ago

vansin commented 1 year ago

Describe the bug

When debugging, setting a breakpoint inside a symbolic function causes the conversion to fail with the error below. Without a breakpoint in the symbolic function, the conversion succeeds.


(mmdeploy-ncnn) ➜  mmdeploy git:(debug) python -m debugpy --listen 5678 --wait-for-client ./tools/deploy.py \
/project/mmdeploy-ncnn/mmdeploy/configs/mmdet/detection/single-stage_ncnn_static-800x1344.py \
/project/mmdeploy-ncnn/mmdetection/configs/yolo/yolov3_d53_8xb8-320-273e_coco.py \
"/project/mmdeploy-ncnn/mmdeploy_checkpoints/mmdet/yolov3/yolov3_d53_320_273e_coco-421362b6.pth" \
"../mmdetection/demo/demo.jpg"  \
--work-dir "../mmdeploy_regression_working_dir/mmdet/yolov3/ncnn/static/fp32/yolov3_d53_320_273e_coco-421362b6"  \
--device cpu  \
--log-level INFO \
--test-img ./tests/data/tiger.jpeg

[2022-11-12 15:10:47.937] [mmdeploy] [info] [model.cpp:98] Register 'DirectoryModel'
[2022-11-12 15:10:50.784] [mmdeploy] [info] [model.cpp:98] Register 'DirectoryModel'
[2022-11-12 15:11:04.564] [mmdeploy] [info] [model.cpp:98] Register 'DirectoryModel'
11/12 15:11:04 - mmengine - INFO - Start pipeline mmdeploy.apis.pytorch2onnx.torch2onnx in subprocess
11/12 15:11:05 - mmengine - WARNING - Failed to search registry with scope "mmdet" in the "Codebases" registry tree. As a workaround, the current "Codebases" registry in "mmdeploy" is used to build instance. This may cause unexpected failure when running the built modules. Please check whether "mmdet" is a correct scope, or whether the registry is initialized.
11/12 15:11:05 - mmengine - WARNING - Failed to search registry with scope "mmdet" in the "mmdet_tasks" registry tree. As a workaround, the current "mmdet_tasks" registry in "mmdeploy" is used to build instance. This may cause unexpected failure when running the built modules. Please check whether "mmdet" is a correct scope, or whether the registry is initialized.
11/12 15:11:05 - mmengine - WARNING - Failed to search registry with scope "mmdet" in the "Codebases" registry tree. As a workaround, the current "Codebases" registry in "mmdeploy" is used to build instance. This may cause unexpected failure when running the built modules. Please check whether "mmdet" is a correct scope, or whether the registry is initialized.
local loads checkpoint from path: /project/mmdeploy-ncnn/mmdeploy_checkpoints/mmdet/yolov3/yolov3_d53_320_273e_coco-421362b6.pth
11/12 15:11:06 - mmengine - WARNING - DeprecationWarning: get_onnx_config will be deprecated in the future. 
11/12 15:11:06 - mmengine - INFO - Export PyTorch model to ONNX: ../mmdeploy_regression_working_dir/mmdet/yolov3/ncnn/static/fp32/yolov3_d53_320_273e_coco-421362b6/end2end.onnx.
11/12 15:11:06 - mmengine - WARNING - Can not find mmdet.models.dense_heads.RPNHead.get_bboxes, function rewrite will not be applied
/project/mmdeploy-ncnn/mmdeploy/mmdeploy/codebase/mmdet/models/detectors/single_stage.py:66: TracerWarning: Iterating over a tensor might cause the trace to be incorrect. Passing a tensor of different shape won't change the number of iterations executed (and might lead to errors or silently give incorrect results).
  img_shape = [int(val) for val in img_shape]
/project/mmdeploy-ncnn/mmdeploy/mmdeploy/codebase/mmdet/models/detectors/single_stage.py:66: TracerWarning: Converting a tensor to a Python integer might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  img_shape = [int(val) for val in img_shape]
/project/mmdeploy-ncnn/mmdeploy/mmdeploy/pytorch/functions/getattribute.py:18: TracerWarning: Converting a tensor to a Python integer might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  ret = torch.Size([int(s) for s in ret])
Backend TkAgg is interactive backend. Turning interactive mode on.
/project/mmdeploy-ncnn/mmdeploy/mmdeploy/pytorch/functions/getattribute.py:18: TracerWarning: Converting a tensor to a Python integer might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  ret = torch.Size([int(s) for s in ret])
WARNING: The shape inference of mmdeploy::Yolov3DetectionOutput type is missing, so it may result in wrong shape inference for the exported graph. Please consider adding it in symbolic function.
Process Process-2:
Traceback (most recent call last):
  File "/home/huwenxing/miniconda3/envs/mmdeploy-ncnn/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/huwenxing/miniconda3/envs/mmdeploy-ncnn/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/project/mmdeploy-ncnn/mmdeploy/mmdeploy/apis/core/pipeline_manager.py", line 107, in __call__
    ret = func(*args, **kwargs)
  File "/project/mmdeploy-ncnn/mmdeploy/mmdeploy/apis/pytorch2onnx.py", line 97, in torch2onnx
    export(
  File "/project/mmdeploy-ncnn/mmdeploy/mmdeploy/apis/core/pipeline_manager.py", line 356, in _wrap
    return self.call_function(func_name_, *args, **kwargs)
  File "/project/mmdeploy-ncnn/mmdeploy/mmdeploy/apis/core/pipeline_manager.py", line 326, in call_function
    return self.call_function_local(func_name, *args, **kwargs)
  File "/project/mmdeploy-ncnn/mmdeploy/mmdeploy/apis/core/pipeline_manager.py", line 275, in call_function_local
    return pipe_caller(*args, **kwargs)
  File "/project/mmdeploy-ncnn/mmdeploy/mmdeploy/apis/core/pipeline_manager.py", line 107, in __call__
    ret = func(*args, **kwargs)
  File "/project/mmdeploy-ncnn/mmdeploy/mmdeploy/apis/onnx/export.py", line 123, in export
    torch.onnx.export(
  File "/home/huwenxing/miniconda3/envs/mmdeploy-ncnn/lib/python3.8/site-packages/torch/onnx/__init__.py", line 316, in export
    return utils.export(model, args, f, export_params, verbose, training,
  File "/home/huwenxing/miniconda3/envs/mmdeploy-ncnn/lib/python3.8/site-packages/torch/onnx/utils.py", line 107, in export
    _export(model, args, f, export_params, verbose, training, input_names, output_names,
  File "/home/huwenxing/miniconda3/envs/mmdeploy-ncnn/lib/python3.8/site-packages/torch/onnx/utils.py", line 724, in _export
    _model_to_graph(model, args, verbose, input_names,
  File "/project/mmdeploy-ncnn/mmdeploy/mmdeploy/core/rewriters/rewriter_utils.py", line 379, in wrapper
    return self.func(self, *args, **kwargs)
  File "/project/mmdeploy-ncnn/mmdeploy/mmdeploy/apis/onnx/optimizer.py", line 10, in model_to_graph__custom_optimizer
    graph, params_dict, torch_out = ctx.origin_func(*args, **kwargs)
  File "/home/huwenxing/miniconda3/envs/mmdeploy-ncnn/lib/python3.8/site-packages/torch/onnx/utils.py", line 497, in _model_to_graph
    graph = _optimize_graph(graph, operator_export_type,
  File "/home/huwenxing/miniconda3/envs/mmdeploy-ncnn/lib/python3.8/site-packages/torch/onnx/utils.py", line 217, in _optimize_graph
    torch._C._jit_pass_lint(graph)
RuntimeError: Unable to cast from non-held to held instance (T& to Holder<T>) (compile in debug mode for type information)

Environment

(mmdeploy-ncnn) ➜  mmdeploy git:(debug) python tools/check_env.py 
11/12 15:14:33 - mmengine - INFO - 

11/12 15:14:33 - mmengine - INFO - **********Environmental information**********
11/12 15:14:33 - mmengine - INFO - sys.platform: linux
11/12 15:14:33 - mmengine - INFO - Python: 3.8.13 (default, Oct 21 2022, 23:50:54) [GCC 11.2.0]
11/12 15:14:33 - mmengine - INFO - CUDA available: True
11/12 15:14:33 - mmengine - INFO - numpy_random_seed: 2147483648
11/12 15:14:33 - mmengine - INFO - GPU 0: NVIDIA GeForce GTX 1660 Ti
11/12 15:14:33 - mmengine - INFO - CUDA_HOME: /usr/local/cuda-11.3
11/12 15:14:33 - mmengine - INFO - NVCC: Cuda compilation tools, release 11.3, V11.3.58
11/12 15:14:33 - mmengine - INFO - GCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
11/12 15:14:33 - mmengine - INFO - PyTorch: 1.10.0+cu113
11/12 15:14:33 - mmengine - INFO - PyTorch compiling details: PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.2.3 (Git Hash 7336ca9f055cf1bfa13efb658fe15dc9b41f0740)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX512
  - CUDA Runtime 11.3
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
  - CuDNN 8.2
  - Magma 2.5.2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.2.0, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.10.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, 

11/12 15:14:33 - mmengine - INFO - TorchVision: 0.11.0+cu113
11/12 15:14:33 - mmengine - INFO - OpenCV: 4.6.0
11/12 15:14:33 - mmengine - INFO - MMEngine: 0.3.0
11/12 15:14:33 - mmengine - INFO - MMCV: 2.0.0rc2
11/12 15:14:33 - mmengine - INFO - MMCV Compiler: GCC 9.3
11/12 15:14:33 - mmengine - INFO - MMCV CUDA Compiler: 11.3
11/12 15:14:33 - mmengine - INFO - MMDeploy: 0.10.0+34419ab
11/12 15:14:33 - mmengine - INFO - 

11/12 15:14:33 - mmengine - INFO - **********Backend information**********
11/12 15:14:33 - mmengine - INFO - onnxruntime: None    ops_is_avaliable : False
11/12 15:14:33 - mmengine - INFO - tensorrt: None       ops_is_avaliable : False
11/12 15:14:33 - mmengine - INFO - ncnn: 1.0.20221110   ops_is_avaliable : True
11/12 15:14:33 - mmengine - INFO - pplnn_is_avaliable: False
11/12 15:14:33 - mmengine - INFO - openvino_is_avaliable: False
11/12 15:14:33 - mmengine - INFO - snpe_is_available: False
11/12 15:14:33 - mmengine - INFO - ascend_is_available: False
11/12 15:14:33 - mmengine - INFO - coreml_is_available: False
11/12 15:14:33 - mmengine - INFO - 

11/12 15:14:33 - mmengine - INFO - **********Codebase information**********
11/12 15:14:33 - mmengine - INFO - mmdet:       3.0.0rc3
11/12 15:14:33 - mmengine - INFO - mmseg:       None
11/12 15:14:33 - mmengine - INFO - mmcls:       None
11/12 15:14:33 - mmengine - INFO - mmocr:       None
11/12 15:14:33 - mmengine - INFO - mmedit:      None
11/12 15:14:33 - mmengine - INFO - mmdet3d:     None
11/12 15:14:33 - mmengine - INFO - mmpose:      None
11/12 15:14:33 - mmengine - INFO - mmrotate:    None
11/12 15:14:33 - mmengine - INFO - mmaction:    None

Error traceback

No response

grimoire commented 1 year ago

Honestly, I don't know. The error message comes from pybind11, and I am not an expert in it. PyTorch calls the symbolic function from torch._C._jit_pass_onnx (https://github.com/pytorch/pytorch/blob/bdc9911575277848ccac56b344dd624aa97fb87d/torch/csrc/jit/passes/onnx.cpp#L544), which loads the function through pybind11.

My guess is that the debugger/tracer breaks something in that path, but I have no idea how to fix it. It would be a good idea to open an issue with PyTorch; they might have a fix.
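
One way to check the guess that the debugger's trace hook is involved is to switch off Python-level tracing around the export call and see whether the error goes away. The snippet below is only a sketch of that experiment, not a confirmed fix. `tracing_suspended` and `symbolic_like` are hypothetical names made up for illustration; in a real run the `with` block would wrap the export call instead. It uses only `sys.gettrace` and `sys.settrace` from the standard library, and it assumes the debugger relies on that hook. A debugger may also use other mechanisms that this does not touch, and breakpoints inside the block will generally not be hit while tracing is off.

```python
import sys
from contextlib import contextmanager


@contextmanager
def tracing_suspended():
    """Temporarily remove the Python-level trace hook for the current thread.

    Hypothetical helper: it only affects the hook installed via
    sys.settrace, which is an assumption about how the debugger works.
    """
    saved = sys.gettrace()      # whatever hook is currently installed (or None)
    sys.settrace(None)          # disable tracing for new frames in this thread
    try:
        yield
    finally:
        sys.settrace(saved)     # restore the previous hook, even on error


def symbolic_like(x):
    # Stand-in for a symbolic function; the real one would be called
    # by PyTorch during ONNX export.
    return x * 2


with tracing_suspended():
    # No trace hook is active here, so a settrace-based debugger
    # would not step into symbolic_like.
    result = symbolic_like(21)
```

If the export succeeds with tracing suspended and fails with it enabled, that would support the idea that the tracer is interfering, which is useful detail to include in a PyTorch issue.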