open-mmlab / mmdeploy

OpenMMLab Model Deployment Framework
https://mmdeploy.readthedocs.io/en/latest/
Apache License 2.0

ONNX parser isn't working with official TensorRT onnx parser #970

Closed IuliuNovac closed 1 year ago

IuliuNovac commented 2 years ago

Checklist

  1. I have searched related issues but cannot get the expected help.
  2. I have read the FAQ documentation but cannot get the expected help.
  3. The bug has not been fixed in the latest version.

Describe the bug

I have been converting models to ONNX and then trying to parse them with TensorRT's ONNX parser, but the parsing fails. There is no error whatsoever; the parser just returns false. The steps for reproducing the error are below.

However, when parsing other ONNX files, there are no issues. It happens only with the files exported by mmdeploy.

The odd part is that I can open the file with Netron, so it has not been corrupted by the "cp" bug on Linux.

**Reproduction**

Weights:

    wget -P checkpoints https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r50_fpn_1x_coco/faster_rcnn_r50_fpn_1x_coco_20200130-047c8118.pth

Model conversion:

    python ~/mmdeploy/tools/deploy.py \
        ~/mmdeploy/configs/mmdet/detection/detection_onnxruntime_dynamic.py \
        ~/mmdetection/configs/faster_rcnn/faster_rcnn_r50_fpn_1x_coco.py \
        ~/checkpoints/faster_rcnn_r50_fpn_1x_coco_20200130-047c8118.pth \
        ~/mmdetection/demo/demo.jpg \
        --work-dir mmdeploy_model/faster-rcnn \
        --device cpu \
        --dump-info

So far this looks all right, but when the model is loaded with the external ONNX parser, the parsing fails. The path is right; I checked it. I am using the latest TensorRT (8.4).

TensorRT: https://github.com/NVIDIA/TensorRT, with their ONNX parser: https://github.com/onnx/onnx-tensorrt/tree/9f82b2b6072be6c01f65306388e5c07621d3308f

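    // Needs <fstream>, <memory>, <stdexcept>, <vector> and NvOnnxParser.h.
    auto parser = std::unique_ptr<nvonnxparser::IParser>(
            nvonnxparser::createParser(*network, m_logger));
    if (!parser) {
        throw std::runtime_error("Failed to create ONNX parser");
    }
    // Open at the end of the file so that tellg() reports its size.
    std::ifstream file(onnxModelPath, std::ios::binary | std::ios::ate);
    if (!file) {
        throw std::runtime_error("Unable to open ONNX file");
    }
    std::streamsize size = file.tellg();
    file.seekg(0, std::ios::beg);

    // Read the whole serialized model into memory.
    std::vector<char> buffer(size);
    if (!file.read(buffer.data(), size)) {
        throw std::runtime_error("Unable to read ONNX file");
    }

    // parse() returns false on failure; this is the call that fails here.
    auto parsed = parser->parse(buffer.data(), buffer.size());
    if (!parsed) {
        throw std::runtime_error("Parser failed");
    }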

  1. Did you make any modifications on the code or config? Did you understand what you have modified?

     No modifications were made.

Environment

    2022-08-31 19:56:43,400 - mmdeploy - INFO - **Environmental information**
    2022-08-31 19:56:43,574 - mmdeploy - INFO - sys.platform: linux
    2022-08-31 19:56:43,574 - mmdeploy - INFO - Python: 3.8.13 (default, Mar 28 2022, 11:38:47) [GCC 7.5.0]
    2022-08-31 19:56:43,574 - mmdeploy - INFO - CUDA available: True
    2022-08-31 19:56:43,574 - mmdeploy - INFO - GPU 0: NVIDIA A10G
    2022-08-31 19:56:43,574 - mmdeploy - INFO - CUDA_HOME: /usr
    2022-08-31 19:56:43,574 - mmdeploy - INFO - NVCC: Cuda compilation tools, release 10.1, V10.1.24
    2022-08-31 19:56:43,574 - mmdeploy - INFO - GCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
    2022-08-31 19:56:43,574 - mmdeploy - INFO - PyTorch: 1.11.0+cu113
    2022-08-31 19:56:43,574 - mmdeploy - INFO - PyTorch compiling details: PyTorch built with:

    2022-08-31 19:56:43,574 - mmdeploy - INFO - TorchVision: 0.12.0+cu113
    2022-08-31 19:56:43,574 - mmdeploy - INFO - OpenCV: 4.5.5
    2022-08-31 19:56:43,574 - mmdeploy - INFO - MMCV: 1.6.0
    2022-08-31 19:56:43,574 - mmdeploy - INFO - MMCV Compiler: GCC 9.3
    2022-08-31 19:56:43,574 - mmdeploy - INFO - MMCV CUDA Compiler: 11.3
    2022-08-31 19:56:43,574 - mmdeploy - INFO - MMDeploy: 0.7.0+83b11bc
    2022-08-31 19:56:43,574 - mmdeploy - INFO -

    2022-08-31 19:56:43,574 - mmdeploy - INFO - **Backend information**
    2022-08-31 19:56:43,962 - mmdeploy - INFO - onnxruntime: 1.8.1 ops_is_avaliable : True
    2022-08-31 19:56:43,986 - mmdeploy - INFO - tensorrt: 8.4.1.5 ops_is_avaliable : True
    2022-08-31 19:56:44,000 - mmdeploy - INFO - ncnn: None ops_is_avaliable : False
    2022-08-31 19:56:44,102 - mmdeploy - INFO - pplnn_is_avaliable: True
    2022-08-31 19:56:44,117 - mmdeploy - INFO - openvino_is_avaliable: True
    2022-08-31 19:56:44,131 - mmdeploy - INFO - snpe_is_available: False
    2022-08-31 19:56:44,131 - mmdeploy - INFO -

    2022-08-31 19:56:44,131 - mmdeploy - INFO - **Codebase information**
    2022-08-31 19:56:44,132 - mmdeploy - INFO - mmdet: 2.25.0
    2022-08-31 19:56:44,132 - mmdeploy - INFO - mmseg: None
    2022-08-31 19:56:44,132 - mmdeploy - INFO - mmcls: None
    2022-08-31 19:56:44,132 - mmdeploy - INFO - mmocr: None
    2022-08-31 19:56:44,132 - mmdeploy - INFO - mmedit: None
    2022-08-31 19:56:44,132 - mmdeploy - INFO - mmdet3d: None
    2022-08-31 19:56:44,132 - mmdeploy - INFO - mmpose: None
    2022-08-31 19:56:44,132 - mmdeploy - INFO - mmrotate: None

Error traceback

None; the parse call just fails.
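For what it's worth, when the parser returns false it usually still holds a list of recorded errors. In the C++ API these can be read through `getNbErrors()` and `getError(i)->desc()` on `nvonnxparser::IParser`. The following is only a sketch of how they could be collected: `collectParserErrors` is a hypothetical helper name, and `FakeParser` / `FakeError` are hypothetical stand-ins that exist solely so the snippet compiles without TensorRT installed.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-ins mirroring the two accessors used below, so that this
// sketch compiles without TensorRT. They are NOT part of any library.
struct FakeError {
    std::string text;
    const char* desc() const { return text.c_str(); }
};

struct FakeParser {
    std::vector<FakeError> errors;
    int getNbErrors() const { return static_cast<int>(errors.size()); }
    const FakeError* getError(int i) const {
        return &errors[static_cast<std::size_t>(i)];
    }
};

// Joins every recorded parser error into one message, one error per line.
// Works with any type exposing getNbErrors() and getError(i)->desc().
template <typename Parser>
std::string collectParserErrors(const Parser& parser) {
    std::string msg;
    for (int i = 0; i < parser.getNbErrors(); ++i) {
        if (!msg.empty()) {
            msg += '\n';
        }
        msg += "error " + std::to_string(i) + ": " + parser.getError(i)->desc();
    }
    return msg;
}
```

With the real parser, one would pass `*parser` instead of a `FakeParser`, for example by throwing `"Parser failed:\n" + collectParserErrors(*parser)` when `parse` returns false. That should at least show what the parser objected to.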

Bug fix

I have no clue why it happens, unless you modified ONNX with some custom stuff.

tpoisonooo commented 2 years ago

@lvhan028

IuliuNovac commented 2 years ago

Warnings when making the ONNX file:

    [2022-08-31 19:27:47.396] [mmdeploy] [info] [model.cpp:95] Register 'DirectoryModel'
    [2022-08-31 19:27:48.821] [mmdeploy] [info] [model.cpp:95] Register 'DirectoryModel'
    /home/ubuntu/mmdetection/mmdet/datasets/utils.py:66: UserWarning: "ImageToTensor" pipeline is replaced by "DefaultFormatBundle" for batch inference. It is recommended to manually replace it in the test data pipeline in your config file.
      warnings.warn(
    [2022-08-31 19:27:51.289] [mmdeploy] [info] [model.cpp:95] Register 'DirectoryModel'
    2022-08-31 19:27:51,295 - mmdeploy - INFO - Start pipeline mmdeploy.apis.pytorch2onnx.torch2onnx in subprocess
    load checkpoint from local path: /home/ubuntu/checkpoints/faster_rcnn_r50_fpn_1x_coco_20200130-047c8118.pth
    /home/ubuntu/mmdetection/mmdet/datasets/utils.py:66: UserWarning: "ImageToTensor" pipeline is replaced by "DefaultFormatBundle" for batch inference. It is recommended to manually replace it in the test data pipeline in your config file.
      warnings.warn(
    2022-08-31 19:27:52,719 - mmdeploy - WARNING - DeprecationWarning: get_onnx_config will be deprecated in the future.
    2022-08-31 19:27:52,719 - mmdeploy - INFO - Export PyTorch model to ONNX: mmdeploy_model/faster-rcnn/end2end.onnx.
    /home/ubuntu/mmdeploy/mmdeploy/core/optimizers/function_marker.py:158: TracerWarning: Converting a tensor to a Python integer might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
      ys_shape = tuple(int(s) for s in ys.shape)
    /home/ubuntu/mmdetection/mmdet/models/dense_heads/anchor_head.py:123: UserWarning: DeprecationWarning: anchor_generator is deprecated, please use "prior_generator" instead
      warnings.warn('DeprecationWarning: anchor_generator is deprecated, '
    /home/ubuntu/mmdetection/mmdet/core/anchor/anchor_generator.py:333: UserWarning: grid_anchors would be deprecated soon. Please use grid_priors
      warnings.warn('grid_anchors would be deprecated soon. '
    /home/ubuntu/mmdetection/mmdet/core/anchor/anchor_generator.py:369: UserWarning: single_level_grid_anchors would be deprecated soon. Please use single_level_grid_priors
      warnings.warn(
    /home/ubuntu/mmdeploy/mmdeploy/codebase/mmdet/models/dense_heads/rpn_head.py:78: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
      assert cls_score.size()[-2:] == bbox_pred.size()[-2:]
    /home/ubuntu/mmdeploy/mmdeploy/pytorch/functions/topk.py:28: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
      k = torch.tensor(k, device=input.device, dtype=torch.long)
    /home/ubuntu/mmdeploy/mmdeploy/codebase/mmdet/core/bbox/delta_xywh_bbox_coder.py:39: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
      assert pred_bboxes.size(0) == bboxes.size(0)
    /home/ubuntu/mmdeploy/mmdeploy/codebase/mmdet/core/bbox/delta_xywh_bbox_coder.py:41: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
      assert pred_bboxes.size(1) == bboxes.size(1)
    /home/ubuntu/mmdeploy/mmdeploy/codebase/mmdet/deploy/utils.py:47: TracerWarning: Using len to get tensor shape might cause the trace to be incorrect. Recommended usage would be tensor.shape[0]. Passing a tensor of different shape might lead to errors or silently give incorrect results.
      assert len(max_shape) == 2, 'max_shape should be [h, w]'
    /home/ubuntu/mmdeploy/mmdeploy/codebase/mmdet/core/post_processing/bbox_nms.py:92: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
      iou_threshold = torch.tensor([iou_threshold], dtype=torch.float32)
    /home/ubuntu/mmdeploy/mmdeploy/codebase/mmdet/core/post_processing/bbox_nms.py:93: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
      score_threshold = torch.tensor([score_threshold], dtype=torch.float32)
    /home/ubuntu/mmdeploy/mmdeploy/mmcv/ops/nms.py:40: TracerWarning: Converting a tensor to a Python float might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
      score_threshold = float(score_threshold)
    /home/ubuntu/mmdeploy/mmdeploy/mmcv/ops/nms.py:41: TracerWarning: Converting a tensor to a Python float might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
      iou_threshold = float(iou_threshold)
    /home/ubuntu/anaconda3/envs/openmmlab/lib/python3.8/site-packages/mmcv/ops/nms.py:171: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
      assert boxes.size(1) == 4
    /home/ubuntu/anaconda3/envs/openmmlab/lib/python3.8/site-packages/mmcv/ops/nms.py:172: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
      assert boxes.size(0) == scores.size(0)
    /home/ubuntu/anaconda3/envs/openmmlab/lib/python3.8/site-packages/mmcv/ops/nms.py:31: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
      if max_num > 0:
    /home/ubuntu/anaconda3/envs/openmmlab/lib/python3.8/site-packages/mmcv/ops/roi_align.py:83: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
      assert rois.size(1) == 5, 'RoI must be (idx, x1, y1, x2, y2)!'
    /home/ubuntu/anaconda3/envs/openmmlab/lib/python3.8/site-packages/torch/onnx/symbolic_opset9.py:2905: UserWarning: Exporting aten::index operator of advanced indexing in opset 11 is achieved by combination of multiple ONNX operators, including Reshape, Transpose, Concat, and Gather. If indices include negative values, the exported graph will produce incorrect results.
      warnings.warn("Exporting aten::index operator of advanced indexing in opset " +
    2022-08-31 19:28:03,897 - mmdeploy - INFO - Execute onnx optimize passes.
    2022-08-31 19:28:04,247 - mmdeploy - INFO - Finish pipeline mmdeploy.apis.pytorch2onnx.torch2onnx
    2022-08-31 19:28:04,611 - mmdeploy - WARNING - "visualize_model" has been skipped may be because it's running on a headless device.
    2022-08-31 19:28:04,611 - mmdeploy - INFO - All process success.

IuliuNovac commented 2 years ago

I just exported the model from here: https://github.com/WongKinYiu/yolov7. It parses fine, but I need to add support for mmdet models.

tpoisonooo commented 2 years ago

  1. If you want a .engine (TRT format) file, DEPLOY_CONFIG should directly use configs/mmdet/detection/detection_tensorrt_dynamic-320x320-1344x1344.py, rather than converting to .onnx first and then to .engine.

  2. onnx -> engine fails because TensorRT does not support TopK with a dynamic K value. Please search for TopK nodes in the ONNX file and check their inputs. When detection_tensorrt_dynamicXXX is used, the mmdeploy rewriter rewrites the K value of TopK to a fixed integer.

  3. Try this config to convert Faster R-CNN:

    python3 tools/deploy.py \
        configs/mmdet/detection/detection_tensorrt_dynamic-320x320-1344x1344.py \
        ../mmdetection/configs/faster_rcnn/faster_rcnn_r50_fpn_1x_coco.py \
        https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r50_fpn_1x_coco/faster_rcnn_r50_fpn_1x_coco_20200130-047c8118.pth \
        ../mmdetection/demo/demo.jpg \
        --work-dir faster-rcnn \
        --device cuda:0 \
        --show

If someone would like to use cu102, please also cherry-pick this fix. Yes, I have noticed that you are using cu113.

tpoisonooo commented 1 year ago

If you have more questions, please open a new issue.