open-mmlab / mmdeploy

OpenMMLab Model Deployment Framework
https://mmdeploy.readthedocs.io/en/latest/
Apache License 2.0

[Bug] Assertion failed: axis >= 0 && axis < nbDims during MMDet to TensorRT Conversion #2389

Closed IanMcC123 closed 1 year ago

IanMcC123 commented 1 year ago

Describe the bug

I am trying to convert an MMDetection ConvNeXt segmentation model to TensorRT in order to speed up inference. I have tried multiple strategies/scripts, but have not been able to get past the ONNX-to-TensorRT conversion.

Reproduction

Before converting to TensorRT, the model must first be exported to ONNX. To do so, I first tried the (since deprecated) conversion script provided by MMDetection, but could not run it successfully. I then tried the newer script provided by MMDeploy, and that conversion failed as well; both attempts end in the error below. I have not found a resolution in any existing issues. I also used the docker image provided by MMDeploy for the conversion and ran into the same issue.
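
For concreteness, the MMDeploy route looks roughly like the following (a sketch in the same subprocess style as my script; the deploy config path is an assumption taken from mmdeploy's configs/mmdet tree, and the model config/checkpoint names are placeholders):

import subprocess

# Sketch of mmdeploy's conversion entry point; pick a deploy config under
# mmdeploy/configs/mmdet/ that matches your task and input shapes.
subprocess.run([
    'python', 'tools/deploy.py',
    'configs/mmdet/detection/detection_tensorrt_static-800x1344.py',  # assumed
    'convnext_config.py',       # model config (placeholder)
    'convnext_weights.pth',     # checkpoint (placeholder)
    'demo.jpg',                 # test image
    '--work-dir', '/tmp/mmdeploy_out',
    '--device', 'cuda:0',
], check=True)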

I have since written a script myself, loosely based on both of the scripts mentioned above, but I run into the same problem when passing in the MMDetection ConvNeXt model and the demo image provided by OpenMMLab. Here is the minimal script:

import os
import subprocess

import onnx
import torch
from mmcv import Config
from mmdet.core.export import generate_inputs_and_wrap_model

# parse_normalize_cfg is copied from mmdet's tools/deployment/pytorch2onnx.py;
# args (config/weights) and trt_path come from argparse, omitted here.

temp_onnx_path = '/tmp/panoptic.onnx'
temp_onnx_dir = os.path.dirname(temp_onnx_path)
if temp_onnx_dir and not os.path.isdir(temp_onnx_dir):
    os.makedirs(temp_onnx_dir)

cfg = Config.fromfile(args.config)
normalize_cfg = parse_normalize_cfg(cfg.test_pipeline)

input_config = {
    'input_shape': (1, 3, 640, 480),
    'input_path': 'demo.jpg',
    'normalize_cfg': normalize_cfg
}

# Wrap the detector so its forward() is traceable, and build dummy inputs.
wrapped_model, tensor_data = generate_inputs_and_wrap_model(
    args.config, args.weights, input_config)

output_names = ['dets', 'labels']
if wrapped_model.with_mask:
    output_names.append('masks')

torch.onnx.export(
    wrapped_model,
    tensor_data,
    temp_onnx_path,
    input_names=['input'],
    output_names=output_names,
    export_params=True,
    keep_initializers_as_inputs=True,
    do_constant_folding=True,
    verbose=False,
    opset_version=11,
    dynamic_axes=None)

del wrapped_model
torch.cuda.empty_cache()

# Validate the exported graph before handing it to TensorRT.
onnx_model = onnx.load(temp_onnx_path)
onnx.checker.check_model(onnx_model, full_check=True)

if trt_path.endswith(os.path.sep):
    raise AttributeError(
        f'trt_path provided ({trt_path}) must be a file path, not a directory')
if not trt_path.endswith('.trt'):
    trt_path += '.trt'
if not os.path.isdir(os.path.dirname(trt_path)):
    os.makedirs(os.path.dirname(trt_path))

command = ['/opt/tensorrt/bin/trtexec', f'--onnx={temp_onnx_path}',
           f'--saveEngine={trt_path}']
subprocess.run(command)
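
One variation worth noting: since onnx-simplifier is pinned in the requirements below, the exported graph can be passed through it before trtexec to fold the shape/Unsqueeze subgraphs that older TensorRT parsers trip on (a minimal sketch, assuming the export above succeeded):

import onnx
from onnxsim import simplify

# Fold constant shape computations; ok is False if the simplified graph
# fails onnx-simplifier's own validation.
model = onnx.load('/tmp/panoptic.onnx')
model_simplified, ok = simplify(model)
assert ok, 'onnx-simplifier could not validate the simplified model'
onnx.save(model_simplified, '/tmp/panoptic.onnx')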

Here is a reference to the ConvNeXt model config I am using.

The TensorRT version in the working image is 7.1.3.4, but after launching the container I have to upgrade the tensorrt Python package in order to run the conversion. What is the correct way to do this, whether inside or outside of the container? I would like to make sure the problem is not an incompatible trtexec binary. To rule this out, I also attempted the conversion in an NVIDIA TensorRT image (docker run --gpus all -it --rm nvcr.io/nvidia/tensorrt:20.09-py3) and ran into the same problem. The information below comes from my working image.
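
A quick sanity check (a minimal sketch) is to print the Python binding's version and compare it against the trtexec banner; note that the env dump below already shows a mismatch (TRT_VERSION=7.1.3.4 vs TENSORRT_VERSION=8.6.1):

import tensorrt

# The binding version should match the trtexec binary used for conversion;
# a parser built for one TensorRT major version will not behave like another.
print(tensorrt.__version__)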

The following is the result of running docker exec 'container' env:

Environment

QT_X11_NO_MITSHM=1
DISPLAY=:1
QT_GRAPHICSSYSTEM=native
CUDA_VERSION=11.0.221
CUDA_DRIVER_VERSION=450.51.06
CUDA_CACHE_DISABLE=1
_CUDA_COMPAT_PATH=/usr/local/cuda/compat
ENV=/etc/shinit_v2
BASH_ENV=/etc/bash.bashrc
NVIDIA_REQUIRE_CUDA=cuda>=9.0
NCCL_VERSION=2.7.8
CUBLAS_VERSION=11.2.0.252
CUFFT_VERSION=10.2.1.245
CURAND_VERSION=10.2.1.245
CUSPARSE_VERSION=11.1.1.245
CUSOLVER_VERSION=10.6.0.245
NPP_VERSION=11.1.0.245
NVJPEG_VERSION=11.1.1.245
CUDNN_VERSION=8.0.4.12
TRT_VERSION=7.1.3.4
TRTOSS_VERSION=20.09
NSIGHT_SYSTEMS_VERSION=2020.3.2.6
NSIGHT_COMPUTE_VERSION=2020.1.2.4
DALI_VERSION=0.25.1
DALI_BUILD=1612461
DLPROF_VERSION=20.09
LD_LIBRARY_PATH=/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda-10.0/lib64
NVIDIA_VISIBLE_DEVICES=all
NVIDIA_DRIVER_CAPABILITIES=compute,utility,video
MOFED_VERSION=4.6-1.0.1
IBV_DRIVERS=/usr/lib/libibverbs/libmlx5
OPENUCX_VERSION=1.6.1
OPENMPI_VERSION=3.1.6
LIBRARY_PATH=/usr/local/cuda/lib64/stubs:
TENSORRT_VERSION=8.6.1
NVIDIA_TENSORRT_VERSION=20.09
NVIDIA_BUILD_ID=15985252
DEBIAN_FRONTEND=noninteractive
USER=root
HOME=/root
ROS_DISTRO=melodic
ROS_PYTHON_VERSION=3
PYTHON_VERSION=python3.7

The following are relevant package versions inside the container taken from the requirements file:

mmcls==0.23.2
mmcv-full==1.5.3
mmdet==2.25.1
numpy==1.21.6
onnx==1.10.2
onnx-simplifier==0.4.1
opencv-python==4.6.0.66
tensorboard==1.15.0
-f https://download.pytorch.org/whl/torch_stable.html
torch==1.10.0+cu111
torchvision==0.11.0+cu111
torchaudio==0.10.0

NOTE: I have also tried upgrading/downgrading mmcls, mmdet, mmcv-full, onnx, torch, torchvision, and torchaudio, but at best I still end up at the following error.

Error traceback

/usr/local/lib/python3.7/dist-packages/torch/onnx/symbolic_opset9.py:2819: UserWarning: Exporting aten::index operator of advanced indexing in opset 11 is achieved by combination of multiple ONNX operators, including Reshape, Transpose, Concat, and Gather. If indices include negative values, the exported graph will produce incorrect results.
...
WARNING: The shape inference of mmcv::grid_sampler type is missing, so it may result in wrong shape inference for the exported graph. Please consider adding it in symbolic function.
Onnx conversion finished
Checking onnx model
Warning: Unsupported operator grid_sampler. No schema registered for this operator.
--------------------------------------------------------------
&&&& RUNNING TensorRT.trtexec # /opt/tensorrt/bin/trtexec --onnx=/tmp/panoptic.onnx --saveEngine=/external/engine1.trt
[08/30/2023-15:28:02] [I] === Model Options ===
[08/30/2023-15:28:02] [I] Format: ONNX
[08/30/2023-15:28:02] [I] Model: /tmp/panoptic.onnx
[08/30/2023-15:28:02] [I] Output:
[08/30/2023-15:28:02] [I] === Build Options ===
[08/30/2023-15:28:02] [I] Max batch: 1
[08/30/2023-15:28:02] [I] Workspace: 16 MB
[08/30/2023-15:28:02] [I] minTiming: 1
[08/30/2023-15:28:02] [I] avgTiming: 8
[08/30/2023-15:28:02] [I] Precision: FP32
[08/30/2023-15:28:02] [I] Calibration: 
[08/30/2023-15:28:02] [I] Safe mode: Disabled
[08/30/2023-15:28:02] [I] Save engine: /external/engine1.trt
[08/30/2023-15:28:02] [I] Load engine: 
[08/30/2023-15:28:02] [I] Builder Cache: Enabled
[08/30/2023-15:28:02] [I] NVTX verbosity: 0
[08/30/2023-15:28:02] [I] Inputs format: fp32:CHW
[08/30/2023-15:28:02] [I] Outputs format: fp32:CHW
[08/30/2023-15:28:02] [I] Input build shapes: model
[08/30/2023-15:28:02] [I] Input calibration shapes: model
[08/30/2023-15:28:02] [I] === System Options ===
[08/30/2023-15:28:02] [I] Device: 0
[08/30/2023-15:28:02] [I] DLACore: 
[08/30/2023-15:28:02] [I] Plugins:
[08/30/2023-15:28:02] [I] === Inference Options ===
[08/30/2023-15:28:02] [I] Batch: 1
[08/30/2023-15:28:02] [I] Input inference shapes: model
[08/30/2023-15:28:02] [I] Iterations: 10
[08/30/2023-15:28:02] [I] Duration: 3s (+ 200ms warm up)
[08/30/2023-15:28:02] [I] Sleep time: 0ms
[08/30/2023-15:28:02] [I] Streams: 1
[08/30/2023-15:28:02] [I] ExposeDMA: Disabled
[08/30/2023-15:28:02] [I] Spin-wait: Disabled
[08/30/2023-15:28:02] [I] Multithreading: Disabled
[08/30/2023-15:28:02] [I] CUDA Graph: Disabled
[08/30/2023-15:28:02] [I] Skip inference: Disabled
[08/30/2023-15:28:02] [I] Inputs:
[08/30/2023-15:28:02] [I] === Reporting Options ===
[08/30/2023-15:28:02] [I] Verbose: Disabled
[08/30/2023-15:28:02] [I] Averages: 10 inferences
[08/30/2023-15:28:02] [I] Percentile: 99
[08/30/2023-15:28:02] [I] Dump output: Disabled
[08/30/2023-15:28:02] [I] Profile: Disabled
[08/30/2023-15:28:02] [I] Export timing to JSON file: 
[08/30/2023-15:28:02] [I] Export output to JSON file: 
[08/30/2023-15:28:02] [I] Export profile to JSON file: 
[08/30/2023-15:28:02] [I] 
----------------------------------------------------------------
Input filename:   /tmp/panoptic.onnx
ONNX IR version:  0.0.7
Opset version:    11
Producer name:    pytorch
Producer version: 1.10
Domain:           
Model version:    0
Doc string:
----------------------------------------------------------------
[08/30/2023-15:28:07] [W] [TRT] [TRT]/home/jenkins/workspace/OSS/L0_MergeRequest/oss/parsers/onnx/onnx2trt_utils.cpp:220: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[08/30/2023-15:28:07] [I] [TRT] MatMul_30: broadcasting input1 to make tensors conform, dims(input0)=[1,160,120,96][NONE] dims(input1)=[1,1,96,384][NONE].
....

....
[08/30/2023-15:28:07] [I] [TRT] MatMul_620: broadcasting input1 to make tensors conform, dims(input0)=[1,20,15,3072][NONE] dims(input1)=[1,1,3072,768][NONE].
While parsing node number 752 [Unsqueeze]:
ERROR: /home/jenkins/workspace/OSS/L0_MergeRequest/oss/parsers/onnx/onnx2trt_utils.cpp:188 In function convertAxis:
[8] Assertion failed: axis >= 0 && axis < nbDims
[08/30/2023-15:28:07] [E] Failed to parse onnx file
[08/30/2023-15:28:07] [E] Parsing model failed
[08/30/2023-15:28:07] [E] Engine creation failed
[08/30/2023-15:28:07] [E] Engine set up failed
&&&& FAILED TensorRT.trtexec # /opt/tensorrt/bin/trtexec --onnx=/tmp/panoptic.onnx --saveEngine=/external/engine1.trt
Saved converted tensorrt engine at /external/engine1.trt
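
For anyone reproducing this, the node the parser fails on can be inspected directly with the onnx package (a minimal sketch; trtexec reported node number 752, and graph.node should be in the same order the parser walks):

import onnx

model = onnx.load('/tmp/panoptic.onnx')
node = model.graph.node[752]  # "While parsing node number 752 [Unsqueeze]"
print(node.op_type, node.name)
for attr in node.attribute:
    # For an opset-11 Unsqueeze this prints the axes attribute; a negative
    # axis here is what trips TensorRT 7's convertAxis assertion.
    print(attr.name, onnx.helper.get_attribute_value(attr))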
RunningLeon commented 1 year ago

hi,

  1. ConvNeXt is not supported by mmdeploy as of now, based on this doc. There might be some errors that you need to solve yourself.
  2. As the warning suggests, grid_sampler is a custom op for TensorRT. You have to build mmdeploy's custom ops for TensorRT and load the library by adding --plugins=mmdeploy/lib/libmmdeploy_tensorrt_ops.so to the trtexec command; see the sketch after the quoted warning.

    Unsupported operator grid_sampler. No schema registered for this operator.
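
A sketch of the adjusted trtexec call, in the same subprocess style as your script (the plugin .so path depends on where mmdeploy was built, so treat it as an assumption):

import subprocess

# Load mmdeploy's TensorRT custom-op library so grid_sampler resolves.
plugin_lib = 'mmdeploy/lib/libmmdeploy_tensorrt_ops.so'  # adjust to your build
subprocess.run([
    '/opt/tensorrt/bin/trtexec',
    '--onnx=/tmp/panoptic.onnx',
    '--saveEngine=/external/engine1.trt',
    f'--plugins={plugin_lib}',
], check=True)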

github-actions[bot] commented 1 year ago

This issue is marked as stale because it has been marked as invalid or awaiting response for 7 days without any further response. It will be closed in 5 days if the stale label is not removed or if there is no further response.

github-actions[bot] commented 1 year ago

This issue is closed because it has been stale for 5 days. Please open a new issue if you have similar issues or you have any new updates now.