grimoire / mmdetection-to-tensorrt

convert mmdetection model to tensorrt, support fp16, int8, batch input, dynamic shape etc.
Apache License 2.0

#assertion /amirstan_plugin/src/plugin/batchedNMSPlugin/batchedNMSPlugin.cpp,127 #117

Closed: Kaeseknacker closed this issue 2 years ago

Kaeseknacker commented 2 years ago

During model inference (Python or C++), I get the following assertion failure:

init trt model
[05/25/2022-13:02:10] [TRT] [W] TensorRT was linked against cuBLAS/cuBLASLt 11.6.5 but loaded cuBLAS/cuBLASLt 11.5.1
[05/25/2022-13:02:10] [TRT] [W] TensorRT was linked against cuBLAS/cuBLASLt 11.6.5 but loaded cuBLAS/cuBLASLt 11.5.1
Can not load dataset from config. Use default CLASSES instead.
 took 2.809 s
------------------------
load image(s)
 took 0.036 s
------------------------
warm up detector
/home/spraul/Code/mmdetection_2.22/mmdetection/mmdet/datasets/utils.py:70: UserWarning: "ImageToTensor" pipeline is replaced by "DefaultFormatBundle" for batch inference. It is recommended to manually replace it in the test data pipeline in your config file.
  'data pipeline in your config file.', UserWarning)
#assertion/home/spraul/Code/amirstan_plugin/src/plugin/batchedNMSPlugin/batchedNMSPlugin.cpp,127

Environment:

Collecting environment information...
PyTorch version: 1.11.0
Is debug build: False
CUDA used to build PyTorch: 11.3
OS: Debian GNU/Linux 11 (bullseye) (x86_64)
GCC version: (Debian 10.2.1-6) 10.2.1 20210110
Clang version: 11.0.1-2
CMake version: version 3.18.4
Libc version: glibc-2.17
Python version: 3.7.13 (default, Mar 29 2022, 02:18:16)  [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-5.10.0-8-amd64-x86_64-with-debian-11.0
Is CUDA available: True
CUDA runtime version: 11.6.124
GPU models and configuration: GPU 0: GeForce RTX 2070 SUPER
Nvidia driver version: 460.84
cuDNN version: Could not collect
Versions of relevant libraries:
[pip3] mmcv-full==1.5.1
[pip3] mmdet==2.24.0
[pip3] mmdet2trt==0.5.0
[pip3] tensorrt==8.2.4.2
[pip3] torch==1.11.0
[pip3] torch2trt-dynamic==0.5.0
[pip3] torchaudio==0.11.0
[pip3] torchvision==0.12.0
[conda] ffmpeg                    4.3                  hf484d3e_0    pytorch
[conda] mmcv-full                 1.5.1                    pypi_0    pypi
[conda] mmdet                     2.24.0                    dev_0    <develop>
[conda] mmdet2trt                 0.5.0                     dev_0    <develop>
[conda] pytorch                   1.11.0          py3.7_cuda11.3_cudnn8.2.0_0    pytorch
[conda] pytorch-mutex             1.0                        cuda    pytorch
[conda] tensorrt                  8.2.4.2                  pypi_0    pypi
[conda] torch2trt-dynamic         0.5.0                     dev_0    <develop>
[conda] torchaudio                0.11.0               py37_cu113    pytorch
[conda] torchvision               0.12.0               py37_cu113    pytorch

Conversion seems to be fine. Log:

mmdet2trt --save-engine=true --min-scale 1 3 1056 1056 --opt-scale 1 3 1088 1920 --max-scale 1 3 1952 1952 ../pretrained_models/faster_rcnn_x101_64x4d_fpn_1x_coco.py ../pretrained_models/faster_rcnn_x101_64x4d_fpn_1x_coco_20200204-833ee192.pth trt-detector-coco_frcnn-x101_trt8.trt --fp16 True |& tee trt-detector-coco_frcnn-x101_trt8.log
[05/25/2022-12:44:43] [TRT] [I] [MemUsageChange] Init CUDA: CPU +316, GPU +0, now: CPU 2764, GPU 3073 (MiB)
[05/25/2022-12:44:43] [TRT] [I] [MemUsageSnapshot] Begin constructing builder kernel library: CPU 2764 MiB, GPU 3073 MiB
[05/25/2022-12:44:43] [TRT] [I] [MemUsageSnapshot] End constructing builder kernel library: CPU 2899 MiB, GPU 3105 MiB
[05/25/2022-12:44:44] [TRT] [W] IElementWiseLayer with inputs (Unnamed Layer* 1412) [ElementWise]_output and (Unnamed Layer* 1416) [Shuffle]_output: first input has type Float but second input has type Int32.
[05/25/2022-12:44:44] [TRT] [W] IElementWiseLayer with inputs (Unnamed Layer* 1421) [ElementWise]_output and (Unnamed Layer* 1425) [Shuffle]_output: first input has type Float but second input has type Int32.
[05/25/2022-12:44:44] [TRT] [W] IElementWiseLayer with inputs (Unnamed Layer* 1430) [ElementWise]_output and (Unnamed Layer* 1434) [Shuffle]_output: first input has type Float but second input has type Int32.
[05/25/2022-12:44:44] [TRT] [W] IElementWiseLayer with inputs (Unnamed Layer* 1439) [ElementWise]_output and (Unnamed Layer* 1443) [Shuffle]_output: first input has type Float but second input has type Int32.
[05/25/2022-12:44:45] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 3628, GPU 2055 (MiB)
[05/25/2022-12:44:45] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +117, GPU +56, now: CPU 3745, GPU 2111 (MiB)
[05/25/2022-12:44:45] [TRT] [I] Local timing cache in use. Profiling results in this builder pass will not be stored.
[05/25/2022-12:44:52] [TRT] [I] Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
[05/25/2022-12:48:15] [TRT] [I] Detected 1 inputs and 4 output network tensors.
[05/25/2022-12:48:15] [TRT] [I] Total Host Persistent Memory: 338960
[05/25/2022-12:48:15] [TRT] [I] Total Device Persistent Memory: 205533696
[05/25/2022-12:48:15] [TRT] [I] Total Scratch Memory: 4537344
[05/25/2022-12:48:15] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 528 MiB, GPU 974 MiB
[05/25/2022-12:48:16] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 327.817ms to assign 41 blocks to 326 nodes requiring 638002188 bytes.
[05/25/2022-12:48:16] [TRT] [I] Total Activation Memory: 638002188
[05/25/2022-12:48:16] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 4459, GPU 2637 (MiB)
[05/25/2022-12:48:16] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 4459, GPU 2647 (MiB)
[05/25/2022-12:48:16] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +195, GPU +221, now: CPU 195, GPU 221 (MiB)
[05/25/2022-12:48:16] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 4146, GPU 2623 (MiB)
[05/25/2022-12:48:16] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 4146, GPU 2631 (MiB)
[05/25/2022-12:48:16] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +804, now: CPU 195, GPU 1025 (MiB)
/home/spraul/Code/mmdetection_2.22/mmdetection/mmdet/models/dense_heads/anchor_head.py:123: UserWarning: DeprecationWarning: anchor_generator is deprecated, please use "prior_generator" instead
  warnings.warn('DeprecationWarning: anchor_generator is deprecated, '
/home/spraul/Code/mmdetection_2.22/mmdetection/mmdet/core/anchor/anchor_generator.py:370: UserWarning: ``single_level_grid_anchors`` would be deprecated soon. Please use ``single_level_grid_priors`` 
  '``single_level_grid_anchors`` would be deprecated soon. '
load checkpoint from local path: ../pretrained_models/faster_rcnn_x101_64x4d_fpn_1x_coco_20200204-833ee192.pth
Warning: Encountered known unsupported method torch.Tensor.new_tensor
Warning: Encountered known unsupported method torch.Tensor.new_tensor

grimoire commented 2 years ago

Sorry for the late reply. Would you mind sharing the model config you use?

Kaeseknacker commented 2 years ago
############################
image_scale_test=(1920,1080)
############################
model = dict(
    type='FasterRCNN',
    backbone=dict(
        type='ResNeXt',
        depth=101,
        num_stages=4,
        out_indices=(0, 1, 2, 3),
        frozen_stages=1,
        norm_cfg=dict(type='BN', requires_grad=True),
        norm_eval=True,
        style='pytorch',
        init_cfg=dict(
            type='Pretrained', checkpoint='open-mmlab://resnext101_64x4d'),
        groups=64,
        base_width=4),
    neck=dict(
        type='FPN',
        in_channels=[256, 512, 1024, 2048],
        out_channels=256,
        num_outs=5),
    rpn_head=dict(
        type='RPNHead',
        in_channels=256,
        feat_channels=256,
        anchor_generator=dict(
            type='AnchorGenerator',
            scales=[8],
            ratios=[0.5, 1.0, 2.0],
            strides=[4, 8, 16, 32, 64]),
        bbox_coder=dict(
            type='DeltaXYWHBBoxCoder',
            target_means=[0.0, 0.0, 0.0, 0.0],
            target_stds=[1.0, 1.0, 1.0, 1.0]),
        loss_cls=dict(
            type='CrossEntropyLoss', use_sigmoid=True, loss_weight=1.0),
        loss_bbox=dict(type='L1Loss', loss_weight=1.0)),
    roi_head=dict(
        type='StandardRoIHead',
        bbox_roi_extractor=dict(
            type='SingleRoIExtractor',
            roi_layer=dict(type='RoIAlign', output_size=7, sampling_ratio=0),
            out_channels=256,
            featmap_strides=[4, 8, 16, 32]),
        bbox_head=dict(
            type='Shared2FCBBoxHead',
            in_channels=256,
            fc_out_channels=1024,
            roi_feat_size=7,
            num_classes=80,
            bbox_coder=dict(
                type='DeltaXYWHBBoxCoder',
                target_means=[0.0, 0.0, 0.0, 0.0],
                target_stds=[0.1, 0.1, 0.2, 0.2]),
            reg_class_agnostic=False,
            loss_cls=dict(
                type='CrossEntropyLoss', use_sigmoid=False, loss_weight=1.0),
            loss_bbox=dict(type='L1Loss', loss_weight=1.0))),
    train_cfg=dict(
        rpn=dict(
            assigner=dict(
                type='MaxIoUAssigner',
                pos_iou_thr=0.7,
                neg_iou_thr=0.3,
                min_pos_iou=0.3,
                match_low_quality=True,
                ignore_iof_thr=-1),
            sampler=dict(
                type='RandomSampler',
                num=256,
                pos_fraction=0.5,
                neg_pos_ub=-1,
                add_gt_as_proposals=False),
            allowed_border=-1,
            pos_weight=-1,
            debug=False),
        rpn_proposal=dict(
            nms_pre=2000,
            max_per_img=1000,
            nms=dict(type='nms', iou_threshold=0.7),
            min_bbox_size=0),
        rcnn=dict(
            assigner=dict(
                type='MaxIoUAssigner',
                pos_iou_thr=0.5,
                neg_iou_thr=0.5,
                min_pos_iou=0.5,
                match_low_quality=False,
                ignore_iof_thr=-1),
            sampler=dict(
                type='RandomSampler',
                num=512,
                pos_fraction=0.25,
                neg_pos_ub=-1,
                add_gt_as_proposals=True),
            pos_weight=-1,
            debug=False)),
    test_cfg=dict(
        rpn=dict(
            nms_pre=1000,
            max_per_img=1000,
            nms=dict(type='nms', iou_threshold=0.7),
            min_bbox_size=0),
        rcnn=dict(
            score_thr=0.05,
            nms=dict(type='nms', iou_threshold=0.5),
            max_per_img=100)))
dataset_type = 'CocoDataset'
data_root = 'data/coco/'
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True),
    dict(type='Resize', img_scale=(1333, 800), keep_ratio=True),
    dict(type='RandomFlip', flip_ratio=0.5),
    dict(
        type='Normalize',
        mean=[123.675, 116.28, 103.53],
        std=[58.395, 57.12, 57.375],
        to_rgb=True),
    dict(type='Pad', size_divisor=32),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
]
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=image_scale_test,
        flip=False,
        transforms=[
            dict(type='Resize', keep_ratio=True),
            dict(type='RandomFlip'),
            dict(
                type='Normalize',
                mean=[123.675, 116.28, 103.53],
                std=[58.395, 57.12, 57.375],
                to_rgb=True),
            dict(type='Pad', size_divisor=32),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img'])
        ])
]
data = dict(
    samples_per_gpu=2,
    workers_per_gpu=2,
    train=dict(
        type='CocoDataset',
        ann_file='data/coco/annotations/instances_train2017.json',
        img_prefix='data/coco/train2017/',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(type='LoadAnnotations', with_bbox=True),
            dict(type='Resize', img_scale=(1333, 800), keep_ratio=True),
            dict(type='RandomFlip', flip_ratio=0.5),
            dict(
                type='Normalize',
                mean=[123.675, 116.28, 103.53],
                std=[58.395, 57.12, 57.375],
                to_rgb=True),
            dict(type='Pad', size_divisor=32),
            dict(type='DefaultFormatBundle'),
            dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
        ]),
    val=dict(
        type='CocoDataset',
        ann_file='data/coco/annotations/instances_val2017.json',
        img_prefix='data/coco/val2017/',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(
                type='MultiScaleFlipAug',
                img_scale=(1333, 800),
                flip=False,
                transforms=[
                    dict(type='Resize', keep_ratio=True),
                    dict(type='RandomFlip'),
                    dict(
                        type='Normalize',
                        mean=[123.675, 116.28, 103.53],
                        std=[58.395, 57.12, 57.375],
                        to_rgb=True),
                    dict(type='Pad', size_divisor=32),
                    dict(type='ImageToTensor', keys=['img']),
                    dict(type='Collect', keys=['img'])
                ])
        ]),
    test=dict(
        type='CocoDataset',
        ann_file='data/coco/annotations/instances_val2017.json',
        img_prefix='data/coco/val2017/',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(
                type='MultiScaleFlipAug',
                img_scale=image_scale_test,
                flip=False,
                transforms=[
                    dict(type='Resize', keep_ratio=True),
                    dict(type='RandomFlip'),
                    dict(
                        type='Normalize',
                        mean=[123.675, 116.28, 103.53],
                        std=[58.395, 57.12, 57.375],
                        to_rgb=True),
                    dict(type='Pad', size_divisor=32),
                    dict(type='ImageToTensor', keys=['img']),
                    dict(type='Collect', keys=['img'])
                ])
        ]))
evaluation = dict(interval=1, metric='bbox')
optimizer = dict(type='SGD', lr=0.02, momentum=0.9, weight_decay=0.0001)
optimizer_config = dict(grad_clip=None)
lr_config = dict(
    policy='step',
    warmup='linear',
    warmup_iters=500,
    warmup_ratio=0.001,
    step=[8, 11])
runner = dict(type='EpochBasedRunner', max_epochs=12)
checkpoint_config = dict(interval=1)
log_config = dict(interval=50, hooks=[dict(type='TextLoggerHook')])
custom_hooks = [dict(type='NumClassCheckHook')]
dist_params = dict(backend='nccl')
log_level = 'INFO'
load_from = None
resume_from = None
workflow = [('train', 1)]
opencv_num_threads = 0
mp_start_method = 'fork'

It is a standard Faster R-CNN trained on MS COCO (downloaded from MMDetection); I only changed the test image scale. Can older config files (and/or weight files) cause problems with more recent MMDetection versions?
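
One quick thing to rule out when only the test scale has changed is whether the padded input still fits the shape range the engine was built with. The sketch below is only an illustration: it assumes the source image is already 1920x1080, so that `Resize` with `keep_ratio=True` leaves it unchanged, and the helper names `pad_to_divisor` and `within_profile` are hypothetical, not part of mmdet2trt. The min/max shapes are the ones passed to `--min-scale` and `--max-scale` in the conversion command above.

```python
import math


def pad_to_divisor(height, width, divisor=32):
    """Round each spatial dimension up to the next multiple of `divisor`,
    mirroring what Pad(size_divisor=32) does to the image size."""
    return (math.ceil(height / divisor) * divisor,
            math.ceil(width / divisor) * divisor)


def within_profile(shape, min_shape, max_shape):
    """Return True if every dimension lies inside [min, max]."""
    return all(lo <= s <= hi
               for s, lo, hi in zip(shape, min_shape, max_shape))


# Shape range taken from the mmdet2trt command in this thread.
MIN_SHAPE = (1, 3, 1056, 1056)
MAX_SHAPE = (1, 3, 1952, 1952)

# Assumption: a 1920x1080 source image (width x height).
padded_h, padded_w = pad_to_divisor(1080, 1920)
input_shape = (1, 3, padded_h, padded_w)

print(input_shape, within_profile(input_shape, MIN_SHAPE, MAX_SHAPE))
```

Under that assumption the padded input is 1x3x1088x1920, which matches the `--opt-scale` value and lies inside the allowed range, so the shape profile is unlikely to be the cause here.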

Kaeseknacker commented 2 years ago

I found the problem: my NVIDIA driver was too old. After updating it from 460.84 (CUDA Version 11.2) to 510.73.05 (CUDA Version 11.6), it works.
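
For anyone hitting the same assertion, a rough way to spot this kind of mismatch is to compare the "CUDA Version" that `nvidia-smi` reports for the driver with the CUDA runtime version from the environment dump. The following is a minimal sketch with hypothetical helper names (`parse_version`, `driver_covers_runtime`); it is a conservative heuristic, not an official compatibility check, since NVIDIA's minor-version compatibility can let an older driver work in some cases.

```python
def parse_version(text):
    """Turn a version string such as '11.6.124' into (major, minor)."""
    return tuple(int(part) for part in text.split(".")[:2])


def driver_covers_runtime(driver_cuda, runtime_cuda):
    """Return True if the CUDA version reported by the driver is at least
    as new as the CUDA runtime in use (compared on major.minor only)."""
    return parse_version(driver_cuda) >= parse_version(runtime_cuda)


# Values from this thread: driver 460.84 reports CUDA 11.2,
# driver 510.73.05 reports CUDA 11.6, the runtime is 11.6.124.
print(driver_covers_runtime("11.2", "11.6.124"))  # old driver
print(driver_covers_runtime("11.6", "11.6.124"))  # updated driver
```

With the numbers from this issue, the check fails for the old driver and passes for the updated one, which is consistent with the fix described above.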