open-mmlab / mmdetection3d

OpenMMLab's next-generation platform for general 3D object detection.
https://mmdetection3d.readthedocs.io/en/latest/
Apache License 2.0

[New Models] BEVFusion training does not reach the officially reported performance (NDS 71.13, mAP 68.36) #2967

Open JiankunShi opened 2 months ago

JiankunShi commented 2 months ago

Model/Dataset/Scheduler description

Content: Hello,

I have been training using the provided checkpoints bevfusion_lidar-cam_voxel0075_second_secfpn_8xb4-cyclic-20e_nus-3d-5239b1af.pth and swint-nuimages-pretrained.pth. Due to GPU memory constraints, I reduced the batch_size in the config from 4 to 3 and kept all other parameters unchanged. The full config is as follows:

`_base_ = [
    './bevfusion_lidar_voxel0075_second_secfpn_8xb4-cyclic-20e_nus-3d.py'
]
point_cloud_range = [-54.0, -54.0, -5.0, 54.0, 54.0, 3.0]
input_modality = dict(use_lidar=True, use_camera=True)
backend_args = None

model = dict(
    type='BEVFusion',
    data_preprocessor=dict(
        type='Det3DDataPreprocessor',
        mean=[123.675, 116.28, 103.53],
        std=[58.395, 57.12, 57.375],
        bgr_to_rgb=False),
    img_backbone=dict(
        type='mmdet.SwinTransformer',
        embed_dims=96,
        depths=[2, 2, 6, 2],
        num_heads=[3, 6, 12, 24],
        window_size=7,
        mlp_ratio=4,
        qkv_bias=True,
        qk_scale=None,
        drop_rate=0.0,
        attn_drop_rate=0.0,
        drop_path_rate=0.2,
        patch_norm=True,
        out_indices=[1, 2, 3],
        with_cp=False,
        convert_weights=True,
        init_cfg=dict(
            type='Pretrained',
            checkpoint=  # noqa: E251
            'https://github.com/SwinTransformer/storage/releases/download/v1.0.0/swin_tiny_patch4_window7_224.pth'  # noqa: E501
        )),
    img_neck=dict(
        type='GeneralizedLSSFPN',
        in_channels=[192, 384, 768],
        out_channels=256,
        start_level=0,
        num_outs=3,
        norm_cfg=dict(type='BN2d', requires_grad=True),
        act_cfg=dict(type='ReLU', inplace=True),
        upsample_cfg=dict(mode='bilinear', align_corners=False)),
    view_transform=dict(
        type='DepthLSSTransform',
        in_channels=256,
        out_channels=80,
        image_size=[256, 704],
        feature_size=[32, 88],
        xbound=[-54.0, 54.0, 0.3],
        ybound=[-54.0, 54.0, 0.3],
        zbound=[-10.0, 10.0, 20.0],
        dbound=[1.0, 60.0, 0.5],
        downsample=2),
    fusion_layer=dict(
        type='ConvFuser', in_channels=[80, 256], out_channels=256))

train_pipeline = [
    dict(
        type='BEVLoadMultiViewImageFromFiles',
        to_float32=True,
        color_type='color',
        backend_args=backend_args),
    dict(
        type='LoadPointsFromFile',
        coord_type='LIDAR',
        load_dim=5,
        use_dim=5,
        backend_args=backend_args),
    dict(
        type='LoadPointsFromMultiSweeps',
        sweeps_num=9,
        load_dim=5,
        use_dim=5,
        pad_empty_sweeps=True,
        remove_close=True,
        backend_args=backend_args),
    dict(
        type='LoadAnnotations3D',
        with_bbox_3d=True,
        with_label_3d=True,
        with_attr_label=False),
    dict(
        type='ImageAug3D',
        final_dim=[256, 704],
        resize_lim=[0.38, 0.55],
        bot_pct_lim=[0.0, 0.0],
        rot_lim=[-5.4, 5.4],
        rand_flip=True,
        is_train=True),
    dict(
        type='BEVFusionGlobalRotScaleTrans',
        scale_ratio_range=[0.9, 1.1],
        rot_range=[-0.78539816, 0.78539816],
        translation_std=0.5),
    dict(type='BEVFusionRandomFlip3D'),
    dict(type='PointsRangeFilter', point_cloud_range=point_cloud_range),
    dict(type='ObjectRangeFilter', point_cloud_range=point_cloud_range),
    dict(
        type='ObjectNameFilter',
        classes=[
            'car', 'truck', 'construction_vehicle', 'bus', 'trailer',
            'barrier', 'motorcycle', 'bicycle', 'pedestrian', 'traffic_cone'
        ]),
    # Actually, 'GridMask' is not used here
    dict(
        type='GridMask',
        use_h=True,
        use_w=True,
        max_epoch=6,
        rotate=1,
        offset=False,
        ratio=0.5,
        mode=1,
        prob=0.0,
        fixed_prob=True),
    dict(type='PointShuffle'),
    dict(
        type='Pack3DDetInputs',
        keys=[
            'points', 'img', 'gt_bboxes_3d', 'gt_labels_3d', 'gt_bboxes',
            'gt_labels'
        ],
        meta_keys=[
            'cam2img', 'ori_cam2img', 'lidar2cam', 'lidar2img', 'cam2lidar',
            'ori_lidar2img', 'img_aug_matrix', 'box_type_3d', 'sample_idx',
            'lidar_path', 'img_path', 'transformation_3d_flow', 'pcd_rotation',
            'pcd_scale_factor', 'pcd_trans', 'img_aug_matrix',
            'lidar_aug_matrix', 'num_pts_feats'
        ])
]

test_pipeline = [
    dict(
        type='BEVLoadMultiViewImageFromFiles',
        to_float32=True,
        color_type='color',
        backend_args=backend_args),
    dict(
        type='LoadPointsFromFile',
        coord_type='LIDAR',
        load_dim=5,
        use_dim=5,
        backend_args=backend_args),
    dict(
        type='LoadPointsFromMultiSweeps',
        sweeps_num=9,
        load_dim=5,
        use_dim=5,
        pad_empty_sweeps=True,
        remove_close=True,
        backend_args=backend_args),
    dict(
        type='ImageAug3D',
        final_dim=[256, 704],
        resize_lim=[0.48, 0.48],
        bot_pct_lim=[0.0, 0.0],
        rot_lim=[0.0, 0.0],
        rand_flip=False,
        is_train=False),
    dict(
        type='PointsRangeFilter',
        point_cloud_range=[-54.0, -54.0, -5.0, 54.0, 54.0, 3.0]),
    dict(
        type='Pack3DDetInputs',
        keys=['img', 'points', 'gt_bboxes_3d', 'gt_labels_3d'],
        meta_keys=[
            'cam2img', 'ori_cam2img', 'lidar2cam', 'lidar2img', 'cam2lidar',
            'ori_lidar2img', 'img_aug_matrix', 'box_type_3d', 'sample_idx',
            'lidar_path', 'img_path', 'num_pts_feats'
        ])
]

train_dataloader = dict(
    dataset=dict(
        dataset=dict(pipeline=train_pipeline, modality=input_modality)))
val_dataloader = dict(
    dataset=dict(pipeline=test_pipeline, modality=input_modality))
test_dataloader = val_dataloader

param_scheduler = [
    dict(
        type='LinearLR',
        start_factor=0.33333333,
        by_epoch=False,
        begin=0,
        end=500),
    dict(
        type='CosineAnnealingLR',
        begin=0,
        T_max=6,
        end=6,
        by_epoch=True,
        eta_min_ratio=1e-4,
        convert_to_iter_based=True),
    # momentum scheduler
    # During the first 8 epochs, momentum increases from 1 to 0.85 / 0.95
    # during the next 12 epochs, momentum increases from 0.85 / 0.95 to 1
    dict(
        type='CosineAnnealingMomentum',
        eta_min=0.85 / 0.95,
        begin=0,
        end=2.4,
        by_epoch=True,
        convert_to_iter_based=True),
    dict(
        type='CosineAnnealingMomentum',
        eta_min=1,
        begin=2.4,
        end=6,
        by_epoch=True,
        convert_to_iter_based=True)
]

# runtime settings
train_cfg = dict(by_epoch=True, max_epochs=6, val_interval=1)
val_cfg = dict()
test_cfg = dict()
find_unused_parameters = True
optim_wrapper = dict(
    type='OptimWrapper',
    optimizer=dict(type='AdamW', lr=0.0002, weight_decay=0.01),
    clip_grad=dict(max_norm=30, norm_type=2))

# Default setting for scaling LR automatically
#   - enable means enable scaling LR automatically
#       or not by default.
#   - base_batch_size = (8 GPUs) x (4 samples per GPU).
auto_scale_lr = dict(enable=False, base_batch_size=32)

default_hooks = dict(
    logger=dict(type='LoggerHook', interval=50),
    checkpoint=dict(type='CheckpointHook', interval=1))

del _base_.custom_hooks`
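
For readers following the two CosineAnnealingMomentum entries in the config, here is a minimal sketch of the momentum curve they describe, using the closed-form cosine interpolation. The phase boundaries (2.4 and 6 epochs) and the target 0.85 / 0.95 come from the config. The function names and the starting value of 0.9 (AdamW's default beta1) are assumptions for illustration, and MMEngine's own per-iteration update may differ in detail. Note that the inherited comments mention 8 and 12 epochs, whereas this config uses 2.4 and 3.6 epochs, which is the same 40/60 split of the schedule.

```python
import math


def cosine_anneal(start: float, end: float, progress: float) -> float:
    """Cosine interpolation from `start` (progress=0) to `end` (progress=1)."""
    progress = min(max(progress, 0.0), 1.0)  # clamp to the valid range
    return end + (start - end) * (1 + math.cos(math.pi * progress)) / 2


def momentum_at(epoch: float, initial: float = 0.9) -> float:
    """Hypothetical helper: momentum at a fractional epoch of the 6-epoch run.

    `initial=0.9` is an assumption (AdamW's default beta1), not a value
    stated in the thread.
    """
    low = 0.85 / 0.95      # eta_min of the first scheduler
    phase_one_end = 2.4    # `end` of the first scheduler, in epochs
    total_epochs = 6.0     # `end` of the second scheduler, in epochs
    if epoch < phase_one_end:
        return cosine_anneal(initial, low, epoch / phase_one_end)
    return cosine_anneal(
        low, 1.0, (epoch - phase_one_end) / (total_epochs - phase_one_end))


for e in (0.0, 1.2, 2.4, 4.2, 6.0):
    print(f"epoch {e:.1f}: momentum {momentum_at(e):.4f}")
```

Under these assumptions the momentum first dips towards 0.85 / 0.95 and then rises towards 1 by the end of training.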

However, the training results I achieved were NDS 69.28 and mAP 64.55, which fall short of the reported NDS 71.13 and mAP 68.36 by 1.85 and 3.81 points, respectively. Could you please advise on adjustments or steps I could take to close this gap?

Thank you for your assistance!


JiankunShi commented 2 months ago

By the way, my training command is as follows:

torchrun --nproc_per_node=2 --nnodes=1 --node_rank=0 --master_addr="127.0.0.1" --master_port=29501 tools/train.py configs/bevfusion_lidar-cam_voxel0075_second_secfpn_8xb4-cyclic-20e_nus-3d.py --cfg-option load_from=./bevfusion_lidar_voxel0075_second_secfpn_8xb4-cyclic-20e_nus-3d-2628f933.pth model.img_backbone.init_cfg.checkpoint=./swint-nuimages-pretrained.pth --launcher pytorch
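
One thing that may be worth checking, given this command and the config: with --nproc_per_node=2 and a per-GPU batch size of 3, the effective batch size is 6, while the config's comment describes a reference of 8 GPUs x 4 samples = 32, and auto_scale_lr is disabled, so lr=0.0002 is used unchanged. The sketch below works through the commonly used linear scaling rule for that case. It is a heuristic, the thread does not confirm it as the cause of the gap, and the helper names are hypothetical.

```python
def effective_batch_size(num_gpus: int, samples_per_gpu: int) -> int:
    """Total number of samples contributing to one optimizer step."""
    return num_gpus * samples_per_gpu


def linearly_scaled_lr(base_lr: float, base_batch_size: int,
                       batch_size: int) -> float:
    """Linear scaling rule: keep lr / batch_size constant."""
    return base_lr * batch_size / base_batch_size


# Reference setup described in the config comment: 8 GPUs x 4 samples.
reference = effective_batch_size(num_gpus=8, samples_per_gpu=4)
# Setup from the thread: 2 processes (torchrun) x batch size 3.
actual = effective_batch_size(num_gpus=2, samples_per_gpu=3)

scaled = linearly_scaled_lr(base_lr=2e-4, base_batch_size=reference,
                            batch_size=actual)
print(f"reference batch size: {reference}")
print(f"actual batch size:    {actual}")
print(f"linearly scaled lr:   {scaled:.3e}")  # roughly 3.75e-05
```

Whether scaling the learning rate this way actually recovers the reported numbers would need to be verified by a training run.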

Mingshouqun commented 3 weeks ago

Hello, I have the same problem. Could you tell me roughly what your final average loss was?