open-mmlab / mmdetection3d

OpenMMLab's next-generation platform for general 3D object detection.
https://mmdetection3d.readthedocs.io/en/latest/
Apache License 2.0

[New Models] BEVFusion training unable to achieve officially reported performance (NDS 71.13, mAP 68.36) #2967

Open JiankunShi opened 6 months ago

JiankunShi commented 6 months ago

Model/Dataset/Scheduler description

Hello,

I have been training with the provided checkpoints `bevfusion_lidar-cam_voxel0075_second_secfpn_8xb4-cyclic-20e_nus-3d-5239b1af.pth` and `swint-nuimages-pretrained.pth`. Due to GPU memory constraints, I reduced the `batch_size` in the config from 4 to 3, while keeping all other parameters unchanged. The specific config is as follows:

```python
_base_ = [
    './bevfusion_lidar_voxel0075_second_secfpn_8xb4-cyclic-20e_nus-3d.py'
]

point_cloud_range = [-54.0, -54.0, -5.0, 54.0, 54.0, 3.0]
input_modality = dict(use_lidar=True, use_camera=True)
backend_args = None

model = dict(
    type='BEVFusion',
    data_preprocessor=dict(
        type='Det3DDataPreprocessor',
        mean=[123.675, 116.28, 103.53],
        std=[58.395, 57.12, 57.375],
        bgr_to_rgb=False),
    img_backbone=dict(
        type='mmdet.SwinTransformer',
        embed_dims=96,
        depths=[2, 2, 6, 2],
        num_heads=[3, 6, 12, 24],
        window_size=7,
        mlp_ratio=4,
        qkv_bias=True,
        qk_scale=None,
        drop_rate=0.0,
        attn_drop_rate=0.0,
        drop_path_rate=0.2,
        patch_norm=True,
        out_indices=[1, 2, 3],
        with_cp=False,
        convert_weights=True,
        init_cfg=dict(
            type='Pretrained',
            checkpoint=  # noqa: E251
            'https://github.com/SwinTransformer/storage/releases/download/v1.0.0/swin_tiny_patch4_window7_224.pth'  # noqa: E501
        )),
    img_neck=dict(
        type='GeneralizedLSSFPN',
        in_channels=[192, 384, 768],
        out_channels=256,
        start_level=0,
        num_outs=3,
        norm_cfg=dict(type='BN2d', requires_grad=True),
        act_cfg=dict(type='ReLU', inplace=True),
        upsample_cfg=dict(mode='bilinear', align_corners=False)),
    view_transform=dict(
        type='DepthLSSTransform',
        in_channels=256,
        out_channels=80,
        image_size=[256, 704],
        feature_size=[32, 88],
        xbound=[-54.0, 54.0, 0.3],
        ybound=[-54.0, 54.0, 0.3],
        zbound=[-10.0, 10.0, 20.0],
        dbound=[1.0, 60.0, 0.5],
        downsample=2),
    fusion_layer=dict(
        type='ConvFuser', in_channels=[80, 256], out_channels=256))

train_pipeline = [
    dict(
        type='BEVLoadMultiViewImageFromFiles',
        to_float32=True,
        color_type='color',
        backend_args=backend_args),
    dict(
        type='LoadPointsFromFile',
        coord_type='LIDAR',
        load_dim=5,
        use_dim=5,
        backend_args=backend_args),
    dict(
        type='LoadPointsFromMultiSweeps',
        sweeps_num=9,
        load_dim=5,
        use_dim=5,
        pad_empty_sweeps=True,
        remove_close=True,
        backend_args=backend_args),
    dict(
        type='LoadAnnotations3D',
        with_bbox_3d=True,
        with_label_3d=True,
        with_attr_label=False),
    dict(
        type='ImageAug3D',
        final_dim=[256, 704],
        resize_lim=[0.38, 0.55],
        bot_pct_lim=[0.0, 0.0],
        rot_lim=[-5.4, 5.4],
        rand_flip=True,
        is_train=True),
    dict(
        type='BEVFusionGlobalRotScaleTrans',
        scale_ratio_range=[0.9, 1.1],
        rot_range=[-0.78539816, 0.78539816],
        translation_std=0.5),
    dict(type='BEVFusionRandomFlip3D'),
    dict(type='PointsRangeFilter', point_cloud_range=point_cloud_range),
    dict(type='ObjectRangeFilter', point_cloud_range=point_cloud_range),
    dict(
        type='ObjectNameFilter',
        classes=[
            'car', 'truck', 'construction_vehicle', 'bus', 'trailer',
            'barrier', 'motorcycle', 'bicycle', 'pedestrian', 'traffic_cone'
        ]),
    # Actually, 'GridMask' is not used here (prob=0.0 disables it)
    dict(
        type='GridMask',
        use_h=True,
        use_w=True,
        max_epoch=6,
        rotate=1,
        offset=False,
        ratio=0.5,
        mode=1,
        prob=0.0,
        fixed_prob=True),
    dict(type='PointShuffle'),
    dict(
        type='Pack3DDetInputs',
        keys=[
            'points', 'img', 'gt_bboxes_3d', 'gt_labels_3d', 'gt_bboxes',
            'gt_labels'
        ],
        meta_keys=[
            'cam2img', 'ori_cam2img', 'lidar2cam', 'lidar2img', 'cam2lidar',
            'ori_lidar2img', 'img_aug_matrix', 'box_type_3d', 'sample_idx',
            'lidar_path', 'img_path', 'transformation_3d_flow',
            'pcd_rotation', 'pcd_scale_factor', 'pcd_trans', 'img_aug_matrix',
            'lidar_aug_matrix', 'num_pts_feats'
        ])
]

test_pipeline = [
    dict(
        type='BEVLoadMultiViewImageFromFiles',
        to_float32=True,
        color_type='color',
        backend_args=backend_args),
    dict(
        type='LoadPointsFromFile',
        coord_type='LIDAR',
        load_dim=5,
        use_dim=5,
        backend_args=backend_args),
    dict(
        type='LoadPointsFromMultiSweeps',
        sweeps_num=9,
        load_dim=5,
        use_dim=5,
        pad_empty_sweeps=True,
        remove_close=True,
        backend_args=backend_args),
    dict(
        type='ImageAug3D',
        final_dim=[256, 704],
        resize_lim=[0.48, 0.48],
        bot_pct_lim=[0.0, 0.0],
        rot_lim=[0.0, 0.0],
        rand_flip=False,
        is_train=False),
    dict(
        type='PointsRangeFilter',
        point_cloud_range=[-54.0, -54.0, -5.0, 54.0, 54.0, 3.0]),
    dict(
        type='Pack3DDetInputs',
        keys=['img', 'points', 'gt_bboxes_3d', 'gt_labels_3d'],
        meta_keys=[
            'cam2img', 'ori_cam2img', 'lidar2cam', 'lidar2img', 'cam2lidar',
            'ori_lidar2img', 'img_aug_matrix', 'box_type_3d', 'sample_idx',
            'lidar_path', 'img_path', 'num_pts_feats'
        ])
]

train_dataloader = dict(
    dataset=dict(
        dataset=dict(pipeline=train_pipeline, modality=input_modality)))
val_dataloader = dict(
    dataset=dict(pipeline=test_pipeline, modality=input_modality))
test_dataloader = val_dataloader

param_scheduler = [
    # learning rate scheduler
    dict(
        type='LinearLR',
        start_factor=0.33333333,
        by_epoch=False,
        begin=0,
        end=500),
    dict(
        type='CosineAnnealingLR',
        begin=0,
        T_max=6,
        end=6,
        by_epoch=True,
        eta_min_ratio=1e-4,
        convert_to_iter_based=True),
    # momentum scheduler
    # During the first 2.4 epochs, momentum decreases from 1 to 0.85 / 0.95;
    # during the remaining 3.6 epochs, it increases from 0.85 / 0.95 back to 1.
    dict(
        type='CosineAnnealingMomentum',
        eta_min=0.85 / 0.95,
        begin=0,
        end=2.4,
        by_epoch=True,
        convert_to_iter_based=True),
    dict(
        type='CosineAnnealingMomentum',
        eta_min=1,
        begin=2.4,
        end=6,
        by_epoch=True,
        convert_to_iter_based=True)
]

# runtime settings
train_cfg = dict(by_epoch=True, max_epochs=6, val_interval=1)
val_cfg = dict()
test_cfg = dict()

find_unused_parameters = True

optim_wrapper = dict(
    type='OptimWrapper',
    optimizer=dict(type='AdamW', lr=0.0002, weight_decay=0.01),
    clip_grad=dict(max_norm=30, norm_type=2))

# Default setting for scaling LR automatically
#   - `enable` means enable scaling LR automatically or not by default.
#   - `base_batch_size` = (8 GPUs) x (4 samples per GPU).
auto_scale_lr = dict(enable=False, base_batch_size=32)

default_hooks = dict(
    logger=dict(type='LoggerHook', interval=50),
    checkpoint=dict(type='CheckpointHook', interval=1))

del _base_.custom_hooks
```

However, the training results I achieved were NDS 69.28 and mAP 64.55, which significantly deviate from the expected results. Could you please advise on potential adjustments or any steps I might take to improve these outcomes?

Thank you for your assistance!

Open source status

Provide useful links for the implementation

No response

JiankunShi commented 6 months ago

By the way, my training command is as follows:

```bash
torchrun --nproc_per_node=2 --nnodes=1 --node_rank=0 \
    --master_addr="127.0.0.1" --master_port=29501 \
    tools/train.py configs/bevfusion_lidar-cam_voxel0075_second_secfpn_8xb4-cyclic-20e_nus-3d.py \
    --cfg-options load_from=./bevfusion_lidar_voxel0075_second_secfpn_8xb4-cyclic-20e_nus-3d-2628f933.pth \
    model.img_backbone.init_cfg.checkpoint=./swint-nuimages-pretrained.pth \
    --launcher pytorch
```

Mingshouqun commented 5 months ago

Hello, I have the same question. Could you tell me roughly what your final average loss is?

2397025988 commented 3 months ago

Hi @JiankunShi, if you want to change the batch size, it is customary to set it to an even number. At the same time, the lr also needs to be scaled in the same proportion as the batch size.
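
For concreteness, a minimal sketch of that proportional scaling, assuming the linear scaling rule (a common heuristic, not an official recipe) and the 2-GPU × 3-sample launch from the torchrun command above:

```python
# Linear LR scaling: scale the learning rate by the ratio of
# effective batch sizes (actual / reference).
base_lr = 0.0002           # lr in the reference config's optim_wrapper
base_batch_size = 8 * 4    # reference: 8 GPUs x 4 samples per GPU = 32
actual_batch_size = 2 * 3  # this run: 2 GPUs x 3 samples per GPU = 6

scaled_lr = base_lr * actual_batch_size / base_batch_size
print(scaled_lr)  # 3.75e-05
```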

Da1symeeting1 commented 3 months ago

@JiankunShi Hello, have you tried increasing the number of training epochs for BEVFusion? When I tried 10 epochs, the loss started climbing to around 16 at the 7th epoch.

mdessl commented 1 month ago

@JiankunShi I suspect the issue comes from leaving the line auto_scale_lr = dict(enable=False, base_batch_size=32) at enable=False even though the batch size is different. If I am not mistaken, the effective batch size in your example is 6 (2 GPUs × 3), whereas in the provided config it is 32 (8 GPUs × 4).
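
If that is the cause, a minimal sketch of the fix could look like this, assuming MMEngine's standard auto_scale_lr mechanism (which multiplies the configured lr by actual effective batch size / base_batch_size at runtime):

```python
# In the user config: keep base_batch_size at the reference value (8 x 4)
# so the runner can derive the scaling factor from the actual world size
# and per-GPU batch size, e.g. 6 / 32 for a 2-GPU x 3-sample run.
auto_scale_lr = dict(enable=True, base_batch_size=32)
```

Alternatively, if the training script exposes the usual `--auto-scale-lr` flag, passing it should have the same effect without editing the config.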