open-mmlab / mmdetection3d

OpenMMLab's next-generation platform for general 3D object detection.
https://mmdetection3d.readthedocs.io/en/latest/
Apache License 2.0

[Bug] Is training accuracy related to batch_size in bevfusion? #2897

Open wzqforever opened 9 months ago

wzqforever commented 9 months ago

Prerequisite

Task

I'm using the official example scripts/configs for the officially supported tasks/models/datasets.

Branch

main branch https://github.com/open-mmlab/mmdetection3d

Environment


System environment:
  sys.platform: linux
  Python: 3.8.18 (default, Sep 11 2023, 13:40:15) [GCC 11.2.0]
  CUDA available: True
  numpy_random_seed: 1155412052
  GPU 0,1,2,3,4,5: Tesla V100S-PCIE-32GB
  CUDA_HOME: /home/guanjingchao/cuda-11.6
  NVCC: Cuda compilation tools, release 11.6, V11.6.55
  GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
  PyTorch: 1.13.1+cu116
  PyTorch compiling details: PyTorch built with:

Runtime environment:
  cudnn_benchmark: False
  mp_cfg: {'mp_start_method': 'fork', 'opencv_num_threads': 0}
  dist_cfg: {'backend': 'nccl'}
  seed: 1155412052
  Distributed launcher: pytorch
  Distributed training: True
  GPU number: 6

Reproduces the problem - code sample

_base_ = ['../../../configs/_base_/default_runtime.py']
custom_imports = dict(
    imports=['projects.BEVFusion.bevfusion'], allow_failed_imports=False)

# model settings
# Voxel size for voxel encoder
# Usually the voxel size is changed consistently with the point cloud range.
# If the point cloud range is modified, do remember to change all related
# keys in the config.
voxel_size = [0.075, 0.075, 0.2]
point_cloud_range = [-54.0, -54.0, -5.0, 54.0, 54.0, 3.0]
class_names = [
    'car', 'truck', 'construction_vehicle', 'bus', 'trailer', 'barrier',
    'motorcycle', 'bicycle', 'pedestrian', 'traffic_cone'
]

metainfo = dict(classes=class_names)
dataset_type = 'NuScenesDataset'
data_root = '/home/guanjingchao/datasets/nuscenes/'  # full nuScenes dataset
# data_root = '/home/guanjingchao/datasets/nuscenes-mini/'  # nuScenes-mini dataset
data_prefix = dict(
    pts='samples/LIDAR_TOP',
    CAM_FRONT='samples/CAM_FRONT',
    CAM_FRONT_LEFT='samples/CAM_FRONT_LEFT',
    CAM_FRONT_RIGHT='samples/CAM_FRONT_RIGHT',
    CAM_BACK='samples/CAM_BACK',
    CAM_BACK_RIGHT='samples/CAM_BACK_RIGHT',
    CAM_BACK_LEFT='samples/CAM_BACK_LEFT',
    sweeps='sweeps/LIDAR_TOP')
input_modality = dict(use_lidar=True, use_camera=False)

# backend_args = dict(
#     backend='petrel',
#     path_mapping=dict({
#         './data/nuscenes/':
#         's3://openmmlab/datasets/detection3d/nuscenes/',
#         'data/nuscenes/':
#         's3://openmmlab/datasets/detection3d/nuscenes/',
#         './data/nuscenes_mini/':
#         's3://openmmlab/datasets/detection3d/nuscenes/',
#         'data/nuscenes_mini/':
#         's3://openmmlab/datasets/detection3d/nuscenes/'
#     }))
backend_args = None

model = dict(
    type='BEVFusion',
    data_preprocessor=dict(
        type='Det3DDataPreprocessor',
        pad_size_divisor=32,
        voxelize_cfg=dict(
            max_num_points=10,
            point_cloud_range=[-54.0, -54.0, -5.0, 54.0, 54.0, 3.0],
            voxel_size=[0.075, 0.075, 0.2],
            max_voxels=[120000, 160000],
            voxelize_reduce=True)),
    # this is already handled inside the voxelize function in bevfusion.py,
    # so this setting has no effect
    pts_voxel_encoder=dict(type='HardSimpleVFE', num_features=5),

pts_middle_encoder=dict(
    type='BEVFusionSparseEncoder',
    in_channels=5,
    sparse_shape=[1440, 1440, 41],
    order=('conv', 'norm', 'act'),
    norm_cfg=dict(type='BN1d', eps=0.001, momentum=0.01),
    encoder_channels=((16, 16, 32), (32, 32, 64), (64, 64, 128), (128,
                                                                  128)),
    encoder_paddings=((0, 0, 1), (0, 0, 1), (0, 0, (1, 1, 0)), (0, 0)),
    block_type='basicblock'),
pts_backbone=dict(
    type='SECOND',    # backbone network
    in_channels=256,
    out_channels=[128, 256],
    layer_nums=[5, 5],
    layer_strides=[1, 2],
    norm_cfg=dict(type='BN', eps=0.001, momentum=0.01),
    conv_cfg=dict(type='Conv2d', bias=False)),
pts_neck=dict(
    type='SECONDFPN',   # neck network
    in_channels=[128, 256],
    out_channels=[256, 256],
    upsample_strides=[1, 2],
    norm_cfg=dict(type='BN', eps=0.001, momentum=0.01),
    upsample_cfg=dict(type='deconv', bias=False),
    use_conv_for_no_stride=True),
bbox_head=dict(
    type='TransFusionHead',
    num_proposals=200,
    auxiliary=True,
    in_channels=512,
    hidden_channel=128,
    num_classes=10,
    nms_kernel_size=3,
    bn_momentum=0.1,
    num_decoder_layers=1,
    decoder_layer=dict(
        type='TransformerDecoderLayer',
        self_attn_cfg=dict(embed_dims=128, num_heads=8, dropout=0.1),
        cross_attn_cfg=dict(embed_dims=128, num_heads=8, dropout=0.1),
        ffn_cfg=dict(
            embed_dims=128,
            feedforward_channels=256,
            num_fcs=2,
            ffn_drop=0.1,
            act_cfg=dict(type='ReLU', inplace=True),
        ),
        norm_cfg=dict(type='LN'),
        pos_encoding_cfg=dict(input_channel=2, num_pos_feats=128)),
    train_cfg=dict(
        dataset='nuScenes',
        point_cloud_range=[-54.0, -54.0, -5.0, 54.0, 54.0, 3.0],
        grid_size=[1440, 1440, 41],
        voxel_size=[0.075, 0.075, 0.2],
        out_size_factor=8,
        gaussian_overlap=0.1,
        min_radius=2,
        pos_weight=-1,
        code_weights=[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.2, 0.2],
        assigner=dict(
            type='HungarianAssigner3D',
            iou_calculator=dict(type='BboxOverlaps3D', coordinate='lidar'),
            cls_cost=dict(
                type='mmdet.FocalLossCost',
                gamma=2.0,
                alpha=0.25,
                weight=0.15),
            reg_cost=dict(type='BBoxBEVL1Cost', weight=0.25),
            iou_cost=dict(type='IoU3DCost', weight=0.25))),
    test_cfg=dict(
        dataset='nuScenes',
        grid_size=[1440, 1440, 41],
        out_size_factor=8,
        voxel_size=[0.075, 0.075],
        pc_range=[-54.0, -54.0],
        nms_type=None),
    common_heads=dict(
        center=[2, 2], height=[1, 2], dim=[3, 2], rot=[2, 2], vel=[2, 2]),
    bbox_coder=dict(
        type='TransFusionBBoxCoder',
        pc_range=[-54.0, -54.0],
        post_center_range=[-61.2, -61.2, -10.0, 61.2, 61.2, 10.0],
        score_threshold=0.0,
        out_size_factor=8,
        voxel_size=[0.075, 0.075],
        code_size=10),
    loss_cls=dict(
        type='mmdet.FocalLoss',
        use_sigmoid=True,
        gamma=2.0,
        alpha=0.25,
        reduction='mean',
        loss_weight=1.0),
    loss_heatmap=dict(
        type='mmdet.GaussianFocalLoss', reduction='mean', loss_weight=1.0),
    loss_bbox=dict(
        type='mmdet.L1Loss', reduction='mean', loss_weight=0.25)))

db_sampler = dict(
    data_root=data_root,
    info_path=data_root + 'nuscenes_dbinfos_train.pkl',
    rate=1.0,
    prepare=dict(
        filter_by_difficulty=[-1],
        filter_by_min_points=dict(
            car=5,
            truck=5,
            bus=5,
            trailer=5,
            construction_vehicle=5,
            traffic_cone=5,
            barrier=5,
            motorcycle=5,
            bicycle=5,
            pedestrian=5)),
    classes=class_names,
    sample_groups=dict(
        car=2,
        truck=3,
        construction_vehicle=7,
        bus=4,
        trailer=6,
        barrier=2,
        motorcycle=6,
        bicycle=6,
        pedestrian=2,
        traffic_cone=2),
    points_loader=dict(
        type='LoadPointsFromFile',
        coord_type='LIDAR',
        load_dim=5,
        use_dim=[0, 1, 2, 3, 4],
        backend_args=backend_args))

train_pipeline = [
    dict(
        type='LoadPointsFromFile',
        coord_type='LIDAR',
        load_dim=5,
        use_dim=5,
        backend_args=backend_args),
    dict(
        type='LoadPointsFromMultiSweeps',
        sweeps_num=9,
        load_dim=5,
        use_dim=5,
        pad_empty_sweeps=True,
        remove_close=True,
        backend_args=backend_args),
    dict(
        type='LoadAnnotations3D',
        with_bbox_3d=True,
        with_label_3d=True,
        with_attr_label=False),
    dict(type='ObjectSample', db_sampler=db_sampler),
    dict(
        type='GlobalRotScaleTrans',
        scale_ratio_range=[0.9, 1.1],
        rot_range=[-0.78539816, 0.78539816],
        translation_std=0.5),
    dict(type='BEVFusionRandomFlip3D'),
    dict(type='PointsRangeFilter', point_cloud_range=point_cloud_range),
    dict(type='ObjectRangeFilter', point_cloud_range=point_cloud_range),
    dict(
        type='ObjectNameFilter',
        classes=[
            'car', 'truck', 'construction_vehicle', 'bus', 'trailer',
            'barrier', 'motorcycle', 'bicycle', 'pedestrian', 'traffic_cone'
        ]),
    dict(type='PointShuffle'),
    dict(
        type='Pack3DDetInputs',
        keys=[
            'points', 'img', 'gt_bboxes_3d', 'gt_labels_3d', 'gt_bboxes',
            'gt_labels'
        ],
        meta_keys=[
            'cam2img', 'ori_cam2img', 'lidar2cam', 'lidar2img', 'cam2lidar',
            'ori_lidar2img', 'img_aug_matrix', 'box_type_3d', 'sample_idx',
            'lidar_path', 'img_path', 'transformation_3d_flow',
            'pcd_rotation', 'pcd_scale_factor', 'pcd_trans', 'img_aug_matrix',
            'lidar_aug_matrix'
        ])
]

test_pipeline = [
    dict(
        type='LoadPointsFromFile',
        coord_type='LIDAR',
        load_dim=5,
        use_dim=5,
        backend_args=backend_args),
    dict(
        type='LoadPointsFromMultiSweeps',
        sweeps_num=9,
        load_dim=5,
        use_dim=5,
        pad_empty_sweeps=True,
        remove_close=True,
        backend_args=backend_args),
    dict(
        type='PointsRangeFilter',
        point_cloud_range=[-54.0, -54.0, -5.0, 54.0, 54.0, 3.0]),
    dict(
        type='Pack3DDetInputs',
        keys=['img', 'points', 'gt_bboxes_3d', 'gt_labels_3d'],
        meta_keys=[
            'cam2img', 'ori_cam2img', 'lidar2cam', 'lidar2img', 'cam2lidar',
            'ori_lidar2img', 'img_aug_matrix', 'box_type_3d', 'sample_idx',
            'lidar_path', 'img_path', 'num_pts_feats', 'num_views'
        ])
]

train_dataloader = dict(
    batch_size=4,
    num_workers=4,
    persistent_workers=True,
    sampler=dict(type='DefaultSampler', shuffle=True),
    dataset=dict(
        type='CBGSDataset',
        dataset=dict(
            type=dataset_type,
            data_root=data_root,
            ann_file='nuscenes_infos_train.pkl',
            pipeline=train_pipeline,
            metainfo=metainfo,
            modality=input_modality,
            test_mode=False,
            data_prefix=data_prefix,
            use_valid_flag=True,
            # we use box_type_3d='LiDAR' in kitti and nuscenes dataset
            # and box_type_3d='Depth' in sunrgbd and scannet dataset.
            box_type_3d='LiDAR')))

val_dataloader = dict(
    batch_size=1,
    num_workers=4,
    persistent_workers=True,
    drop_last=False,
    sampler=dict(type='DefaultSampler', shuffle=False),
    dataset=dict(
        type=dataset_type,
        data_root=data_root,
        ann_file='nuscenes_infos_val.pkl',
        pipeline=test_pipeline,
        metainfo=metainfo,
        modality=input_modality,
        data_prefix=data_prefix,
        test_mode=True,
        box_type_3d='LiDAR',
        backend_args=backend_args))
test_dataloader = val_dataloader

val_evaluator = dict(
    type='NuScenesMetric',
    data_root=data_root,
    ann_file=data_root + 'nuscenes_infos_val.pkl',
    metric='bbox',
    backend_args=backend_args)
test_evaluator = val_evaluator

vis_backends = [dict(type='LocalVisBackend')]
visualizer = dict(
    type='Det3DLocalVisualizer', vis_backends=vis_backends, name='visualizer')

# learning rate
lr = 0.0001
param_scheduler = [
    # learning rate scheduler
    # During the first 8 epochs, learning rate increases from 0 to lr * 10
    # during the next 12 epochs, learning rate decreases from lr * 10 to
    # lr * 1e-4
dict(
    type='CosineAnnealingLR',
    T_max=8,
    eta_min=lr * 10,
    begin=0,
    end=8,
    by_epoch=True,
    convert_to_iter_based=True),
dict(
    type='CosineAnnealingLR',
    T_max=12,
    eta_min=lr * 1e-4,
    begin=8,
    end=20,
    by_epoch=True,
    convert_to_iter_based=True),
# momentum scheduler
# During the first 8 epochs, momentum increases from 0 to 0.85 / 0.95
# during the next 12 epochs, momentum increases from 0.85 / 0.95 to 1
dict(
    type='CosineAnnealingMomentum',
    T_max=8,
    eta_min=0.85 / 0.95,
    begin=0,
    end=8,
    by_epoch=True,
    convert_to_iter_based=True),
dict(
    type='CosineAnnealingMomentum',
    T_max=12,
    eta_min=1,
    begin=8,
    end=20,
    by_epoch=True,
    convert_to_iter_based=True)

]

# runtime settings
train_cfg = dict(by_epoch=True, max_epochs=20, val_interval=1)
val_cfg = dict()
test_cfg = dict()

optim_wrapper = dict(
    type='OptimWrapper',
    optimizer=dict(type='AdamW', lr=lr, weight_decay=0.01),
    clip_grad=dict(max_norm=35, norm_type=2))

# Default setting for scaling LR automatically
#   - `enable` means enable scaling LR automatically
#       or not by default.
#   - `base_batch_size` = (8 GPUs) x (4 samples per GPU).
auto_scale_lr = dict(enable=False, base_batch_size=32)  #2258
log_processor = dict(window_size=50)

default_hooks = dict(
    logger=dict(type='LoggerHook', interval=50),
    checkpoint=dict(type='CheckpointHook', interval=5))
custom_hooks = [dict(type='DisableObjectSampleHook', disable_after_epoch=15)]

find_unused_parameters=True
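
For context on the question below: the `auto_scale_lr` entry above is what mmengine uses to compensate for a different effective batch size; when enabled, the optimizer LR is multiplied by (effective batch size) / base_batch_size. Here is a minimal sketch of that linear scaling rule using the batch sizes from this report; the helper function is illustrative, not mmengine source code:

def linear_scaled_lr(base_lr: float, gpus: int, samples_per_gpu: int,
                     base_batch_size: int = 32) -> float:
    """Linear LR scaling rule, as applied by mmengine when
    auto_scale_lr.enable=True: lr is scaled by effective_batch / base_batch."""
    effective_batch = gpus * samples_per_gpu
    return base_lr * effective_batch / base_batch_size

# Numbers from this issue (6 GPUs, base lr = 0.0001, base_batch_size = 32):
print(linear_scaled_lr(0.0001, 6, 4))  # lidar-only, effective batch 24 -> 7.5e-05
print(linear_scaled_lr(0.0001, 6, 2))  # fusion, effective batch 12 -> 3.75e-05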

Reproduces the problem - command or script

CUDA_VISIBLE_DEVICES="2,3,4,5,6,7" bash tools/dist_train.sh projects/BEVFusion/configs/bevfusion_lidar-cam_voxel0075_second_secfpn_8xb4-cyclic-20e_nus-3d.py 6 --cfg-options load_from=work_dirs/lidar/lidar_epoch_20.pth model.img_backbone.init_cfg.checkpoint=pre/swint-nuimages-pretrained.pth --amp

Reproduces the problem - error message

When I train the lidar-only model with batch_size=4, I can reproduce the accuracy reported in the paper. However, when I trained the image+lidar fusion model, I had to set batch_size=2 due to insufficient GPU memory, and the accuracy only reached around mAP=0.66. I then set batch_size=4 again but trained in fp16 mode, which reached mAP=0.67. Is this the cause? Is training accuracy related to batch_size in bevfusion? Is the gap caused by training lidar-only with batch_size=4 but image+lidar with batch_size=2? If so, what should I do?
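
If the gap is indeed driven by the smaller effective batch size, two commonly used mitigations in mmengine-based configs are sketched below. Neither is verified for BEVFusion specifically, so treat them as starting points rather than the official recipe; `accumulative_counts` is a standard mmengine OptimWrapper argument.

# Option A: let mmengine rescale the LR linearly against the shipped
# base_batch_size (32 = 8 GPUs x 4 samples per GPU).
auto_scale_lr = dict(enable=True, base_batch_size=32)

# Option B: keep batch_size=2 but restore the effective batch via gradient
# accumulation: 2 samples x 6 GPUs x 2 steps = 24, matching the
# lidar-only run (4 x 6 = 24).
optim_wrapper = dict(
    type='OptimWrapper',
    optimizer=dict(type='AdamW', lr=0.0001, weight_decay=0.01),
    clip_grad=dict(max_norm=35, norm_type=2),
    accumulative_counts=2)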

Additional information

No response

Manishnayak234 commented 9 months ago

Hi bro, can you specify the versions you have installed for all the libraries? I am not able to run bash scripts/convert_data.py.

wzqforever commented 9 months ago

I am using the main branch of mmdetection3d. With this version, you need to use tools/create_data.py to generate the nuScenes info files. You can refer to https://mmdetection3d.readthedocs.io/en/latest/user_guides/dataset_prepare.html.

python tools/create_data.py nuscenes --root-path ./data/nuscenes --out-dir ./data/nuscenes --extra-tag nuscenes
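
As a quick sanity check that the info files were generated, the snippet below loads one of the expected outputs. The path and keys assume the default layout produced by the command above and the new-style info format of recent mmdetection3d; adjust if your --out-dir differs:

import pickle

# Default output of tools/create_data.py with --extra-tag nuscenes
with open('./data/nuscenes/nuscenes_infos_train.pkl', 'rb') as f:
    infos = pickle.load(f)

# Recent mmdetection3d info files are dicts with 'metainfo' and 'data_list';
# if yours differ, the data may come from an older converter version.
print(infos.get('metainfo', {}).get('dataset'))
print(len(infos.get('data_list', [])))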

@Manishnayak234