open-mmlab / mmdetection3d

OpenMMLab's next-generation platform for general 3D object detection.
https://mmdetection3d.readthedocs.io/en/latest/
Apache License 2.0

[Bug] Poor training results when trying to configure for camera-only BEVFusion #3024

Open abubake opened 3 months ago

abubake commented 3 months ago

Prerequisite

Task

I have modified the scripts/configs, or I'm working on my own tasks/models/datasets.

Branch

main branch https://github.com/open-mmlab/mmdetection3d

Environment

sys.platform: linux
Python: 3.10.14 (main, Jul 8 2024, 14:50:49) [GCC 12.3.0]
CUDA available: True
numpy_random_seed: 2147483648
GPU 0,1,2,3: NVIDIA GeForce GTX 1080 Ti
CUDA_HOME: /usr/local/cuda-12.1
NVCC: Cuda compilation tools, release 12.1, V12.1.66
GCC: gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
PyTorch: 2.1.2+cu121
PyTorch compiling details: PyTorch built with:
TorchVision: 0.16.2+cu121
OpenCV: 4.9.0
MMEngine: 0.10.2
MMDetection: 3.3.0
MMDetection3D: 1.4.0+161d091
spconv2.0: False

Reproduces the problem - code sample

'''
_base_ lists the base configuration files. Config files follow a system of inheritance: much like inheriting from a class,
this config contains everything defined in default_runtime.py.
The same ideas that apply to class inheritance apply here. If you want to change something from default_runtime,
copy that key into this file and modify it, just as you would override a method in a subclass.

custom_imports imports the modules within the BEVFusion project that are needed to run the code.
'''
_base_ = ['../../../configs/_base_/default_runtime.py']
custom_imports = dict(
    imports=['projects.BEVFusion.bevfusion'], allow_failed_imports=False)
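# A hypothetical illustration of overriding an inherited setting: default_runtime.py
# defines default_hooks, so writing, for example,
#   default_hooks = dict(logger=dict(type='LoggerHook', interval=10))
# in this file merges over the inherited value (MMEngine merges dicts key by key;
# adding _delete_=True inside a dict replaces the inherited dict entirely).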

'''
point_cloud_range specifies the geometric space (in meters) that the point clouds can occupy.
voxel_size indicates the size in meters of each dimension of the cells that make up the BEV grid
(the map where predictions from BEVFusion are made).
'''
point_cloud_range = [-51.2, -51.2, -5.0, 51.2, 51.2, 3.0] # TODO: step through for more info
# point_cloud_range = [-54.0, -54.0, -5.0, 54.0, 54.0, 3.0]
# voxel_size = [0.075, 0.075, 0.2] # this voxel size made it actually have a mAP of 0!
voxel_size = [0.1, 0.1, 0.2]
# image_size = [256, 704]
# post_center_range = [-64.0, -64.0, -10.0, 64.0, 64.0, 10.0]
post_center_range = [-61.2, -61.2, -10.0, 61.2, 61.2, 10.0] # this matches what I see for det in MIT # TODO: step through for more info
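# Quick sanity check on the arithmetic (assuming out_size_factor=8 as set in the head below):
# the x/y extent is 51.2 - (-51.2) = 102.4 m, so a 0.1 m voxel gives a 102.4 / 0.1 = 1024-cell
# BEV grid per side, matching grid_size=[1024, 1024, 1] in train_cfg, and the CenterHead
# heatmap is then 1024 / 8 = 128 x 128 cells.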

'''
Class names used for all object detection tasks. With nuScenes, we train and evaluate on 6 different detection tasks, where the combination of
object classes in each task varies. For example, in the head below, task 0 contains only car, while task 1 contains truck and construction_vehicle.
'''
class_names = [
    'car', 'truck', 'construction_vehicle', 'bus', 'trailer', 'barrier',
    'motorcycle', 'bicycle', 'pedestrian', 'traffic_cone'
]
'''
metainfo passes the class names from the config in the format the code expects.
dataset_type and data_root specify 1. the dataset class being used (for other datasets such as KITTI a dataset class is defined similarly)
and 2. the relative path to the nuScenes dataset.

data_prefix tells the NuScenesDataset object which sensors are being used; this can include camera and lidar sensors.
In this case, we include only the 6 cameras available in the nuScenes dataset.
'''
metainfo = dict(classes=class_names)    #, version='v1.0-mini')
dataset_type = 'NuScenesDataset'
data_root = 'data/nuscenes/'

data_prefix = dict(
    CAM_FRONT='samples/CAM_FRONT',
    CAM_FRONT_LEFT='samples/CAM_FRONT_LEFT',
    CAM_FRONT_RIGHT='samples/CAM_FRONT_RIGHT',
    CAM_BACK='samples/CAM_BACK',
    CAM_BACK_RIGHT='samples/CAM_BACK_RIGHT',
    CAM_BACK_LEFT='samples/CAM_BACK_LEFT'
    )

'''
input_modality specifies which sensor modalities are used, which affects what data the dataset loads and packs for each sample.
'''
input_modality = dict(use_lidar=False, use_camera=True) # TODO: determine the effect of use_lidar=False
backend_args = None # TODO: find out what backend_args is used for

'''
MODEL DEFINITION
- OpenMMLab's way of defining deep learning models.

- type: specifies the detector being used (here the BEVFusion project model).
- data_preprocessor: Det3DDataPreprocessor is a general mmdetection3d preprocessing class that works for lidar, vision-only, and multimodal input.
- img_backbone: the model that performs the initial transformation of image data into features.
    * mmdet.SwinTransformer
- img_neck: the component that takes the multi-scale outputs of the backbone and further refines the features.
- view_transform: LSSTransform lifts the refined image features from the camera views into the BEV grid.
- pts_backbone / pts_neck: further process the BEV feature map.
- bbox_head: CenterHead predicts 3D boxes from the BEV features.
'''
model = dict(
    type='BEVFusion',
    data_preprocessor=dict(
        type='Det3DDataPreprocessor',
        pad_size_divisor=32,
        # voxelize_cfg=dict(
        #     max_num_points=10,
        #     point_cloud_range=point_cloud_range,
        #     voxel_size=voxel_size,
        #     max_voxels=[120000, 160000],
        #     voxelize_reduce=True),
        mean=[123.675, 116.28, 103.53],
        std=[58.395, 57.12, 57.375],
        bgr_to_rgb=False),
    img_backbone=dict(
        type='mmdet.SwinTransformer',
        embed_dims=96,
        depths=[2, 2, 6, 2],
        num_heads=[3, 6, 12, 24],
        window_size=7,
        mlp_ratio=4,
        qkv_bias=True,
        qk_scale=None,
        drop_rate=0.0,
        attn_drop_rate=0.0,
        drop_path_rate=0.2,
        patch_norm=True,
        out_indices=[1, 2, 3],
        with_cp=False,
        convert_weights=True,
        init_cfg=dict(
            type='Pretrained',
            checkpoint=  # noqa: E251
            'https://github.com/SwinTransformer/storage/releases/download/v1.0.0/swin_tiny_patch4_window7_224.pth'  # noqa: E501
        )),
    img_neck=dict(
        type='GeneralizedLSSFPN',
        in_channels=[192, 384, 768],
        out_channels=256,
        start_level=0,
        num_outs=3,
        norm_cfg=dict(type='BN2d', requires_grad=True),
        act_cfg=dict(type='ReLU', inplace=True),
        upsample_cfg=dict(mode='bilinear', align_corners=False)),
    view_transform=dict(
        type='LSSTransform',
        in_channels=256,
        out_channels=80,
        image_size=[256, 704],
        feature_size=[32, 88],
        # xbound=[-54.0, 54.0, 0.3],
        xbound=[-51.2, 51.2, 0.4],
        ybound=[-51.2, 51.2, 0.4],
        # ybound=[-54.0, 54.0, 0.3],
        zbound=[-10.0, 10.0, 20.0],
        dbound=[1.0, 60.0, 0.5],
        downsample=2),
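    # Grid consistency check (assuming LSSTransform builds (range / step) BEV cells per axis):
    # (51.2 - (-51.2)) / 0.4 = 256 cells, halved by downsample=2 to a 128 x 128 BEV feature map,
    # which is consistent with the 1024 / 8 = 128 heatmap grid expected by the bbox_head below.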
    pts_backbone=dict(
        type='GeneralizedResNet',
        in_channels=80,
        blocks=[[2, 128, 2],
                [2, 256, 2],
                [2, 512, 1]]),
    pts_neck=dict(
        type='LSSFPN',
        in_indices=[-1,0],
        in_channels=[512, 128],
        out_channels=256,
        scale_factor=2),
    bbox_head=dict(
        type='CenterHead', # changed from CustomCenterHead back to CenterHead
        in_channels=256,
        tasks=[
            dict(num_class=1, class_names=['car']),
            dict(num_class=2, class_names=['truck', 'construction_vehicle']),
            dict(num_class=2, class_names=['bus', 'trailer']),
            dict(num_class=1, class_names=['barrier']),
            dict(num_class=2, class_names=['motorcycle', 'bicycle']),
            dict(num_class=2, class_names=['pedestrian', 'traffic_cone']),
        ],
        common_heads=dict(
            reg=(2, 2), height=(1, 2), dim=(3, 2), rot=(2, 2), vel=(2, 2)),
        share_conv_channel=64,
        bbox_coder=dict(
            type='CenterPointBBoxCoder', # modified from CustomCenterPointBBoxCoder
            post_center_range=post_center_range,
            pc_range=point_cloud_range,
            max_num=500,
            score_threshold=0.1,
            out_size_factor=8,            
            voxel_size=voxel_size[:2],
            code_size=9),
        separate_head=dict(
            type='SeparateHead', init_bias=-2.19, final_kernel=3),
        loss_cls=dict(type='mmdet.GaussianFocalLoss', reduction='mean'),
        loss_bbox=dict(
            type='mmdet.L1Loss', reduction='mean', loss_weight=0.25),
        norm_bbox=True,
        train_cfg=dict(
            dataset='nuScenes',
            point_cloud_range=point_cloud_range,
            grid_size=[1024, 1024, 1],
            # grid_size=[1440, 1440, 41],
            voxel_size=voxel_size,
            out_size_factor=8,
            dense_reg=1,
            gaussian_overlap=0.1,
            max_objs=500,
            min_radius=2,
            code_weights=[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.2, 0.2]
        ),
        test_cfg=dict(
            dataset='nuScenes',
            post_center_limit_range=post_center_range,
            max_per_img=500,
            max_pool_nms=False,
            min_radius=[4, 12, 10, 1, 0.85, 0.175],
            score_threshold=0.1,
            pc_range=point_cloud_range[:2], # the reference config used [0:2], which is equivalent
            out_size_factor=8,
            voxel_size=voxel_size[:2],
            nms_type='circle', # alternatively a per-task list: ['circle', 'circle', 'circle', 'circle', 'circle', 'circle']
            pre_max_size=1000,
            post_max_size=83,
            nms_thr=0.2)
    )
)

train_pipeline = [
    dict(
        type='BEVLoadMultiViewImageFromFiles',
        to_float32=False, # was float32; what if we change it?
        color_type='color',
        backend_args=backend_args),
    dict(
        type='LoadAnnotations3D',
        with_bbox_3d=True,
        with_label_3d=True,
        with_attr_label=False),
    # dict(type='ObjectSample', db_sampler=db_sampler),
    dict(
        type='ImageAug3D',
        final_dim=[256, 704],
        resize_lim=[0.38, 0.55],
        bot_pct_lim=[0.0, 0.0],
        rot_lim=[-5.4, 5.4],
        rand_flip=True,
        is_train=True),
    dict(type='BEVFusionRandomFlip3D'), # temporarily commented out
    dict(type='ObjectRangeFilter', point_cloud_range=point_cloud_range),
    dict(
        type='ObjectNameFilter',
        classes=[
            'car', 'truck', 'construction_vehicle', 'bus', 'trailer',
            'barrier', 'motorcycle', 'bicycle', 'pedestrian', 'traffic_cone'
        ]),
    dict(
        type='GridMask',
        use_h=True,
        use_w=True,
        rotate=1,
        offset=False,
        ratio=0.5,
        mode=1,
        prob=0,
        max_epoch=20,
    ),
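    # Note: with prob=0 the GridMask augmentation above is never applied, so it is
    # effectively disabled here.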
    # dict(type='PointShuffle'),
    dict(
        type='Pack3DDetInputs',
        keys=[
            'points', 'img', 'gt_bboxes_3d', 'gt_labels_3d', 'gt_bboxes',
            'gt_labels'
        ],
        meta_keys=[
            'cam2img', 'ori_cam2img', 'lidar2cam', 'lidar2img', 'cam2lidar',
            'ori_lidar2img', 'img_aug_matrix', 'box_type_3d', 'sample_idx',
            'lidar_path', 'img_path', 'transformation_3d_flow',
            #'pcd_rotation','pcd_scale_factor', 'pcd_trans', 
            'img_aug_matrix',
            #'lidar_aug_matrix', 'num_pts_feats'
        ])
]

test_pipeline = [
    dict(
        type='BEVLoadMultiViewImageFromFiles', # no BEV prefix in MIT
        to_float32=True,
        color_type='color',
        backend_args=backend_args), # what are the backend args being used??
    dict( # MIT's config includes another transform here, LoadAnnotations3D
        type='ImageAug3D',
        final_dim=[256, 704],
        resize_lim=[0.48, 0.48],
        bot_pct_lim=[0.0, 0.0],
        rot_lim=[0.0, 0.0],
        rand_flip=False,
        is_train=False),
    # dict(
    #     type='PointsRangeFilter',
    #     point_cloud_range=point_cloud_range),
    dict(
        type='Pack3DDetInputs',
        keys=['img', 'points', 'gt_bboxes_3d', 'gt_labels_3d'],
        meta_keys=[
            'cam2img', 'ori_cam2img', 'lidar2cam', 'lidar2img', 'cam2lidar',
            'ori_lidar2img', 'img_aug_matrix', 'box_type_3d', 'sample_idx',
            'lidar_path', 'img_path', 'num_pts_feats', 'num_views'
        ])
]

train_dataloader = dict(
    batch_size=1, # changed from 2 to 1
    num_workers=1, # changed from 4 back to 1
    persistent_workers=True,
    sampler=dict(type='DefaultSampler', shuffle=True), #shuffle
    dataset=dict(
        type='CBGSDataset',
        dataset=dict(
            type=dataset_type,
            data_root=data_root,
            ann_file='nuscenes_infos_train.pkl',
            pipeline=train_pipeline,
            metainfo=metainfo,
            modality=input_modality,
            test_mode=False,
            data_prefix=data_prefix,
            use_valid_flag=True,
            # we use box_type_3d='LiDAR' in kitti and nuscenes dataset
            # and box_type_3d='Depth' in sunrgbd and scannet dataset.
            box_type_3d='LiDAR')))
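# Note: CBGSDataset (class-balanced grouping and sampling) resamples the wrapped dataset
# so that rare classes appear more often, which makes each "epoch" noticeably longer than
# one pass over the raw nuScenes training split.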
val_dataloader = dict(
    batch_size=1,
    num_workers=1,
    persistent_workers=True,
    drop_last=False,
    sampler=dict(type='DefaultSampler', shuffle=False),
    dataset=dict(
        type=dataset_type,
        data_root=data_root,
        ann_file='nuscenes_infos_val.pkl',
        pipeline=test_pipeline,
        metainfo=metainfo,
        modality=input_modality,
        data_prefix=data_prefix,
        test_mode=True, # test_mode was True; perhaps that does not make sense for the val_dataloader?
        box_type_3d='LiDAR',
        backend_args=backend_args))
test_dataloader = val_dataloader

val_evaluator = dict(
    type='NuScenesMetric',
    data_root=data_root,
    ann_file=data_root + 'nuscenes_infos_val.pkl',
    metric='bbox',
    backend_args=backend_args)
test_evaluator = val_evaluator

vis_backends = [dict(type='LocalVisBackend')]
visualizer = dict(
    type='Det3DLocalVisualizer', vis_backends=vis_backends, name='visualizer')

# learning rate
# lr = 0.0001
lr = 2e-5 # changed from 2e-4
param_scheduler = [
    # learning rate scheduler
    # During the first 8 epochs, the learning rate anneals from lr up to lr * 6
    # (originally lr * 10); during the next 12 epochs it anneals back down to
    # lr * 1e-2 (originally lr * 1e-4).
    dict(
        type='CosineAnnealingLR',
        T_max=8,
        eta_min=lr * 6, # changed from 10
        begin=0,
        end=8,
        by_epoch=True,
        convert_to_iter_based=True),
    dict(
        type='CosineAnnealingLR',
        T_max=12,
        eta_min=lr * 1e-2, # changed from -4
        begin=8,
        end=20,
        by_epoch=True,
        convert_to_iter_based=True),
    # momentum scheduler
    # During the first 8 epochs, momentum increases from 0 to 0.85 / 0.95
    # during the next 12 epochs, momentum increases from 0.85 / 0.95 to 1
    dict(
        type='CosineAnnealingMomentum',
        T_max=8,
        eta_min=0.85 / 0.95,
        begin=0,
        end=8,
        by_epoch=True,
        convert_to_iter_based=True),
    dict(
        type='CosineAnnealingMomentum',
        T_max=12,
        eta_min=1,
        begin=8,
        end=20,
        by_epoch=True,
        convert_to_iter_based=True)
]

# runtime settings
train_cfg = dict(by_epoch=True, max_epochs=20, val_interval=1) # Do Kyoung had changed this to 10
val_cfg = dict()
test_cfg = dict()

'''
load_from and resume:

load_from: specifies the path to a pretrained or partially trained checkpoint whose weights you want to continue training from.
            Setting load_from to None trains from scratch.

            Here is an example of how you might use load_from to start training from a pretrained model:

            load_from = "/home/a0271391/code/edgeai-mmdetection3d/projects/BEVFusion/models/camera-only-det_converted_copy.pth"

resume: resume=True means you want to resume training from the specific epoch and step saved in the checkpoint. If you do not care about resuming
        training exactly where it previously stopped, you do not need to set resume=True. Only set it to True if the checkpoint loaded with load_from was trained
        to a specific point (e.g. epoch 7, step 19200/30000) and you want to continue from there.
'''
load_from = None
resume = False # resume from the checkpoint defined in load_from
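# A hypothetical example of resuming an interrupted run (paths are illustrative only):
# load_from = 'work_dirs/bevfusion_cam_swint/epoch_7.pth'
# resume = True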

optim_wrapper = dict(
    type='OptimWrapper',
    optimizer=dict(type='AdamW', lr=lr, weight_decay=0.01),
    clip_grad=dict(max_norm=35, norm_type=2))

# Default setting for scaling LR automatically
#   - `enable` means enable scaling LR automatically
#       or not by default.
#   - `base_batch_size` = (8 GPUs) x (4 samples per GPU).
auto_scale_lr = dict(enable=False, base_batch_size=1)
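# If enabled, MMEngine applies the linear scaling rule: the lr is multiplied by
# (num_gpus * samples_per_gpu) / base_batch_size. For example (illustrative numbers),
# 4 GPUs x 1 sample with base_batch_size=32 would scale the lr by 4 / 32 = 0.125.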
log_processor = dict(window_size=50)

'''
HOOKS

Objects that operate on the running training loop, such as logging information at the end of an epoch.
Hooks are defined in mmdet3d/engine. Their purpose is often to add new behavior on top of a predefined Python module.

EX: Suppose you want to add extra data to your dataloader every 3 epochs while training a model. You could modify the training source code, or you could write a
hook that adds that functionality on top of the base code. Then all you have to do is register that hook in the config, or leave it out if you want the
base functionality.

Here, hooks are used for logging information such as the time taken to train an epoch and for saving checkpoints.
The DisableObjectSampleHook simply stops the ObjectSample augmentation of the training data after a specified epoch (epoch 15).
'''
default_hooks = dict(
    logger=dict(type='LoggerHook', interval=50),
    checkpoint=dict(type='CheckpointHook', interval=1))
custom_hooks = [dict(type='DisableObjectSampleHook', disable_after_epoch=15)]
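# Note: ObjectSample is commented out of train_pipeline above, so in this camera-only
# config the DisableObjectSampleHook has nothing to disable and is effectively a no-op.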

Reproduces the problem - command or script

bash tools/dist_train.sh projects/BEVFusion/configs/bevfusion_cam_swint_centerpoint_nus-3d.py 4

Reproduces the problem - error message

No error message; the issue is that even after 20 epochs, the result is an extremely poor mAP and NDS. The loss gets down to about 6.x.

Additional information

  1. I expected training results to be similar to MIT's camera-only results.
  2. I used the nuScenes dataset.
  3. I suspect there is an issue in my configuration file. I have included the configuration I have been using for image-only BEVFusion.
ymlab commented 2 months ago

Same problem.

gorkemguzeler commented 1 month ago

hi @ymlab @abubake,

I have a question regarding the training:

I am curious how much time the training takes per epoch and how many GPUs you use. I am particularly interested in lidar-only training, if you have any experience with that.

abubake commented 1 month ago

Hi, training with 4 GPUs took several hours per epoch, both for camera-only and when I tried lidar-only. I don't remember the exact time per epoch, but it was about 4-5 days for 20 epochs, which is roughly 5 to 6 hours per epoch.

gorkemguzeler commented 1 month ago

Thanks a lot for sharing your experience @abubake, it helps! Were you able to reproduce good results (comparable to the paper) with lidar-only training?

I plan to work with this repository for my thesis and don't want to waste time if the code is not working as expected. Therefore any feedback is valuable to me :)

mdessl commented 1 month ago

@gorkemguzeler the repo is working as expected for me. Haven't trained lidar-only but I got 65 mAP after 3 epochs of training the bevfusion model with the lidar-only base. Oh and it took 2h per epoch on 8x 3090 with bs 2 and lr scaling enabled.

Btw we are in the same boat. I am also doing my thesis on multimodal learning :)

curiosity654 commented 1 month ago

@mdessl Hi, I'm also working on multimodal 3D detection. I'm curious whether by bs 2 you mean 2 samples per GPU or 2 for the whole 8 GPUs, since the 3080 seems to have only 12 GB of memory. I have trained the BEVFusion from this repo on 2x A5000 with a batch size of 4 (with lr scaling) and cannot match the reported 71.4 NDS. After using gradient accumulation to simulate a batch size of 32, the performance is much better, at approximately 70.9 NDS.
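For reference, a minimal sketch of enabling gradient accumulation in the config above, assuming MMEngine's standard accumulative_counts option (the factor 4 is illustrative: 2 GPUs x 4 samples x 4 accumulation steps gives an effective batch of 32):

optim_wrapper = dict(
    type='OptimWrapper',
    optimizer=dict(type='AdamW', lr=lr, weight_decay=0.01),
    clip_grad=dict(max_norm=35, norm_type=2),
    # accumulate gradients over 4 iterations before each optimizer step
    accumulative_counts=4)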

For the multimodal setting, my concern is that the camera branch of this repo is too dependent on LiDAR, as it uses DepthLSS instead of the original LSS transform.

mdessl commented 1 month ago

@curiosity654 ohh sry it was a typo. I meant 3090 (24G RAM), so bs 2 per GPU.

Do you think the issue could have to do with the batchnorm layers? I think BN is not so compatible with gradient accumulation and I am not sure what you could do about it.

gorkemguzeler commented 1 month ago

@mdessl , thanks a lot for the feedback 👍

oh, good luck on your thesis :)