megvii-research / PETR

[ECCV2022] PETR: Position Embedding Transformation for Multi-View 3D Object Detection & [ICCV2023] PETRv2: A Unified Framework for 3D Perception from Multi-Camera Images

Cannot reproduce PETR #86

Open maggiesong7 opened 1 year ago

maggiesong7 commented 1 year ago

When running [petr_r50dcn_gridmask_p4.py](https://github.com/megvii-research/PETR/blob/main/projects/configs/petr/petr_r50dcn_gridmask_p4.py), I got the following results: mAP: 0.3022, mATE: 0.8507, mASE: 0.2785, mAOE: 0.6519, mAVE: 1.0027, mAAE: 0.2668, NDS: 0.3463, Eval time: 302.2s

This is much lower than the reported result. Also, when I set with_position=False, the accuracy is extremely low: 0.0887 mAP and 0.2230 NDS.
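For readers checking the numbers: the NDS in this log is consistent with the other metrics. The nuScenes detection score combines mAP with the five true-positive error metrics as NDS = (5 * mAP + sum(1 - min(1, err))) / 10. A small check in Python (the function name `nds` is only for illustration):

```python
def nds(m_ap, tp_errors):
    """nuScenes detection score: (5 * mAP + sum(1 - min(1, err))) / 10."""
    # Each error is clipped at 1 before being turned into a score,
    # so an error above 1 (here mAVE = 1.0027) contributes 0.
    tp_scores = [1.0 - min(1.0, err) for err in tp_errors]
    return (5.0 * m_ap + sum(tp_scores)) / 10.0


# Values copied from the log above, in the order mATE, mASE, mAOE, mAVE, mAAE.
errors = [0.8507, 0.2785, 0.6519, 1.0027, 0.2668]
score = nds(0.3022, errors)
print(round(score, 4))  # 0.3463, matching the NDS in the log
```

So the low NDS follows directly from the low mAP and the large translation, orientation, and velocity errors rather than from an evaluation glitch.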

yingfei1016 commented 1 year ago

Hi, did you modify the batch size or other parameters? Can you share your config?

maggiesong7 commented 1 year ago

Hi, I only changed the dataset path. Here is my config:

point_cloud_range = [-51.2, -51.2, -5.0, 51.2, 51.2, 3.0]
class_names = ['car', 'truck', 'construction_vehicle', 'bus', 'trailer', 'barrier', 'motorcycle', 'bicycle', 'pedestrian', 'traffic_cone']
dataset_type = 'CustomNuScenesDataset'
data_root = './data/nuscenes/'
input_modality = dict(use_lidar=False, use_camera=True, use_radar=False, use_map=False, use_external=False)
file_client_args = dict(backend='disk')
train_pipeline = [
    dict(type='LoadMultiViewImageFromFiles', to_float32=True),
    dict(type='LoadAnnotations3D', with_bbox_3d=True, with_label_3d=True, with_attr_label=False),
    dict(type='ObjectRangeFilter', point_cloud_range=[-51.2, -51.2, -5.0, 51.2, 51.2, 3.0]),
    dict(type='ObjectNameFilter', classes=['car', 'truck', 'construction_vehicle', 'bus', 'trailer', 'barrier', 'motorcycle', 'bicycle', 'pedestrian', 'traffic_cone']),
    dict(type='ResizeCropFlipImage', data_aug_conf=dict(resize_lim=(0.8, 1.0), final_dim=(512, 1408), bot_pct_lim=(0.0, 0.0), rot_lim=(0.0, 0.0), H=900, W=1600, rand_flip=True), training=True),
    dict(type='GlobalRotScaleTransImage', rot_range=[-0.3925, 0.3925], translation_std=[0, 0, 0], scale_ratio_range=[0.95, 1.05], reverse_angle=True, training=True),
    dict(type='NormalizeMultiviewImage', mean=[103.53, 116.28, 123.675], std=[1.0, 1.0, 1.0], to_rgb=False),
    dict(type='PadMultiViewImage', size_divisor=32),
    dict(type='DefaultFormatBundle3D', class_names=['car', 'truck', 'construction_vehicle', 'bus', 'trailer', 'barrier', 'motorcycle', 'bicycle', 'pedestrian', 'traffic_cone']),
    dict(type='Collect3D', keys=['gt_bboxes_3d', 'gt_labels_3d', 'img'])
]
test_pipeline = [
    dict(type='LoadMultiViewImageFromFiles', to_float32=True),
    dict(type='ResizeCropFlipImage', data_aug_conf=dict(resize_lim=(0.8, 1.0), final_dim=(512, 1408), bot_pct_lim=(0.0, 0.0), rot_lim=(0.0, 0.0), H=900, W=1600, rand_flip=True), training=False),
    dict(type='NormalizeMultiviewImage', mean=[103.53, 116.28, 123.675], std=[1.0, 1.0, 1.0], to_rgb=False),
    dict(type='PadMultiViewImage', size_divisor=32),
    dict(type='MultiScaleFlipAug3D', img_scale=(1333, 800), pts_scale_ratio=1, flip=False, transforms=[
        dict(type='DefaultFormatBundle3D', class_names=['car', 'truck', 'construction_vehicle', 'bus', 'trailer', 'barrier', 'motorcycle', 'bicycle', 'pedestrian', 'traffic_cone'], with_label=False),
        dict(type='Collect3D', keys=['img'])
    ])
]
eval_pipeline = [
    dict(type='LoadPointsFromFile', coord_type='LIDAR', load_dim=5, use_dim=5, file_client_args=dict(backend='disk')),
    dict(type='LoadPointsFromMultiSweeps', sweeps_num=10, file_client_args=dict(backend='disk')),
    dict(type='DefaultFormatBundle3D', class_names=['car', 'truck', 'trailer', 'bus', 'construction_vehicle', 'bicycle', 'motorcycle', 'pedestrian', 'traffic_cone', 'barrier'], with_label=False),
    dict(type='Collect3D', keys=['points'])
]
data = dict(
    samples_per_gpu=1,
    workers_per_gpu=4,
    train=dict(
        type='CustomNuScenesDataset',
        data_root='./data/nuscenes/',
        ann_file='./data/nuscenes/nuscenes_infos_train.pkl',
        pipeline=[
            dict(type='LoadMultiViewImageFromFiles', to_float32=True),
            dict(type='LoadAnnotations3D', with_bbox_3d=True, with_label_3d=True, with_attr_label=False),
            dict(type='ObjectRangeFilter', point_cloud_range=[-51.2, -51.2, -5.0, 51.2, 51.2, 3.0]),
            dict(type='ObjectNameFilter', classes=['car', 'truck', 'construction_vehicle', 'bus', 'trailer', 'barrier', 'motorcycle', 'bicycle', 'pedestrian', 'traffic_cone']),
            dict(type='ResizeCropFlipImage', data_aug_conf=dict(resize_lim=(0.8, 1.0), final_dim=(512, 1408), bot_pct_lim=(0.0, 0.0), rot_lim=(0.0, 0.0), H=900, W=1600, rand_flip=True), training=True),
            dict(type='GlobalRotScaleTransImage', rot_range=[-0.3925, 0.3925], translation_std=[0, 0, 0], scale_ratio_range=[0.95, 1.05], reverse_angle=True, training=True),
            dict(type='NormalizeMultiviewImage', mean=[103.53, 116.28, 123.675], std=[1.0, 1.0, 1.0], to_rgb=False),
            dict(type='PadMultiViewImage', size_divisor=32),
            dict(type='DefaultFormatBundle3D', class_names=['car', 'truck', 'construction_vehicle', 'bus', 'trailer', 'barrier', 'motorcycle', 'bicycle', 'pedestrian', 'traffic_cone']),
            dict(type='Collect3D', keys=['gt_bboxes_3d', 'gt_labels_3d', 'img'])
        ],
        classes=['car', 'truck', 'construction_vehicle', 'bus', 'trailer', 'barrier', 'motorcycle', 'bicycle', 'pedestrian', 'traffic_cone'],
        modality=dict(use_lidar=False, use_camera=True, use_radar=False, use_map=False, use_external=False),
        test_mode=False,
        box_type_3d='LiDAR',
        use_valid_flag=True),
    val=dict(
        type='CustomNuScenesDataset',
        data_root='data/nuscenes/',
        ann_file='data/nuscenes/nuscenes_infos_val.pkl',
        pipeline=[
            dict(type='LoadMultiViewImageFromFiles', to_float32=True),
            dict(type='ResizeCropFlipImage', data_aug_conf=dict(resize_lim=(0.8, 1.0), final_dim=(512, 1408), bot_pct_lim=(0.0, 0.0), rot_lim=(0.0, 0.0), H=900, W=1600, rand_flip=True), training=False),
            dict(type='NormalizeMultiviewImage', mean=[103.53, 116.28, 123.675], std=[1.0, 1.0, 1.0], to_rgb=False),
            dict(type='PadMultiViewImage', size_divisor=32),
            dict(type='MultiScaleFlipAug3D', img_scale=(1333, 800), pts_scale_ratio=1, flip=False, transforms=[
                dict(type='DefaultFormatBundle3D', class_names=['car', 'truck', 'construction_vehicle', 'bus', 'trailer', 'barrier', 'motorcycle', 'bicycle', 'pedestrian', 'traffic_cone'], with_label=False),
                dict(type='Collect3D', keys=['img'])
            ])
        ],
        classes=['car', 'truck', 'construction_vehicle', 'bus', 'trailer', 'barrier', 'motorcycle', 'bicycle', 'pedestrian', 'traffic_cone'],
        modality=dict(use_lidar=False, use_camera=True, use_radar=False, use_map=False, use_external=False),
        test_mode=True,
        box_type_3d='LiDAR'),
    test=dict(
        type='CustomNuScenesDataset',
        data_root='data/nuscenes/',
        ann_file='data/nuscenes/nuscenes_infos_val.pkl',
        pipeline=[
            dict(type='LoadMultiViewImageFromFiles', to_float32=True),
            dict(type='ResizeCropFlipImage', data_aug_conf=dict(resize_lim=(0.8, 1.0), final_dim=(512, 1408), bot_pct_lim=(0.0, 0.0), rot_lim=(0.0, 0.0), H=900, W=1600, rand_flip=True), training=False),
            dict(type='NormalizeMultiviewImage', mean=[103.53, 116.28, 123.675], std=[1.0, 1.0, 1.0], to_rgb=False),
            dict(type='PadMultiViewImage', size_divisor=32),
            dict(type='MultiScaleFlipAug3D', img_scale=(1333, 800), pts_scale_ratio=1, flip=False, transforms=[
                dict(type='DefaultFormatBundle3D', class_names=['car', 'truck', 'construction_vehicle', 'bus', 'trailer', 'barrier', 'motorcycle', 'bicycle', 'pedestrian', 'traffic_cone'], with_label=False),
                dict(type='Collect3D', keys=['img'])
            ])
        ],
        classes=['car', 'truck', 'construction_vehicle', 'bus', 'trailer', 'barrier', 'motorcycle', 'bicycle', 'pedestrian', 'traffic_cone'],
        modality=dict(use_lidar=False, use_camera=True, use_radar=False, use_map=False, use_external=False),
        test_mode=True,
        box_type_3d='LiDAR'))
evaluation = dict(
    interval=1,
    pipeline=[
        dict(type='LoadMultiViewImageFromFiles', to_float32=True),
        dict(type='ResizeCropFlipImage', data_aug_conf=dict(resize_lim=(0.8, 1.0), final_dim=(512, 1408), bot_pct_lim=(0.0, 0.0), rot_lim=(0.0, 0.0), H=900, W=1600, rand_flip=True), training=False),
        dict(type='NormalizeMultiviewImage', mean=[103.53, 116.28, 123.675], std=[1.0, 1.0, 1.0], to_rgb=False),
        dict(type='PadMultiViewImage', size_divisor=32),
        dict(type='MultiScaleFlipAug3D', img_scale=(1333, 800), pts_scale_ratio=1, flip=False, transforms=[
            dict(type='DefaultFormatBundle3D', class_names=['car', 'truck', 'construction_vehicle', 'bus', 'trailer', 'barrier', 'motorcycle', 'bicycle', 'pedestrian', 'traffic_cone'], with_label=False),
            dict(type='Collect3D', keys=['img'])
        ])
    ])
checkpoint_config = dict(interval=1)
log_config = dict(interval=50, hooks=[dict(type='TextLoggerHook'), dict(type='TensorboardLoggerHook')])
dist_params = dict(backend='nccl')
log_level = 'INFO'
work_dir = 'work_dirs/petr_r50dcn_gridmask_p4/'
load_from = None
resume_from = None
workflow = [('train', 1)]
opencv_num_threads = 0
mp_start_method = 'fork'
backbone_norm_cfg = dict(type='LN', requires_grad=True)
plugin = True
plugin_dir = 'projects/mmdet3d_plugin/'
voxel_size = [0.2, 0.2, 8]
img_norm_cfg = dict(mean=[103.53, 116.28, 123.675], std=[1.0, 1.0, 1.0], to_rgb=False)
model = dict(
    type='Petr3D',
    use_grid_mask=True,
    img_backbone=dict(
        type='ResNet', depth=50, num_stages=4, out_indices=(2, 3), frozen_stages=-1,
        norm_cfg=dict(type='BN2d', requires_grad=False), norm_eval=True, style='caffe', with_cp=True,
        dcn=dict(type='DCNv2', deform_groups=1, fallback_on_stride=False),
        stage_with_dcn=(False, False, True, True),
        pretrained='ckpts/resnet50_msra-5891d200.pth'),
    img_neck=dict(type='CPFPN', in_channels=[1024, 2048], out_channels=256, num_outs=2),
    pts_bbox_head=dict(
        type='PETRHead', num_classes=10, in_channels=256, num_query=900, LID=True,
        with_position=True, with_multiview=True,
        position_range=[-61.2, -61.2, -10.0, 61.2, 61.2, 10.0],
        normedlinear=False,
        transformer=dict(
            type='PETRTransformer',
            decoder=dict(
                type='PETRTransformerDecoder', return_intermediate=True, num_layers=6,
                transformerlayers=dict(
                    type='PETRTransformerDecoderLayer',
                    attn_cfgs=[
                        dict(type='MultiheadAttention', embed_dims=256, num_heads=8, dropout=0.1),
                        dict(type='PETRMultiheadAttention', embed_dims=256, num_heads=8, dropout=0.1)
                    ],
                    feedforward_channels=2048, ffn_dropout=0.1, with_cp=True,
                    operation_order=('self_attn', 'norm', 'cross_attn', 'norm', 'ffn', 'norm')))),
        bbox_coder=dict(
            type='NMSFreeCoder',
            post_center_range=[-61.2, -61.2, -10.0, 61.2, 61.2, 10.0],
            pc_range=[-51.2, -51.2, -5.0, 51.2, 51.2, 3.0],
            max_num=300, voxel_size=[0.2, 0.2, 8], num_classes=10),
        positional_encoding=dict(type='SinePositionalEncoding3D', num_feats=128, normalize=True),
        loss_cls=dict(type='FocalLoss', use_sigmoid=True, gamma=2.0, alpha=0.25, loss_weight=2.0),
        loss_bbox=dict(type='L1Loss', loss_weight=0.25),
        loss_iou=dict(type='GIoULoss', loss_weight=0.0)),
    train_cfg=dict(
        pts=dict(
            grid_size=[512, 512, 1],
            voxel_size=[0.2, 0.2, 8],
            point_cloud_range=[-51.2, -51.2, -5.0, 51.2, 51.2, 3.0],
            out_size_factor=4,
            assigner=dict(
                type='HungarianAssigner3D',
                cls_cost=dict(type='FocalLossCost', weight=2.0),
                reg_cost=dict(type='BBox3DL1Cost', weight=0.25),
                iou_cost=dict(type='IoUCost', weight=0.0),
                pc_range=[-51.2, -51.2, -5.0, 51.2, 51.2, 3.0]))))
db_sampler = dict()
ida_aug_conf = dict(resize_lim=(0.8, 1.0), final_dim=(512, 1408), bot_pct_lim=(0.0, 0.0), rot_lim=(0.0, 0.0), H=900, W=1600, rand_flip=True)
optimizer = dict(type='AdamW', lr=0.0002, paramwise_cfg=dict(custom_keys=dict(img_backbone=dict(lr_mult=0.1))), weight_decay=0.01)
optimizer_config = dict(type='Fp16OptimizerHook', loss_scale=512.0, grad_clip=dict(max_norm=35, norm_type=2))
lr_config = dict(policy='CosineAnnealing', warmup='linear', warmup_iters=500, warmup_ratio=0.3333333333333333, min_lr_ratio=0.001)
total_epochs = 24
find_unused_parameters = False
runner = dict(type='EpochBasedRunner', max_epochs=24)
gpu_ids = range(0, 8)

yingfei1016 commented 1 year ago

Hi,

The config looks fine. Can you tell me how many GPUs you used and which versions of Python and mmdet3d you are running? Python 3.8 may reduce performance somewhat.
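A quick way to collect the requested version information is a short script like the one below. It is only a sketch: the helper names `installed_version` and `environment_report` are made up for this example, and the distribution names in the default tuple (`torch`, `mmcv-full`, `mmdet`, `mmdet3d`) are assumptions about how the packages were installed.

```python
import sys

try:
    from importlib import metadata as importlib_metadata  # Python 3.8+
except ImportError:
    importlib_metadata = None  # older Python: fall back to pkg_resources


def installed_version(dist_name):
    """Return the installed version of a distribution, or 'not installed'."""
    if importlib_metadata is not None:
        try:
            return importlib_metadata.version(dist_name)
        except importlib_metadata.PackageNotFoundError:
            return "not installed"
    import pkg_resources  # shipped with setuptools
    try:
        return pkg_resources.get_distribution(dist_name).version
    except pkg_resources.DistributionNotFound:
        return "not installed"


def environment_report(dist_names=("torch", "mmcv-full", "mmdet", "mmdet3d")):
    """Collect the Python version and the versions of the given packages."""
    report = {"python": sys.version.split()[0]}
    for name in dist_names:
        report[name] = installed_version(name)
    return report


for key, value in environment_report().items():
    print(f"{key}: {value}")
```

Pasting the printed lines into the issue, together with the GPU count, answers the question above in one go.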

maggiesong7 commented 1 year ago

I trained on 8 RTX 2080 Ti GPUs. I have trained the model with two different Python versions, 3.7.6 and 3.6.5, both with mmdet3d 1.0.0. Also, when I set with_position=False, the accuracy is extremely low: 0.0887 mAP and 0.2230 NDS. As I understand it, setting with_position=False is just an ablation of the 3D PE module. Can you explain that?

yingfei1016 commented 1 year ago

Hi,

(1) Since you are using mmdet3d 1.0, have you seen https://github.com/megvii-research/PETR/issues/71#issuecomment-1318191277 ? With that version, reverse_angle must be False in GlobalRotScaleTransImage. (2) Yes, setting with_position=False corresponds to a result in the ablation study. [image]

When with_position=False, the intrinsics and extrinsics are not used in the model. In fact, PETR can work without intrinsics and extrinsics, thanks to global attention. The low performance is mainly due to ResizeCropFlipImage and GlobalRotScaleTransImage. These augmentations change the intrinsics and extrinsics substantially during training, so the network cannot overfit to the camera parameters of the dataset. Once these augmentations are removed, ResNet-50 should reach more than 27% mAP. But we don't think it is meaningful to overfit the dataset.
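For point (1), the change amounts to flipping one flag in the training pipeline. The sketch below patches a pipeline list in plain Python. The helper `set_reverse_angle` is hypothetical (not part of PETR or mmdet3d), and the two-step pipeline is a shortened stand-in for the full one posted above.

```python
import copy


def set_reverse_angle(pipeline, value):
    """Return a copy of a pipeline with reverse_angle overridden
    on every GlobalRotScaleTransImage step."""
    patched = copy.deepcopy(pipeline)
    for step in patched:
        if step.get("type") == "GlobalRotScaleTransImage":
            step["reverse_angle"] = value
    return patched


# Shortened stand-in for the train pipeline from the posted config.
train_pipeline = [
    dict(type="LoadMultiViewImageFromFiles", to_float32=True),
    dict(
        type="GlobalRotScaleTransImage",
        rot_range=[-0.3925, 0.3925],
        translation_std=[0, 0, 0],
        scale_ratio_range=[0.95, 1.05],
        reverse_angle=True,
        training=True,
    ),
]

fixed_pipeline = set_reverse_angle(train_pipeline, False)
print(fixed_pipeline[1]["reverse_angle"])  # False
```

In a dumped config like the one above, the same step appears both in `train_pipeline` and inside `data['train']['pipeline']`. As far as I understand the mmdet-style config loading, the dataset is built from the latter, so that copy is the one that has to change.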

xiaosu-zhu commented 1 year ago


I noticed that StreamPETR still sets reverse_angle=True even though it uses mmdet3d==1.0.0rc6. Have I missed something?

yingfei1016 commented 1 year ago


The rotation matrix is different.
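The thread does not spell out what differs. As a generic illustration only (this is not PETR, StreamPETR, or mmdet3d code), here is one way a sign flag such as reverse_angle can become necessary: if one code path applies a z-rotation matrix to column vectors and another applies the same matrix to row vectors, the two rotate in opposite directions, and negating the angle in one of them restores agreement. All function names below are made up for the example.

```python
import math


def rot_z(theta):
    """3x3 rotation matrix about the z-axis (counter-clockwise for column vectors)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def apply_column(m, p):
    """Column-vector convention: p' = R p."""
    return [sum(m[i][j] * p[j] for j in range(3)) for i in range(3)]


def apply_row(p, m):
    """Row-vector convention: p' = p R, which equals R^T p."""
    return [sum(p[i] * m[i][j] for i in range(3)) for j in range(3)]


theta = 0.3925  # upper bound of rot_range in the posted config
point = [1.0, 0.0, 0.0]

col = apply_column(rot_z(theta), point)        # rotates by +theta
row_same = apply_row(point, rot_z(theta))      # rotates by -theta
row_flipped = apply_row(point, rot_z(-theta))  # rotates by +theta again

print(col, row_same, row_flipped)
```

Here `row_flipped` agrees with `col`, while `row_same` has the opposite sign in its y component. Whether a given codebase needs the sign flip therefore depends on how its rotation matrix is built and applied, which would be consistent with the short answer above.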

xiaosu-zhu commented 1 year ago


Thanks, got it. 👍

Vendulamrdka95 commented 1 year ago

https://github.com/megvii-research/PETR/issues/86#issue-1499621424
