tusen-ai / SST

Code for a series of work in LiDAR perception, including SST (CVPR 22), FSD (NeurIPS 22), FSD++ (TPAMI 23), FSDv2, and CTRL (ICCV 23, oral).
Apache License 2.0

About other datasets #18

Closed Zoeeeing closed 2 years ago

Zoeeeing commented 2 years ago

Hi, have you experimented on other outdoor datasets such as nuScenes? When I used SST to train on the nuScenes dataset, the results I got were not ideal. I only modified the voxel-size hyperparameters and replaced the head. Is there a problem with this setup? Thanks!

Abyssaledge commented 2 years ago

Thanks for using SST. No, we have not tried SST on nuScenes. But if you share your config and detailed results, maybe we can help you.

Zoeeeing commented 2 years ago

Thanks! The modified model is as follows:


voxel_size = (0.25, 0.25, 8)
window_shape = (16, 16, 1)
point_cloud_range = [-50, -50, -5, 50, 50, 3]
model = dict(
    type='DynamicVoxelNet',
    voxel_layer=dict(
        voxel_size=(0.25, 0.25, 8),
        max_num_points=-1,
        point_cloud_range=[-50, -50, -5, 50, 50, 3],
        max_voxels=(-1, -1)),
    voxel_encoder=dict(
        type='DynamicVFE',
        in_channels=4,
        feat_channels=[64, 128],
        with_distance=False,
        voxel_size=(0.25, 0.25, 8),
        with_cluster_center=True,
        with_voxel_center=True,
        point_cloud_range=[-50, -50, -5, 50, 50, 3],
        norm_cfg=dict(type='naiveSyncBN1d', eps=0.001, momentum=0.01)),
    middle_encoder=dict(
        type='SSTInputLayerV2',
        window_shape=(16, 16, 1),
        sparse_shape=(400, 400, 1),
        shuffle_voxels=True,
        debug=True,
        drop_info=({
            0: {
                'max_tokens': 100,
                'drop_range': (0, 100)
            },
            1: {
                'max_tokens': 200,
                'drop_range': (100, 200)
            },
            2: {
                'max_tokens': 250,
                'drop_range': (200, 10000)
            }
        }, {
            0: {
                'max_tokens': 100,
                'drop_range': (0, 100)
            },
            1: {
                'max_tokens': 200,
                'drop_range': (100, 200)
            },
            2: {
                'max_tokens': 256,
                'drop_range': (200, 10000)
            }
        }),
        pos_temperature=10000,
        normalize_pos=False),
    backbone=dict(
        type='SSTv2',
        d_model=[128, 128, 128, 128, 128, 128],
        nhead=[8, 8, 8, 8, 8, 8],
        num_blocks=6,
        dim_feedforward=[256, 256, 256, 256, 256, 256],
        output_shape=[400, 400],
        num_attached_conv=3,
        conv_kwargs=[
            dict(kernel_size=3, dilation=1, padding=1, stride=1),
            dict(kernel_size=3, dilation=1, padding=1, stride=1),
            dict(kernel_size=3, dilation=2, padding=2, stride=1)
        ],
        conv_in_channel=128,
        conv_out_channel=128,
        debug=True),
    neck=dict(
        type='SECONDFPN',
        norm_cfg=dict(type='naiveSyncBN2d', eps=0.001, momentum=0.01),
        in_channels=[128],
        upsample_strides=[1],
        out_channels=[384]),
    bbox_head=dict(
        type='Anchor3DHead',
        num_classes=10,
        in_channels=384,
        feat_channels=384,
        use_direction_classifier=True,
        anchor_generator=dict(
            type='AlignedAnchor3DRangeGenerator',
            ranges=[[-49.6, -49.6, -1.80032795, 49.6, 49.6, -1.80032795],
                    [-49.6, -49.6, -1.74440365, 49.6, 49.6, -1.74440365],
                    [-49.6, -49.6, -1.68526504, 49.6, 49.6, -1.68526504],
                    [-49.6, -49.6, -1.67339111, 49.6, 49.6, -1.67339111],
                    [-49.6, -49.6, -1.61785072, 49.6, 49.6, -1.61785072],
                    [-49.6, -49.6, -1.80984986, 49.6, 49.6, -1.80984986],
                    [-49.6, -49.6, -1.763965, 49.6, 49.6, -1.763965]],
            sizes=[[1.95017717, 4.60718145, 1.72270761],
                   [2.4560939, 6.73778078, 2.73004906],
                   [2.87427237, 12.01320693, 3.81509561],
                   [0.60058911, 1.68452161, 1.27192197],
                   [0.66344886, 0.7256437, 1.75748069],
                   [0.39694519, 0.40359262, 1.06232151],
                   [2.49008838, 0.48578221, 0.98297065]],
            custom_values=[0, 0],
            rotations=[0, 1.57],
            reshape_out=True),
        assigner_per_size=False,
        diff_rad_by_sin=True,
        dir_offset=0.7854,
        dir_limit_offset=0,
        bbox_coder=dict(type='DeltaXYZWLHRBBoxCoder', code_size=9),
        loss_cls=dict(
            type='FocalLoss',
            use_sigmoid=True,
            gamma=2.0,
            alpha=0.25,
            loss_weight=1.0),
        loss_bbox=dict(
            type='SmoothL1Loss', beta=0.1111111111111111, loss_weight=1.0),
        loss_dir=dict(
            type='CrossEntropyLoss', use_sigmoid=False, loss_weight=0.2)),
    train_cfg=dict(
        assigner=dict(
            type='MaxIoUAssigner',
            iou_calculator=dict(type='BboxOverlapsNearest3D'),
            pos_iou_thr=0.6,
            neg_iou_thr=0.3,
            min_pos_iou=0.3,
            ignore_iof_thr=-1),
        allowed_border=0,
        code_weight=[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.2, 0.2],
        pos_weight=-1,
        debug=False),
    test_cfg=dict(
        use_rotate_nms=True,
        nms_across_levels=False,
        nms_pre=1000,
        nms_thr=0.2,
        score_thr=0.05,
        min_bbox_size=0,
        max_num=500))
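
The grid settings in the config above can be cross-checked with a short script. This is only a sketch: `grid_shape` is a hypothetical helper written for illustration, not part of SST or MMDetection3D. It derives the voxel grid from `point_cloud_range` and `voxel_size` and compares it with the `sparse_shape` given in the config.

```python
import math

def grid_shape(point_cloud_range, voxel_size):
    """Number of voxels along x, y, z implied by the range and the voxel size."""
    x_min, y_min, z_min, x_max, y_max, z_max = point_cloud_range
    extents = (x_max - x_min, y_max - y_min, z_max - z_min)
    return tuple(round(e / v) for e, v in zip(extents, voxel_size))

# Values copied from the config above.
voxel_size = (0.25, 0.25, 8)
window_shape = (16, 16, 1)
point_cloud_range = [-50, -50, -5, 50, 50, 3]
sparse_shape = (400, 400, 1)

shape = grid_shape(point_cloud_range, voxel_size)
assert shape == sparse_shape, f"sparse_shape {sparse_shape} != derived {shape}"

# Windows per axis and the largest possible number of voxels (tokens) per window.
windows_per_axis = tuple(math.ceil(s / w) for s, w in zip(shape, window_shape))
max_tokens_per_window = window_shape[0] * window_shape[1] * window_shape[2]

print(shape, windows_per_axis, max_tokens_per_window)
# prints: (400, 400, 1) (25, 25, 1) 256
```

With these values the grid is consistent with `sparse_shape` and `output_shape`, and a full window holds 256 voxels, which can be compared with the largest `max_tokens` entries in `drop_info` (250 and 256).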

After training for 24 epochs, I got the following detailed results.

pts_bbox_NuScenes/car_AP_dist_0.5: 0.4701, pts_bbox_NuScenes/car_AP_dist_1.0: 0.6067, pts_bbox_NuScenes/car_AP_dist_2.0: 0.6618, pts_bbox_NuScenes/car_AP_dist_4.0: 0.6832, pts_bbox_NuScenes/car_trans_err: 0.2372, pts_bbox_NuScenes/car_scale_err: 0.1477, pts_bbox_NuScenes/car_orient_err: 0.1317, pts_bbox_NuScenes/car_vel_err: 0.2814, pts_bbox_NuScenes/car_attr_err: 0.2252,
pts_bbox_NuScenes/mATE: 0.4841, pts_bbox_NuScenes/mASE: 0.2709, pts_bbox_NuScenes/mAOE: 0.5280, pts_bbox_NuScenes/mAVE: 0.3700, pts_bbox_NuScenes/mAAE: 0.1962,
pts_bbox_NuScenes/truck_AP_dist_0.5: 0.0624, pts_bbox_NuScenes/truck_AP_dist_1.0: 0.2224, pts_bbox_NuScenes/truck_AP_dist_2.0: 0.3657, pts_bbox_NuScenes/truck_AP_dist_4.0: 0.3988, pts_bbox_NuScenes/truck_trans_err: 0.5955, pts_bbox_NuScenes/truck_scale_err: 0.2285, pts_bbox_NuScenes/truck_orient_err: 0.2259, pts_bbox_NuScenes/truck_vel_err: 0.2660, pts_bbox_NuScenes/truck_attr_err: 0.2360,
pts_bbox_NuScenes/trailer_AP_dist_0.5: 0.0000, pts_bbox_NuScenes/trailer_AP_dist_1.0: 0.0000, pts_bbox_NuScenes/trailer_AP_dist_2.0: 0.0073, pts_bbox_NuScenes/trailer_AP_dist_4.0: 0.0857, pts_bbox_NuScenes/trailer_trans_err: 0.9790, pts_bbox_NuScenes/trailer_scale_err: 0.2405, pts_bbox_NuScenes/trailer_orient_err: 0.9358, pts_bbox_NuScenes/trailer_vel_err: 0.3954, pts_bbox_NuScenes/trailer_attr_err: 0.1308,
pts_bbox_NuScenes/bus_AP_dist_0.5: 0.0105, pts_bbox_NuScenes/bus_AP_dist_1.0: 0.1396, pts_bbox_NuScenes/bus_AP_dist_2.0: 0.3895, pts_bbox_NuScenes/bus_AP_dist_4.0: 0.4736, pts_bbox_NuScenes/bus_trans_err: 0.7881, pts_bbox_NuScenes/bus_scale_err: 0.1895, pts_bbox_NuScenes/bus_orient_err: 0.1455, pts_bbox_NuScenes/bus_vel_err: 0.6699, pts_bbox_NuScenes/bus_attr_err: 0.1602,
pts_bbox_NuScenes/construction_vehicle_AP_dist_0.5: 0.0000, pts_bbox_NuScenes/construction_vehicle_AP_dist_1.0: 0.0036, pts_bbox_NuScenes/construction_vehicle_AP_dist_2.0: 0.0457, pts_bbox_NuScenes/construction_vehicle_AP_dist_4.0: 0.0629, pts_bbox_NuScenes/construction_vehicle_trans_err: 0.9470, pts_bbox_NuScenes/construction_vehicle_scale_err: 0.5084, pts_bbox_NuScenes/construction_vehicle_orient_err: 1.3642, pts_bbox_NuScenes/construction_vehicle_vel_err: 0.1244, pts_bbox_NuScenes/construction_vehicle_attr_err: 0.4645,
pts_bbox_NuScenes/bicycle_AP_dist_0.5: 0.0264, pts_bbox_NuScenes/bicycle_AP_dist_1.0: 0.0287, pts_bbox_NuScenes/bicycle_AP_dist_2.0: 0.0290, pts_bbox_NuScenes/bicycle_AP_dist_4.0: 0.0298, pts_bbox_NuScenes/bicycle_trans_err: 0.1875, pts_bbox_NuScenes/bicycle_scale_err: 0.2586, pts_bbox_NuScenes/bicycle_orient_err: 0.8511, pts_bbox_NuScenes/bicycle_vel_err: 0.3377, pts_bbox_NuScenes/bicycle_attr_err: 0.0047,
pts_bbox_NuScenes/motorcycle_AP_dist_0.5: 0.1205, pts_bbox_NuScenes/motorcycle_AP_dist_1.0: 0.1384, pts_bbox_NuScenes/motorcycle_AP_dist_2.0: 0.1415, pts_bbox_NuScenes/motorcycle_AP_dist_4.0: 0.1458, pts_bbox_NuScenes/motorcycle_trans_err: 0.2381, pts_bbox_NuScenes/motorcycle_scale_err: 0.2787, pts_bbox_NuScenes/motorcycle_orient_err: 0.7527, pts_bbox_NuScenes/motorcycle_vel_err: 0.6352, pts_bbox_NuScenes/motorcycle_attr_err: 0.3060,
pts_bbox_NuScenes/pedestrian_AP_dist_0.5: 0.5656, pts_bbox_NuScenes/pedestrian_AP_dist_1.0: 0.5758, pts_bbox_NuScenes/pedestrian_AP_dist_2.0: 0.5854, pts_bbox_NuScenes/pedestrian_AP_dist_4.0: 0.5960, pts_bbox_NuScenes/pedestrian_trans_err: 0.1403, pts_bbox_NuScenes/pedestrian_scale_err: 0.2611, pts_bbox_NuScenes/pedestrian_orient_err: 0.3074, pts_bbox_NuScenes/pedestrian_vel_err: 0.2499, pts_bbox_NuScenes/pedestrian_attr_err: 0.0425,
pts_bbox_NuScenes/traffic_cone_AP_dist_0.5: 0.0727, pts_bbox_NuScenes/traffic_cone_AP_dist_1.0: 0.0775, pts_bbox_NuScenes/traffic_cone_AP_dist_2.0: 0.0849, pts_bbox_NuScenes/traffic_cone_AP_dist_4.0: 0.1073, pts_bbox_NuScenes/traffic_cone_trans_err: 0.1638, pts_bbox_NuScenes/traffic_cone_scale_err: 0.3195, pts_bbox_NuScenes/traffic_cone_orient_err: nan, pts_bbox_NuScenes/traffic_cone_vel_err: nan, pts_bbox_NuScenes/traffic_cone_attr_err: nan,
pts_bbox_NuScenes/barrier_AP_dist_0.5: 0.0680, pts_bbox_NuScenes/barrier_AP_dist_1.0: 0.2386, pts_bbox_NuScenes/barrier_AP_dist_2.0: 0.3307, pts_bbox_NuScenes/barrier_AP_dist_4.0: 0.3615, pts_bbox_NuScenes/barrier_trans_err: 0.5643, pts_bbox_NuScenes/barrier_scale_err: 0.2763, pts_bbox_NuScenes/barrier_orient_err: 0.0374, pts_bbox_NuScenes/barrier_vel_err: nan, pts_bbox_NuScenes/barrier_attr_err: nan,
pts_bbox_NuScenes/NDS: 0.4278, pts_bbox_NuScenes/mAP: 0.2253
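
As a sanity check on the log above, the summary numbers can be recomputed from the per-class values. The sketch below assumes the standard nuScenes detection score definition, NDS = (5 * mAP + sum over the five mean TP errors of (1 - min(1, error))) / 10, with mAP taken as the mean AP over classes and distance thresholds. The helper names `mean_ap` and `nds` are illustrative only; the numbers are copied from the log.

```python
# Per-class AP at distance thresholds 0.5, 1.0, 2.0, 4.0 m (copied from the log).
ap = {
    'car': [0.4701, 0.6067, 0.6618, 0.6832],
    'truck': [0.0624, 0.2224, 0.3657, 0.3988],
    'trailer': [0.0000, 0.0000, 0.0073, 0.0857],
    'bus': [0.0105, 0.1396, 0.3895, 0.4736],
    'construction_vehicle': [0.0000, 0.0036, 0.0457, 0.0629],
    'bicycle': [0.0264, 0.0287, 0.0290, 0.0298],
    'motorcycle': [0.1205, 0.1384, 0.1415, 0.1458],
    'pedestrian': [0.5656, 0.5758, 0.5854, 0.5960],
    'traffic_cone': [0.0727, 0.0775, 0.0849, 0.1073],
    'barrier': [0.0680, 0.2386, 0.3307, 0.3615],
}
# Mean true-positive errors (copied from the log).
tp_errors = {'mATE': 0.4841, 'mASE': 0.2709, 'mAOE': 0.5280,
             'mAVE': 0.3700, 'mAAE': 0.1962}

def mean_ap(ap_per_class):
    """Average AP over thresholds per class, then over classes."""
    per_class = {c: sum(v) / len(v) for c, v in ap_per_class.items()}
    return sum(per_class.values()) / len(per_class), per_class

def nds(m_ap, errors):
    """nuScenes detection score from mAP and the five mean TP errors."""
    tp_scores = [1.0 - min(1.0, e) for e in errors]
    return (5.0 * m_ap + sum(tp_scores)) / 10.0

m_ap, per_class_ap = mean_ap(ap)
score = nds(m_ap, tp_errors.values())

# List classes from weakest to strongest to see where mAP is lost.
for name, value in sorted(per_class_ap.items(), key=lambda kv: kv[1]):
    print(f'{name:22s} {value:.4f}')
print(f'mAP {m_ap:.4f}  NDS {score:.4f}')
```

The recomputed mAP and NDS agree with the reported 0.2253 and 0.4278 to within rounding. The sorted listing shows that trailer, construction_vehicle, and bicycle contribute almost nothing to mAP, while car and pedestrian are much stronger.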

Abyssaledge commented 2 years ago

Your config looks fine to me. I am sorry that I do not have enough information to explain the poor results. We will try to run SST on nuScenes, but I cannot give a precise schedule for now. My suggestion is to debug each component (backbone, head, etc.) on a small subset of the data. For example, replace the anchor head with the center head to check whether the head module is correct.
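
One way to follow the "small subset" suggestion is to subsample the training split in the config. The fragment below is a hypothetical sketch, not something posted in this thread: it assumes the training dataset is MMDetection3D's `NuScenesDataset`, which accepts a `load_interval` argument, and that it sits directly under `data.train`. If the dataset is wrapped (for example in a repeat or class-balanced wrapper), the key has to go one level deeper.

```python
# Hypothetical debugging override (assumption, not from the original config):
# keep every 10th training sample so each component can be checked quickly.
debug_overrides = dict(
    data=dict(
        train=dict(
            load_interval=10,  # assumes a NuScenesDataset-style `load_interval`
        ),
    ),
)

print(debug_overrides['data']['train'])
```

The interval of 10 is an arbitrary illustrative value; any setting that makes one epoch take a few minutes serves the same purpose.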

Zoeeeing commented 2 years ago

OK. I will debug each component and check your results once you have run SST on nuScenes. Thanks for your work.

Devoe-97 commented 2 years ago

Hi, do you have more recent results on nuScenes? @Zoeeeing

Zoeeeing commented 2 years ago

@Devoe-97 Sorry, I could not get any better results.

gopi-erabati commented 2 years ago

@Abyssaledge Did you try running experiments on the nuScenes dataset? Since nuScenes has about 5 times fewer samples than Waymo, could training from scratch on so little data explain the poor results on nuScenes? (Transformers are data-hungry!) What do you think?

Abyssaledge commented 2 years ago

@gopi231091 I have not run the experiments on nuScenes yet. To my knowledge, SST is not that data-hungry. With 20% of the training data on Waymo, it performs better than the PointPillars baseline. However, its performance on nuScenes might be a little worse than the SOTA methods, because pillar-based models tend to show inferior performance on nuScenes, as many researchers have observed.