mit-han-lab / bevfusion

[ICRA'23] BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation
https://bevfusion.mit.edu
Apache License 2.0

Use fusion model to infer on custom application #459

Closed · GDMG99 closed this 4 months ago

GDMG99 commented 1 year ago

Hi! Thank you, authors, for your amazing work! I have been working with BEVFusion for some time, and I would like to integrate it into a detection pipeline for a custom application. I have already trained and evaluated the model on several nuScenes-like datasets. This time I have one LiDAR and two cameras, and I am already working on converting the data format to match the nuScenes one. I have been looking at the class BEVFusion(Base3DFusionModel) (here), but it requires encoders, fuser, decoder, and heads dictionaries as inputs. Are these dictionaries the same as those in the config yaml files? From the BEVFusion class, I believe I should use the forward function. However, I cannot see where to pass in the model weights or the config file. Could someone help me out here? Thanks a lot in advance!
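For context, the closest recipe I have found is in tools/test.py, which (if I read it correctly) builds the model from the config and then loads the checkpoint. A rough sketch of that route, with placeholder paths, and assuming I understand the helpers correctly:

from mmcv import Config
from mmcv.runner import load_checkpoint
from torchpack.utils.config import configs

from mmdet3d.models import build_model
from mmdet3d.utils import recursive_eval

# The repo's yaml configs inherit from each other, so they are resolved
# recursively via torchpack before being handed to mmcv.
config_path = "configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/convfuser.yaml"
configs.load(config_path, recursive=True)
cfg = Config(recursive_eval(configs), filename=config_path)

# cfg.model holds the encoders/fuser/decoder/heads dictionaries;
# build_model instantiates BEVFusion from them.
model = build_model(cfg.model)
load_checkpoint(model, "pretrained/bevfusion-det.pth", map_location="cpu")  # placeholder checkpoint path
model.eval()

What I cannot figure out is how to do the equivalent inside my own pipeline, without the nuScenes dataloader around it.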

GDMG99 commented 1 year ago

Also, I have been trying to initialize the class BEVFusion(Base3DFusionModel) directly. The dictionaries I am using to initialize it come from the config file (.yaml) of the pretrained fusion object-detection model; a sketch of what I am doing follows the listing below.

  decoder:
    backbone:
      conv_cfg:
        bias: false
        type: Conv2d
      in_channels: 256
      layer_nums:
      - 5
      - 5
      layer_strides:
      - 1
      - 2
      norm_cfg:
        eps: 0.001
        momentum: 0.01
        type: BN
      out_channels:
      - 128
      - 256
      type: SECOND
    neck:
      in_channels:
      - 128
      - 256
      norm_cfg:
        eps: 0.001
        momentum: 0.01
        type: BN
      out_channels:
      - 256
      - 256
      type: SECONDFPN
      upsample_cfg:
        bias: false
        type: deconv
      upsample_strides:
      - 1
      - 2
      use_conv_for_no_stride: true
  encoders:
    camera:
      backbone:
        attn_drop_rate: 0.0
        convert_weights: true
        depths:
        - 2
        - 2
        - 6
        - 2
        drop_path_rate: 0.2
        drop_rate: 0.0
        embed_dims: 96
        init_cfg:
          checkpoint: pretrained/swint-nuimages-pretrained.pth
          type: Pretrained
        mlp_ratio: 4
        num_heads:
        - 3
        - 6
        - 12
        - 24
        out_indices:
        - 1
        - 2
        - 3
        patch_norm: true
        qk_scale: null
        qkv_bias: true
        type: SwinTransformer
        window_size: 7
        with_cp: false
      neck:
        act_cfg:
          inplace: true
          type: ReLU
        in_channels:
        - 192
        - 384
        - 768
        norm_cfg:
          requires_grad: true
          type: BN2d
        num_outs: 3
        out_channels: 256
        start_level: 0
        type: GeneralizedLSSFPN
        upsample_cfg:
          align_corners: false
          mode: bilinear
      vtransform:
        dbound:
        - 1.0
        - 60.0
        - 0.5
        downsample: 2
        feature_size:
        - 32
        - 88
        image_size:
        - 256
        - 704
        in_channels: 256
        out_channels: 80
        type: DepthLSSTransform
        xbound:
        - -54.0
        - 54.0
        - 0.3
        ybound:
        - -54.0
        - 54.0
        - 0.3
        zbound:
        - -10.0
        - 10.0
        - 20.0
    lidar:
      backbone:
        block_type: basicblock
        encoder_channels:
        - - 16
          - 16
          - 32
        - - 32
          - 32
          - 64
        - - 64
          - 64
          - 128
        - - 128
          - 128
        encoder_paddings:
        - - 0
          - 0
          - 1
        - - 0
          - 0
          - 1
        - - 0
          - 0
          - - 1
            - 1
            - 0
        - - 0
          - 0
        in_channels: 5
        order:
        - conv
        - norm
        - act
        output_channels: 128
        sparse_shape:
        - 1440
        - 1440
        - 41
        type: SparseEncoder
      voxelize:
        max_num_points: 10
        max_voxels:
        - 120000
        - 160000
        point_cloud_range:
        - -54.0
        - -54.0
        - -5.0
        - 54.0
        - 54.0
        - 3.0
        voxel_size:
        - 0.075
        - 0.075
        - 0.2
  fuser:
    in_channels:
    - 80
    - 256
    out_channels: 256
    type: ConvFuser
  heads:
    map: null
    object:
      activation: relu
      auxiliary: true
      bbox_coder:
        code_size: 10
        out_size_factor: 8
        pc_range:
        - -54.0
        - -54.0
        post_center_range:
        - -61.2
        - -61.2
        - -10.0
        - 61.2
        - 61.2
        - 10.0
        score_threshold: 0.0
        type: TransFusionBBoxCoder
        voxel_size:
        - 0.075
        - 0.075
      bn_momentum: 0.1
      common_heads:
        center:
        - 2
        - 2
        dim:
        - 3
        - 2
        height:
        - 1
        - 2
        rot:
        - 2
        - 2
        vel:
        - 2
        - 2
      dropout: 0.1
      ffn_channel: 256
      hidden_channel: 128
      in_channels: 512
      loss_bbox:
        loss_weight: 0.25
        reduction: mean
        type: L1Loss
      loss_cls:
        alpha: 0.25
        gamma: 2.0
        loss_weight: 1.0
        reduction: mean
        type: FocalLoss
        use_sigmoid: true
      loss_heatmap:
        loss_weight: 1.0
        reduction: mean
        type: GaussianFocalLoss
      nms_kernel_size: 3
      num_classes: 10
      num_decoder_layers: 1
      num_heads: 8
      num_proposals: 200
      test_cfg:
        dataset: nuScenes
        grid_size:
        - 1440
        - 1440
        - 41
        nms_type: null
        out_size_factor: 8
        pc_range:
        - -54.0
        - -54.0
        voxel_size:
        - 0.075
        - 0.075
      train_cfg:
        assigner:
          cls_cost:
            alpha: 0.25
            gamma: 2.0
            type: FocalLossCost
            weight: 0.15
          iou_calculator:
            coordinate: lidar
            type: BboxOverlaps3D
          iou_cost:
            type: IoU3DCost
            weight: 0.25
          reg_cost:
            type: BBoxBEVL1Cost
            weight: 0.25
          type: HungarianAssigner3D
        code_weights:
        - 1.0
        - 1.0
        - 1.0
        - 1.0
        - 1.0
        - 1.0
        - 1.0
        - 1.0
        - 0.2
        - 0.2
        dataset: nuScenes
        gaussian_overlap: 0.1
        grid_size:
        - 1440
        - 1440
        - 41
        min_radius: 2
        out_size_factor: 8
        point_cloud_range:
        - -54.0
        - -54.0
        - -5.0
        - 54.0
        - 54.0
        - 3.0
        pos_weight: -1
        voxel_size:
        - 0.075
        - 0.075
        - 0.2
      type: TransFusionHead
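
Concretely, I am initializing the class with something like the sketch below. The import path and the file name model.yaml (the resolved config dump shown above) are approximations on my side:

import yaml

from mmdet3d.models import BEVFusion

# model.yaml holds the resolved model section of the config, i.e. the
# encoders/fuser/decoder/heads dictionaries listed above.
with open("model.yaml") as f:
    model_cfg = yaml.safe_load(f)

model = BEVFusion(
    encoders=model_cfg["encoders"],
    fuser=model_cfg["fuser"],
    decoder=model_cfg["decoder"],
    heads=model_cfg["heads"],
)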

In principle, the class loads the encoders, decoder, and fuser correctly. However, with the heads I get the following error: AttributeError: TransFusionHead: 'dict' object has no attribute 'assigner', even though assigner is present in heads['object']['train_cfg']. Does anyone know how to initialize the class properly? I also still do not see where to pass in the desired weights during initialization. Thank you in advance.
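
My current guess is that TransFusionHead reads its sub-configs with attribute access (e.g. self.train_cfg.assigner), which works on mmcv's ConfigDict but not on the plain dicts that yaml.safe_load returns; that would explain the AttributeError. If that is right, wrapping the dictionaries before passing them in should get past the error, though I am not certain this is the intended route:

from mmcv import ConfigDict
from mmcv.runner import load_checkpoint

from mmdet3d.models import BEVFusion

# ConfigDict converts nested dicts recursively, so attribute-style access
# such as self.train_cfg.assigner works inside TransFusionHead.
model = BEVFusion(
    encoders=ConfigDict(model_cfg["encoders"]),
    fuser=ConfigDict(model_cfg["fuser"]),
    decoder=ConfigDict(model_cfg["decoder"]),
    heads=ConfigDict(model_cfg["heads"]),
)

# As for the weights, mmcv's load_checkpoint appears to be the standard way:
load_checkpoint(model, "pretrained/bevfusion-det.pth", map_location="cpu")  # placeholder path
model.eval()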

zhijian-liu commented 4 months ago

Adding support for a custom dataset is beyond the scope of this codebase, and unfortunately, we don't have the capacity to accommodate such customized requests.