mit-han-lab / bevfusion

[ICRA'23] BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation
https://bevfusion.mit.edu
Apache License 2.0

custom dataset implementation #163

Closed RG2806 closed 1 year ago

RG2806 commented 1 year ago

Hey, thanks for this open-source work. I am training a fusion model on my custom dataset, and I wanted to understand the reasoning behind these extra max-pooling layers and how I can tell whether I need to do the same for my dataset's classes: https://github.com/mit-han-lab/bevfusion/blob/main/mmdet3d/models/heads/bbox/transfusion.py#L248

The same question applies here: https://github.com/mit-han-lab/bevfusion/blob/main/mmdet3d/models/heads/bbox/transfusion.py#L751

kentang-mit commented 1 year ago

I think the code snippet you mentioned is specialized for nuScenes and was designed by the original authors of TransFusion. For a custom dataset, I would suggest you get started with the CenterHead, which has fewer parameters to tune.

RG2806 commented 1 year ago

Hey, I want to train a fusion model. Would that be possible with CenterHead? In your repo I only see camera-only support for it. If possible, how should I proceed?

kentang-mit commented 1 year ago

Yes, that is possible. You only need to change the configuration of model.heads.object to the CenterHead configs; it will still work.

RG2806 commented 1 year ago

Hey, thanks for that, I'll experiment with it. Meanwhile, can you tell me about the 'tasks' entry in the CenterHead config?

https://github.com/mit-han-lab/bevfusion/blob/main/configs/nuscenes/det/centerhead/default.yaml#L28

kentang-mit commented 1 year ago

Sure. This is related to the CBGS paper. All classes are divided into different groups (and processed with different heads). The intuition is that the classes within each group usually have similar sizes.

For custom datasets, I think it will usually work if you just have one task head for all classes.
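For reference, the nuScenes grouping used by CBGS-style heads is roughly the sketch below, written in mmdet3d-style Python config syntax; treat the exact grouping as illustrative rather than a copy of this repo's YAML, and note that the custom-dataset class names at the end are placeholders.

    # nuScenes: classes of similar size share a task head (CBGS-style grouping).
    tasks = [
        dict(num_class=1, class_names=["car"]),
        dict(num_class=2, class_names=["truck", "construction_vehicle"]),
        dict(num_class=2, class_names=["bus", "trailer"]),
        dict(num_class=1, class_names=["barrier"]),
        dict(num_class=2, class_names=["motorcycle", "bicycle"]),
        dict(num_class=2, class_names=["pedestrian", "traffic_cone"]),
    ]

    # Custom dataset: a single task covering all classes is usually enough.
    tasks = [dict(num_class=3, class_names=["car", "pedestrian", "cyclist"])]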

RG2806 commented 1 year ago

Hey, thanks for clarifying. I made the changes like you said, but I got empty bboxes when running evaluation. I tried training on a single class. I am attaching the config.yaml and training log: 20221004_200442.log configs.txt. PS: I don't have sweeps or map data, so I commented out the corresponding parts of the data loading pipeline. What am I doing wrong?

kentang-mit commented 1 year ago

Hi @RG2806,

I actually mentioned in other issues that we follow a two-step training schedule: in the first stage we train a LiDAR-only model (CenterPoint or TransFusion will be OK), and we then finetune the camera+LiDAR BEVFusion model.

I would highly recommend that you also use such a schedule.

Best, Haotian

RG2806 commented 1 year ago

Hey, I used CenterPoint like you suggested, with the neck and backbone remaining the same. I trained a LiDAR-only model, but during training I noticed the loss didn't go down. This continued until the 6th epoch, after which the losses became NaN and the weights of the backbone also became NaN. I am attaching the training log and config. Can you guide me here?

Apart from this, when loading the nuScenes point cloud, load_dim=5. What are these 5 dimensions? Are they (x, y, z, intensity, ring) or (x, y, z, intensity, timestamp) or something else? Also, are these normalized?

configs.txt 20221005_183155.log

kentang-mit commented 1 year ago

For the five dimensions, they are (x, y, z, intensity, timestamp). I believe for intensity you usually just divide by the maximum possible value, and for timestamp we discretize it according to the relative frame index. For some datasets (such as Waymo), the maximum intensity value can be very large; in that case it is suggested to normalize this dimension using the tanh function.
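As a rough sketch of that kind of per-point normalization (not the exact loading code in this repo; the maximum intensity of 255 and the sweep-index handling are assumptions):

    import numpy as np

    def normalize_points(points: np.ndarray, max_intensity: float = 255.0,
                         use_tanh: bool = False) -> np.ndarray:
        # points: (N, 5) array of (x, y, z, intensity, timestamp).
        pts = points.copy()
        if use_tanh:
            # For datasets with very large intensity values (e.g. Waymo).
            pts[:, 3] = np.tanh(pts[:, 3])
        else:
            pts[:, 3] = pts[:, 3] / max_intensity
        # Timestamp: assumed to already hold the relative sweep/frame index
        # (0 for the keyframe, 1, 2, ... for older sweeps).
        return pts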

I had a look at your log, and I think you can probably start from a training config without GT paste and CBGS.

RG2806 commented 1 year ago

Hey @kentang-mit, I made a few changes: normalizing intensity with tanh and removing GT paste and CBGS. The loss is decreasing, at least for the first epoch. Can you shed a bit more light on the GT paste part, and how can I debug it?

kentang-mit commented 1 year ago

I think GT paste was proposed in the SECOND paper by Yan et al.; the idea is to create a database of all objects that appear in the training set and randomly paste some objects from the database onto each scene during training. GT paste gives some improvement on nuScenes, but I remember that it is small for the final fusion model (this was also observed by the authors of TransFusion).

To debug GT paste, I would suggest you visualize the point clouds you generated (corresponding to the objects in the database) and see whether they make sense (i.e., whether they look like real objects).
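A possible debugging sketch along those lines is below; it assumes the usual mmdet3d dbinfos layout (a pickle mapping class name to a list of entries whose 'path' points to the cropped object points), so verify the field names and paths against your own files.

    import pickle
    import numpy as np

    DBINFOS = "data/nuscenes/nuscenes_dbinfos_train.pkl"  # assumed path
    POINT_DIM = 5  # assumed point layout: (x, y, z, intensity, timestamp)

    with open(DBINFOS, "rb") as f:
        dbinfos = pickle.load(f)

    for cls, infos in dbinfos.items():
        if not infos:
            continue
        # 'path' may be relative to your data root.
        pts = np.fromfile(infos[0]["path"], dtype=np.float32).reshape(-1, POINT_DIM)
        extent = pts[:, :3].max(axis=0) - pts[:, :3].min(axis=0)
        print(f"{cls}: {pts.shape[0]} points, extent {extent}")
        # Dump xyz for a quick look in any point cloud viewer.
        np.savetxt(f"debug_{cls}.xyz", pts[:, :3])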

RG2806 commented 1 year ago

Hey @kentang-mit, thanks for the clarification. I'll proceed accordingly. I have another question, though: to finetune the fusion model with LiDAR-only weights, do we use the 'load_from' key in the config? If so, won't there be missing keys for the camera layers?

kentang-mit commented 1 year ago

Yes, you will see missing keys for camera layers, but it does not matter. These layers are initialized either from ImageNet-pretrained checkpoints or models pretrained on other tasks (e.g. 2D detection), which is specified in this field.
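In plain PyTorch terms this is just a non-strict state_dict load; a toy illustration (not the repo's actual loading code) of why the missing camera keys are harmless:

    import torch.nn as nn

    class LidarOnly(nn.Module):
        def __init__(self):
            super().__init__()
            self.lidar_backbone = nn.Linear(8, 8)

    class Fusion(nn.Module):
        def __init__(self):
            super().__init__()
            self.lidar_backbone = nn.Linear(8, 8)   # loaded from the LiDAR-only checkpoint
            self.camera_backbone = nn.Linear(8, 8)  # not in that checkpoint; initialized separately

    ckpt = LidarOnly().state_dict()
    missing, unexpected = Fusion().load_state_dict(ckpt, strict=False)
    print(missing)     # ['camera_backbone.weight', 'camera_backbone.bias'] -- safe to ignore
    print(unexpected)  # []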

RG2806 commented 1 year ago

Thanks for the details. Can you shed a bit more light on the test-time augmentation and model ensemble that you mentioned in your paper? Are these augmentations the same as in the config, and if so, how many copies did you create? And when you say model ensemble, which models did you use to generate predictions?

kentang-mit commented 1 year ago

For TTA in the offboard setting, we used double flipping and rotation augmentations. For the model ensemble, we used several models with different voxel resolutions and FPN architectures.
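As a rough illustration of the double-flip idea (run the model on the original points plus the x-, y-, and xy-flipped copies, map the predicted boxes back, then merge), here is a generic sketch; it assumes (x, y, z, l, w, h, yaw) boxes with a conventional right-handed yaw, which may need adapting to this codebase:

    import numpy as np

    def flip_points(points: np.ndarray, flip_x: bool = False, flip_y: bool = False) -> np.ndarray:
        pts = points.copy()
        if flip_x:
            pts[:, 0] = -pts[:, 0]
        if flip_y:
            pts[:, 1] = -pts[:, 1]
        return pts

    def unflip_boxes(boxes: np.ndarray, flip_x: bool = False, flip_y: bool = False) -> np.ndarray:
        # Map boxes predicted on flipped points back to the original frame.
        out = boxes.copy()
        if flip_x:
            out[:, 0] = -out[:, 0]
            out[:, 6] = np.pi - out[:, 6]
        if flip_y:
            out[:, 1] = -out[:, 1]
            out[:, 6] = -out[:, 6]
        return out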

RG2806 commented 1 year ago

Thanks for the clarification and open sourcing this awesome work.

RG2806 commented 1 year ago

Hey @kentang-mit,

I wanted to train a LiDAR model on my custom dataset. In my dataset, all the objects are in front of the LiDAR, so I changed the point cloud range to [0, -54.0, -5.0, 108.0, 54.0, 3.0]. But with this change the model errors out during the 1st epoch with a CUDA "invalid configuration" error in the hard_voxelize function. PS: what changes would I have to make to use dynamic voxelization instead of hard voxelization?

thanks

kentang-mit commented 1 year ago

Dynamic voxelization would require some changes to the code and we actually did not use the version in mmdetection3d. Do you have more information about the error? For example, are you close to running out of memory?
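Independent of the OOM question, one quick sanity check when the point cloud range changes is that the range divided by the voxel size still gives an integer (and reasonably sized) grid; a small sketch, with the voxel size below being an assumption to replace with the value from your config:

    import numpy as np

    point_cloud_range = np.array([0.0, -54.0, -5.0, 108.0, 54.0, 3.0])
    voxel_size = np.array([0.075, 0.075, 0.2])  # assumed; take this from your config

    grid = (point_cloud_range[3:] - point_cloud_range[:3]) / voxel_size
    print(grid)  # e.g. [1440. 1440. 40.] -- should be whole numbers
    assert np.allclose(grid, np.round(grid)), "range is not an integer multiple of voxel size"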

RG2806 commented 1 year ago

Hey @kentang-mit, here is the error traceback:

    Traceback (most recent call last):
      File "tools/train.py", line 84, in <module>
        main()
      File "tools/train.py", line 74, in main
        train_model(
      File "/bevfusion/mmdet3d/apis/train.py", line 136, in train_model
        runner.run(data_loaders, [("train", 1)])
      File "/opt/conda/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
        epoch_runner(data_loaders[i], **kwargs)
      File "/bevfusion/mmdet3d/runner/epoch_based_runner.py", line 14, in train
        super().train(data_loader, **kwargs)
      File "/opt/conda/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 50, in train
        self.run_iter(data_batch, train_mode=True, **kwargs)
      File "/opt/conda/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 29, in run_iter
        outputs = self.model.train_step(data_batch, self.optimizer,
      File "/opt/conda/lib/python3.8/site-packages/mmcv/parallel/distributed.py", line 52, in train_step
        output = self.module.train_step(*inputs[0], **kwargs[0])
      File "/bevfusion/mmdet3d/models/fusion_models/base.py", line 78, in train_step
        losses = self(**data)
      File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
        return forward_call(*input, **kwargs)
      File "/opt/conda/lib/python3.8/site-packages/mmcv/runner/fp16_utils.py", line 128, in new_func
        output = old_func(*new_args, **new_kwargs)
      File "/bevfusion/mmdet3d/models/fusion_models/bevfusion.py", line 188, in forward
        outputs = self.forward_single(
      File "/opt/conda/lib/python3.8/site-packages/mmcv/runner/fp16_utils.py", line 128, in new_func
        output = old_func(*new_args, **new_kwargs)
      File "/bevfusion/mmdet3d/models/fusion_models/bevfusion.py", line 245, in forward_single
        feature = self.extract_lidar_features(points)
      File "/bevfusion/mmdet3d/models/fusion_models/bevfusion.py", line 130, in extract_lidar_features
        feats, coords, sizes = self.voxelize(x)
      File "/opt/conda/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
        return func(*args, **kwargs)
      File "/opt/conda/lib/python3.8/site-packages/mmcv/runner/fp16_utils.py", line 214, in new_func
        output = old_func(*new_args, **new_kwargs)
      File "/bevfusion/mmdet3d/models/fusion_models/bevfusion.py", line 140, in voxelize
        ret = self.encoders["lidar"]["voxelize"](x)
      File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
        return forward_call(*input, **kwargs)
      File "/bevfusion/mmdet3d/ops/voxel/voxelize.py", line 131, in forward
        return voxelization(
      File "/bevfusion/mmdet3d/ops/voxel/voxelize.py", line 55, in forward
        voxel_num = hard_voxelize(
    RuntimeError: CUDA error: invalid configuration argument
    CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
    For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

I'm attaching my config file as well; I made changes so that objects are only in front of the LiDAR: configs.txt

kentang-mit commented 1 year ago

Thanks for the information @RG2806. Would you mind launching a separate terminal and running watch nvidia-smi to see whether you are close to OOM on the GPU while the code is running?

RG2806 commented 1 year ago

I will try to implement what you said. I also noticed one more thing during training of the LiDAR-only model: the grad_norm always stays around 400 and does not go down. The change I made was reducing the lr to 1e-6, because I use 1 GPU with samples_per_gpu=1 and workers_per_gpu=4. Apart from this, I am attaching the log here. Can you help me understand why this is so?

I also noticed in other issues you mentioned that your coordinate system is different from mmdet3d's; can you tell me how?

20221124_195137.log

kentang-mit commented 1 year ago

@RG2806, there is more than one difference, and I'm not sure I remember all of them.

lr=1e-6 looks clearly too small to me, so you can try increasing the learning rate.
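One common way to pick a starting point is the linear scaling heuristic (scale a reference learning rate by the ratio of total batch sizes); the numbers below are placeholders rather than values from this repo's configs:

    ref_lr = 1.0e-4          # assumed learning rate of the reference config
    ref_total_batch = 8 * 4  # assumed: 8 GPUs x 4 samples per GPU
    my_total_batch = 1 * 1   # 1 GPU, samples_per_gpu=1

    my_lr = ref_lr * my_total_batch / ref_total_batch
    print(my_lr)  # ~3.1e-06; with a batch size of 1 the heuristic is shaky,
                  # so something noticeably larger than 1e-6 is still worth trying.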

RG2806 commented 1 year ago

@kentang-mit, can you clarify the following things about the coordinate system then:

1) The point cloud coordinate system: is it the same as the mmdet3d LiDAR coordinate system, i.e. x front, y left, and z up?

2) The gt bbox coordinate system: is it the same as the above? Which directions do w, h, l correspond to, and around which axis is yaw zero?

3) Will I have to make any other changes apart from the above? My dataset corresponds to the new version of mmdet3d.

kentang-mit commented 1 year ago

Hi @RG2806,

I will try my best to recall these details (as we implemented these parts maybe 7-8 months ago).

  1. Your understanding of the coordinate system is correct, and here is the visualization from the original mmdet3d repo we cloned around one year ago.

  2. The conversion between yaw in our codebase (r) and in the latest mmdetection3d (r1) is given by:

     r = -np.pi / 2 - r1

     You can derive the axis relationships according to the visualizations in the official mmdet3d repo here. For box dimensions, we store them in the order lwh, and I believe l = x_size, w = y_size, h = z_size; you may double-check that. (A small conversion sketch follows after this list.)

  3. The most important change other than the coordinate system modification is that we switched from the zyx voxelization in previous mmdet3d to xyz voxelization. In this case, the xy BEV coordinates in both CenterHead and TransFusionHead are transposed compared with the mmdet3d implementation, so I would suggest you directly use our heads and only take the dataloader from the latest mmdet3d.

Hope that my explanations are helpful. Here are also some tips for debugging: it will be helpful to start from LiDAR-only models and see whether the results can match those in the official papers (e.g. CenterPoint and TransFusion).
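A minimal conversion sketch based on the yaw formula above (the l/w/h handling is left untouched and should be double-checked as noted):

    import numpy as np

    def yaw_latest_mmdet3d_to_this_codebase(r1: np.ndarray) -> np.ndarray:
        # r = -pi/2 - r1; the map is its own inverse, so the same
        # formula also converts r back to r1.
        return -np.pi / 2 - r1

    def convert_boxes(boxes: np.ndarray) -> np.ndarray:
        # boxes: (N, 7) array of (x, y, z, l, w, h, yaw) in the latest
        # mmdet3d convention; only the yaw column is changed here.
        out = boxes.copy()
        out[:, 6] = yaw_latest_mmdet3d_to_this_codebase(out[:, 6])
        return out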

RG2806 commented 1 year ago

Thanks for the info

RG2806 commented 1 year ago

Hey @kentang-mit, my custom dataset has more than 11,000 images and point clouds. Because of this, the model is using too much RAM while the GPU is relatively free. Can I migrate a few of the steps to the GPU, and if so, where should I start?

kentang-mit commented 1 year ago

Hi @RG2806, is it possible for you to elaborate more on "using too much RAM"? For example, are you trying to load everything into memory first? Besides, which operations would you like to move to the GPUs? For example, are they preprocessing operations or data loading operations during training?

VeeranjaneyuluToka commented 1 year ago

Hi, this is an interesting discussion. Sorry for deviating a bit; I am also working towards the same goal (feeding in a custom dataset). I used a customized 3D batch annotation tool (which I modified myself to work with my own data), and my annotations look like this:

**{"name":"000001","timestamp":0,"index":1,"labels":[{"id":0,"category":"sailboat","box3d":{"dimension":{"width":7.44,"length":6.765815099306366,"height":16.95},"location":{"x":45.24417122391511,"y":79.68626672516416,"z":6.632080000000001},"orientation":{"rotationYaw":0,"rotationPitch":0,"rotationRoll":0}}}]}

{"name":"000000","timestamp":0,"index":0,"labels":[{"id":0,"category":"sailboat","box3d":{"dimension":{"width":7.44,"length":3.54,"height":16.95},"location":{"x":42.9372728225865,"y":78.29214534179232,"z":6.63208},"orientation":{"rotationYaw":0,"rotationPitch":0,"rotationRoll":0}}}]}**

I am analysing further how I can feed these annotations to BEVFusion (basically I need to convert them from this format to a format BEVFusion understands). Approach 1: convert or enhance the above annotations into nuScenes annotations (i.e. the database schema that they have). Approach 2: just create the input parameters that BEVFusion requires.

I am still working on this, so could any of you recommend a simple approach? @RG2806, it would be a great help if you could describe the approach you followed to create annotations. @kentang-mit, it would be really helpful if you could share any suggestions on this.

RG2806 commented 1 year ago

Hey @VeeranjaneyuluToka, my approach is similar to your first option. I created a new create_info script to create pickle files in the nuScenes format, and a separate dataset wrapper for my dataset with a different evaluation function.

kentang-mit commented 1 year ago

Hi @VeeranjaneyuluToka,

Sorry for the late response; I was working on other projects recently. I agree with @RG2806 on the choice of Approach 1. In my opinion your annotation format looks pretty similar to nuScenes. You should be able to reuse the code starting from line 245 in this file. You could specify a zero velocity if there is no velocity annotation.

Best, Haotian
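A rough sketch of that first conversion step for the JSON records shown earlier (field names follow those records; the (x, y, z, l, w, h, yaw) ordering, the yaw convention, and the file name are assumptions to verify):

    import json
    import numpy as np

    def record_to_gt(record: dict):
        boxes, names, velocities = [], [], []
        for label in record["labels"]:
            dim = label["box3d"]["dimension"]
            loc = label["box3d"]["location"]
            yaw = label["box3d"]["orientation"]["rotationYaw"]
            boxes.append([loc["x"], loc["y"], loc["z"],
                          dim["length"], dim["width"], dim["height"], yaw])
            names.append(label["category"])
            velocities.append([0.0, 0.0])  # no velocity annotation -> zeros, as suggested above
        return np.asarray(boxes), np.asarray(names), np.asarray(velocities)

    with open("000000.json") as f:  # hypothetical file name
        gt_boxes, gt_names, gt_velocity = record_to_gt(json.load(f))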

VeeranjaneyuluToka commented 1 year ago

@kentang-mit and @RG2806, thanks for your comments. I am a bit unsure how to get the ego_pose in my case. I had a look at the description in the nuScenes data format section, which reads: "Ego vehicle pose at a particular timestamp. Given with respect to global coordinate system of the log's map. The ego_pose is the output of a lidar map-based localization algorithm described in our paper. The localization is 2-dimensional in the x-y plane." I am not sure I understand it (especially the lidar map-based localization algorithm; I tried to find it in the paper but could not figure out where exactly it is described). Could you please give more details on how it can be computed for a custom dataset?

Calibrated_sensor: I believe these are the camera and LiDAR calibration parameters in the case where we use just a camera and LiDAR, i.e. extrinsics for both the camera and the LiDAR, and intrinsics for the camera. Is that right?

kentang-mit commented 1 year ago

Hi @VeeranjaneyuluToka,

The reason why you want to have the ego pose is to align LiDAR scans from multiple sweeps in the same coordinate system. If you start with single-frame LiDAR + camera, I think you do not need the ego pose. If you really want to get the ego pose, you might need to consult the nuScenes team; I'm sorry that I'm not an expert in that. For extrinsics and intrinsics, your understanding is correct: basically, they are used to obtain the camera->LiDAR transformation.

Best, Haotian
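For what it's worth, the sweep alignment that the ego pose enables looks roughly like the generic chain below (lidar -> ego -> global for the sweep, then the inverse chain for the keyframe); this is a sketch of the standard transform composition, not code from this repo:

    import numpy as np

    def sweep_to_keyframe(points, lidar2ego_s, ego2global_s, lidar2ego_k, ego2global_k):
        # points: (N, 3) sweep points; the other arguments are 4x4 homogeneous transforms.
        T = np.linalg.inv(ego2global_k @ lidar2ego_k) @ (ego2global_s @ lidar2ego_s)
        pts_h = np.concatenate([points, np.ones((len(points), 1))], axis=1)
        return (pts_h @ T.T)[:, :3]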

kentang-mit commented 1 year ago

Closed due to inactivity. Please feel free to reopen if you feel it necessary.

VeeranjaneyuluToka commented 1 year ago

@RG2806, how did you solve the above memory error when you changed the point cloud range?