mit-han-lab / bevfusion

[ICRA'23] BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation
https://bevfusion.mit.edu
Apache License 2.0

How can I get images projected from multiple cameras to BEV? #195

Closed · Ehangiun closed this issue 1 year ago

Ehangiun commented 1 year ago

Hello, thank you for your contributions to BEVFusion. After reading your code, I found the class 'BaseDepthTransform' in 'bevfusion/mmdet3d/models/vtransforms/base.py'. 'get_geometry' seems to create a pseudo point cloud from the image, as your paper describes, and the variable 'depth' holds the depth obtained by projecting the point cloud into the image. 'get_cam_feats' then combines the image features with this depth information and outputs a tensor of shape (B, N, D, H, W, C), which is then pooled down onto the BEV grid using the geometry from 'get_geometry'. How can I get the images from multiple cameras projected into BEV?
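My current understanding of this shape flow, written as a minimal sketch assuming an LSS-style lift (per-pixel features weighted by a predicted per-pixel depth distribution); this is illustrative only, not the repository code:

```python
# Minimal sketch (not the repository code) of the shape flow described above:
# per-pixel image features are weighted by a per-pixel depth distribution to
# form a (B, N, D, H, W, C) tensor of "lifted" features.
import torch

B, N, C, D, H, W = 1, 6, 80, 118, 32, 88     # batch, cameras, channels, depth bins, feature H/W

feats = torch.randn(B, N, C, H, W)           # per-camera image features
depth_logits = torch.randn(B, N, D, H, W)    # predicted depth distribution per pixel
depth = depth_logits.softmax(dim=2)          # normalize over the D depth bins

# outer product: every pixel feature is "lifted" to D depth hypotheses
lifted = depth.unsqueeze(-1) * feats.permute(0, 1, 3, 4, 2).unsqueeze(2)
print(lifted.shape)                          # (1, 6, 118, 32, 88, 80) = (B, N, D, H, W, C)
```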

kentang-mit commented 1 year ago

I'm not sure whether I understand your questions correctly, but are you saying that you want to visualize these projected images in the RGB space?

Ehangiun commented 1 year ago

> I'm not sure whether I understand your questions correctly, but are you saying that you want to visualize these projected images in the RGB space?

Thank you for your reply. I mean visualizing these projected images in the bird's-eye-view space, i.e., projecting the images into a bird's-eye view.

kentang-mit commented 1 year ago

I see. What we project to the BEV space is actually camera features, not RGB pixels, so we do not have an intuitive visualization for that.

Ehangiun commented 1 year ago

Because the accuracy of my camera-only test was low, I wanted to see what the camera features look like in the bird's-eye view. I have tested the effect on single images from the dataset, but when I pass in the intrinsic and extrinsic parameters of my own camera, the accuracy drops dramatically.

kentang-mit commented 1 year ago

I see. In this case I would suggest visualizing the predictions from camera-only models directly. We did that to verify whether we implemented the projections correctly on datasets other than nuScenes. We provide tools/visualize.py to help you with that.
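A typical invocation follows the pattern below; the config and checkpoint paths are placeholders, and the exact flag names should be double-checked against the script's argument parser:

```bash
torchpack dist-run -np 1 python tools/visualize.py \
    [path/to/config.yaml] \
    --mode pred \
    --checkpoint [path/to/checkpoint.pth] \
    --bbox-score 0.1 \
    --out-dir viz/
```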

Ehangiun commented 1 year ago

> I see. In this case I would suggest visualizing the predictions from camera-only models directly. We did that to verify whether we implemented the projections correctly on datasets other than nuScenes. We provide tools/visualize.py to help you with that.

Thanks, this helps me solve the problem. By the way, I would like to know how you handle the mapping between the multiple cameras in the bird's-eye-view space. Is it a fixed-angle mapping, or a separate mapping for each camera?

kentang-mit commented 1 year ago

It is a separate mapping relationship for each camera. This function will be very helpful for you to understand the mapping between camera and LiDAR coordinate systems: https://github.com/mit-han-lab/bevfusion/blob/0e5b9edbc135bf297f6e3323249f7165b232c925/mmdet3d/models/vtransforms/base.py#L79.
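To illustrate the idea with a simplified sketch (my own naming, not the function in base.py, which additionally undoes the image-augmentation transforms first): each camera's frustum points are unprojected with that camera's own intrinsics and then moved into the LiDAR/ego frame with that camera's own extrinsics.

```python
# Simplified sketch of the per-camera mapping (own naming, not base.py):
# every camera has its own intrinsics K and its own camera-to-LiDAR
# extrinsics (R, t), so each camera is mapped separately; the results all
# end up in the shared LiDAR/ego coordinate system.
import torch

def pixels_to_lidar(frustum, intrinsics, cam2lidar_rot, cam2lidar_trans):
    """frustum: (D, H, W, 3) points given as (u, v, depth) in image space.
    Returns (D, H, W, 3) points in the LiDAR/ego frame for ONE camera."""
    u, v, d = frustum.unbind(-1)
    # unproject: scale pixel coordinates by depth, then apply K^-1
    cam_pts = torch.stack((u * d, v * d, d), dim=-1)
    cam_pts = cam_pts @ torch.inverse(intrinsics).T      # camera frame
    # camera -> LiDAR: this camera's own rotation and translation
    return cam_pts @ cam2lidar_rot.T + cam2lidar_trans
```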

Ehangiun commented 1 year ago

> It is a separate mapping relationship for each camera. This function will be very helpful for you to understand the mapping between camera and LiDAR coordinate systems: https://github.com/mit-han-lab/bevfusion/blob/0e5b9edbc135bf297f6e3323249f7165b232c925/mmdet3d/models/vtransforms/base.py#L79

The function get_geometry is where I get confused. Here is what I think: when the frustum maps each pixel of the image, the RGB value of the original pixel is converted into the coordinate point. Is my understanding mistaken?

kentang-mit commented 1 year ago

Sorry for the delayed response. We do not map RGB values to 3D; instead, we project high-dimensional features to the BEV space. Hope that makes sense to you.
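As a rough illustration of what projecting features (rather than RGB) means, here is a minimal sketch with assumed grid parameters, not the repository's optimized BEV pooling: every lifted C-dimensional feature is dropped into the BEV cell that its 3D point falls into, and features landing in the same cell are accumulated.

```python
# Minimal sketch (assumed grid parameters, not the optimized BEV pooling):
# C-dimensional features -- not RGB values -- are scattered onto a BEV grid;
# features whose 3D points fall into the same cell are summed.
import torch

def splat_to_bev(points_xyz, feats, bev_size=180, cell=0.6, bound=54.0):
    """points_xyz: (P, 3) lifted points in the ego frame; feats: (P, C)."""
    ix = ((points_xyz[:, 0] + bound) / cell).long()
    iy = ((points_xyz[:, 1] + bound) / cell).long()
    keep = (ix >= 0) & (ix < bev_size) & (iy >= 0) & (iy < bev_size)
    ix, iy, feats = ix[keep], iy[keep], feats[keep]

    bev = torch.zeros(bev_size * bev_size, feats.shape[1])
    bev.index_add_(0, iy * bev_size + ix, feats)         # accumulate per BEV cell
    return bev.view(bev_size, bev_size, -1)
```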

Ehangiun commented 1 year ago

Thank you for your reply. I think I found my answer through this discussion!

HaiyangPeng commented 1 year ago

Hi @Ehangiun @kentang-mit. As shown in the attached feature maps, I find that the camera features in BEV show no significant object information (e.g., boundaries, contours) compared with the LiDAR features, so my question is: how can the camera branch learn object features and use them for downstream tasks (e.g., detection, segmentation)?

[attached image: camera vs. LiDAR BEV feature maps]

GaoPeng97 commented 10 months ago

> Hi @Ehangiun @kentang-mit. As shown in the attached feature maps, I find that the camera features in BEV show no significant object information (e.g., boundaries, contours) compared with the LiDAR features, so my question is: how can the camera branch learn object features and use them for downstream tasks (e.g., detection, segmentation)?

After the model converges, we get features with significant object information in the camera branch. For example, the visualized feature below is from the decoder of the camera branch.

[attached image: camera-branch decoder feature map]
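For anyone who wants to reproduce such a feature visualization, here is a generic sketch (assuming you have captured a BEV feature tensor of shape (C, H, W), e.g. with a forward hook; the helper name is my own):

```python
# Generic sketch for inspecting a BEV feature map of assumed shape (C, H, W),
# e.g. captured with a forward hook on the camera branch: collapsing the
# channel dimension into a per-cell magnitude makes object structure visible.
import torch
import matplotlib.pyplot as plt

def show_bev_feature(feat: torch.Tensor, out_path: str = "bev_feat.png"):
    heat = feat.detach().abs().mean(dim=0).cpu()          # (H, W) magnitude map
    heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-6)
    plt.imshow(heat.numpy(), cmap="viridis")
    plt.axis("off")
    plt.savefig(out_path, bbox_inches="tight")
    plt.close()
```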