nv-tlabs / lift-splat-shoot

Lift, Splat, Shoot: Encoding Images from Arbitrary Camera Rigs by Implicitly Unprojecting to 3D (ECCV 2020)

Questions about create_frustum and voxel_pooling #36

Open GYGWG opened 2 years ago

GYGWG commented 2 years ago

Hi, thanks for your excellent work! I am a little confused about the functions create_frustum and voxel_pooling. It would be great if you could give some further explanation.

In create_frustum, the code indicates that the output dimension is D x H x W x 3. What does this 3 represent? Is it an RGB value, or the coordinate position of a point in the frustum? I am also wondering whether the input to this function is the raw image or the extracted features.

For voxel_pooling, my understanding is that it sums the features of all points in the same voxel (pillar) using the cumsum trick. The output of this function has dimension B x C x Z x X x Y, where X, Y and Z are coordinates in the BEV grid (which are not the same as H, W and D). However, the paper says "perform sum pooling to create a CxHxW tensor", which really confused me. Why do we still want H and W here? Also, how do you get rid of Z?

GYGWG commented 2 years ago

I am also confused by the function get_geometry. It says "Determine the (x,y,z) locations (in the ego frame)"; however, the output dimension is still B x N x D x H/downsample x W/downsample x 3. I assume X, Y and Z are matched to H/downsample, W/downsample and D in this case? Again, what does this 3 stand for?

manueldiaz96 commented 2 years ago

> In create_frustum, the code indicates that the output dimension is D x H x W x 3. What does this 3 represent? Is it an RGB value, or the coordinate position of a point in the frustum?

It is the (x, y, z) tuple that gives the coordinates of the point in the frustum: x and y are the pixel position in the image and z is the depth, which is why all the z values for any given D are the same. LSS is built this way because it tries to learn where the objects are using discrete depth planes.

> I am also wondering whether the input to this function is the raw image or the extracted features?

create_frustum takes no inputs. What it uses is the original image shape, which is stored as an internal variable of the model, together with the downsampling factor, to find the width and height of the final feature map produced by the encoder backbone, that is, of the extracted features.
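To make the shape concrete, here is a minimal sketch of how such a frustum grid can be built. It uses NumPy rather than the repository's PyTorch, and the default values for `final_dim`, `downsample` and `dbound` are illustrative assumptions, not necessarily the configuration you are running.

```python
import numpy as np

def create_frustum(final_dim=(128, 352), downsample=16, dbound=(4.0, 45.0, 1.0)):
    """Return a D x fH x fW x 3 grid of (x_pixel, y_pixel, depth) triples."""
    ogfH, ogfW = final_dim                            # image size fed to the encoder (assumed)
    fH, fW = ogfH // downsample, ogfW // downsample   # feature-map size after the backbone
    ds = np.arange(*dbound, dtype=np.float32)         # one candidate depth per depth plane
    xs = np.linspace(0, ogfW - 1, fW, dtype=np.float32)  # pixel x of each feature column
    ys = np.linspace(0, ogfH - 1, fH, dtype=np.float32)  # pixel y of each feature row
    # Broadcast the three axes to a full D x fH x fW grid.
    d_grid, y_grid, x_grid = np.meshgrid(ds, ys, xs, indexing="ij")
    return np.stack((x_grid, y_grid, d_grid), axis=-1)

frustum = create_frustum()
print(frustum.shape)
```

Note that nothing image-dependent goes in: the grid only depends on the image size, the downsampling factor and the depth bins, and the last axis holds coordinates, not RGB.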

> However, the paper says "perform sum pooling to create a CxHxW tensor", which really confused me. Why do we still want H and W here?

Because they are used to find the proper xyz coordinates for each pixel in each depth plane D. They aren't used for anything else, if I am not mistaken.

> Also, how do you get rid of Z?

You get rid of Z by performing the sum pooling: it takes all points that fall in a voxel of infinite height (a pillar in the discretized 3D space) and adds their features together. This sums all features that land in the same BEV cell, where you can no longer distinguish their Z component.
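The pooling described above can be sketched as follows. This is a simplified NumPy illustration, not the repository's implementation: it accumulates with `np.add.at` instead of the cumsum trick, and it collapses Z by summing over it, which is an assumption of this sketch. The function name `sum_pool_to_bev` and its parameters are hypothetical.

```python
import numpy as np

def sum_pool_to_bev(points_xyz, feats, lower, step, grid_size):
    """Sum the features of all points that land in the same BEV cell.

    points_xyz: (N, 3) ego-frame coordinates, feats: (N, C).
    Returns a (C, X, Y) bird's-eye-view tensor.
    """
    lower, step = np.asarray(lower), np.asarray(step)
    # Discretize each point into integer voxel indices.
    idx = np.floor((points_xyz - lower) / step).astype(np.int64)
    # Drop points that fall outside the grid.
    keep = np.all((idx >= 0) & (idx < np.asarray(grid_size)), axis=1)
    idx, feats = idx[keep], feats[keep]
    nx, ny, nz = grid_size
    voxels = np.zeros((nx, ny, nz, feats.shape[1]), dtype=feats.dtype)
    # Unbuffered add: repeated indices accumulate instead of overwriting.
    np.add.at(voxels, (idx[:, 0], idx[:, 1], idx[:, 2]), feats)
    # Collapse the height axis so each pillar becomes one BEV cell.
    return voxels.sum(axis=2).transpose(2, 0, 1)

points = np.array([[0.5, 0.5, 0.0], [0.7, 0.2, 3.0], [1.5, 0.5, -1.0], [100.0, 0.0, 0.0]])
feats = np.array([[1.0, 10.0], [2.0, 20.0], [4.0, 40.0], [8.0, 80.0]])
bev = sum_pool_to_bev(points, feats, lower=(0.0, 0.0, -10.0), step=(1.0, 1.0, 20.0), grid_size=(2, 2, 1))
print(bev.shape)     # (2, 2, 2)
print(bev[:, 0, 0])  # [ 3. 30.]
```

The first two points differ in height but share a pillar, so their features are added; the last point lies outside the grid and is discarded.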

> I assume X, Y and Z are matched to H/downsample, W/downsample and D in this case?

No, XYZ is just a 3D vector assigned to each pixel (which is indexed by D, H, W). This is how each pixel in the features is associated with its projection in 3D.
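As a rough illustration of why the shape stays D x H x W x 3, here is a sketch of the unprojection for a single camera. It assumes a pinhole camera with intrinsics `intrinsics` and a camera-to-ego rotation and translation, and it skips the undoing of image augmentation; the function name `frustum_to_ego` and all values below are made up for the example.

```python
import numpy as np

def frustum_to_ego(frustum, intrinsics, cam_to_ego_rot, cam_to_ego_trans):
    """Map (x_pixel, y_pixel, depth) triples to (x, y, z) in the ego frame.

    frustum: (D, H, W, 3). The output has the same shape: the grid layout
    is unchanged, only the 3-vector stored at each cell is transformed.
    """
    u, v, d = frustum[..., 0], frustum[..., 1], frustum[..., 2]
    # Homogeneous pixel coordinates scaled by depth: (u*d, v*d, d).
    pix = np.stack((u * d, v * d, d), axis=-1)
    # Pixel -> camera frame -> ego frame, folded into one 3x3 matrix.
    combine = cam_to_ego_rot @ np.linalg.inv(intrinsics)
    return pix @ combine.T + cam_to_ego_trans

# Toy frustum: 2 depth planes, 1 row, 2 columns.
frustum = np.array([[[[50.0, 50.0, 5.0], [150.0, 50.0, 5.0]]],
                    [[[50.0, 50.0, 10.0], [150.0, 50.0, 10.0]]]])
K = np.array([[100.0, 0.0, 50.0], [0.0, 100.0, 50.0], [0.0, 0.0, 1.0]])
ego = frustum_to_ego(frustum, K, np.eye(3), np.array([1.0, 0.0, 2.0]))
print(ego.shape)     # (2, 1, 2, 3)
print(ego[1, 0, 1])  # [11.  0. 12.]
```

D, H and W only index which feature pixel and depth plane you are looking at; the trailing 3 is the ego-frame position of that cell.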