nv-tlabs / lift-splat-shoot

Lift, Splat, Shoot: Encoding Images from Arbitrary Camera Rigs by Implicitly Unprojecting to 3D (ECCV 2020)

Depth distribution and context vector of a pixel #33

VeeranjaneyuluToka opened 1 year ago

VeeranjaneyuluToka commented 1 year ago

Wanted to understand the lift stage a bit more. The paper says the lift stage is where the 2D-to-3D conversion happens.
As a first step in this process they 'generate representations at all possible depths for each pixel', which is what they call D, generated from 4.0 to 45.0 with a step of 1.0. So the depth distribution is defined over depths from 4.0 to 45.0 in steps of 1.0, isn't it? Then there is also a context vector C, and I am not sure how it gets generated for each pixel.

It would be a great help if anybody could give a little more explanation of both of these (the depth distribution and the context vector of each pixel).
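
For concreteness, I assume the depth values are generated with something like this (a minimal sketch based on a dbound = [4.0, 45.0, 1.0] grid config):

```python
import torch

# Sketch: the discrete candidate depths, assuming dbound = [4.0, 45.0, 1.0]
dbound = [4.0, 45.0, 1.0]
ds = torch.arange(*dbound, dtype=torch.float)  # tensor([4., 5., ..., 44.])
D = len(ds)                                    # 41 candidate depths per pixel
```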

manueldiaz96 commented 1 year ago

The context vector is the image features. Both are generated in the last layer of the camera encoder. Here you can see the depth distribution, called depth, and the depth-weighted features, called new_x. The first D values of the output of the depthnet module in camencode are the categorical depth scores, and the rest are the image features.

VeeranjaneyuluToka commented 1 year ago

@manueldiaz96 Thanks for your comment. Let me summarize what happens in CamEncode and check whether my understanding is correct.

Extract features from each input image by taking the last two blocks of EfficientNet (reduction_5 and reduction_4), let's say x1 and x2. Upsample x1 and fuse it with x2 (I believe this is a kind of FPN to handle different scales); let's call the result fx. Then run depthnet on fx (using a softmax here, though I am not really sure why), and this gives the categorical depth you mentioned, stored in the first D values.

As for new_x, is it a projection of the categorical depth onto the image features?

manueldiaz96 commented 1 year ago

Extract features from each input image by taking the last two blocks of EfficientNet (reduction_5 and reduction_4), let's say x1 and x2.

Yes, that is what the code does.

Upsample x1 and fuse it with x2 (I believe this is a kind of FPN to handle different scales); let's call the result fx.

You can just call it x as they do in the code. But yes, this is what is happening in line 81 of the models.py file.
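
If it helps, a minimal sketch of that fusion step could look like the following; the channel counts are EfficientNet-B0 values used for illustration, not copied from models.py:

```python
import torch
import torch.nn as nn

class UpFuse(nn.Module):
    """Sketch of the FPN-style fusion: upsample the deeper feature map and
    concatenate it with the shallower one before a small conv block."""
    def __init__(self, c1, c2, c_out):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=True)
        self.conv = nn.Sequential(
            nn.Conv2d(c1 + c2, c_out, kernel_size=3, padding=1),
            nn.BatchNorm2d(c_out),
            nn.ReLU(inplace=True),
        )

    def forward(self, x1, x2):
        # x1: reduction_5 features (lower resolution), x2: reduction_4 features
        x1 = self.up(x1)
        return self.conv(torch.cat([x2, x1], dim=1))

fuse = UpFuse(c1=320, c2=112, c_out=512)   # EfficientNet-B0 channel counts (illustrative)
x1 = torch.randn(1, 320, 4, 11)            # reduction_5
x2 = torch.randn(1, 112, 8, 22)            # reduction_4
x = fuse(x1, x2)                           # -> (1, 512, 8, 22)
```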

Then run depthnet on fx (using a softmax here, though I am not really sure why), and this gives the categorical depth you mentioned, stored in the first D values.

Look at the definition of depthnet: it is a single 2D convolution that takes 512 channels as input and outputs D+C channels. From depthnet we get both the categorical depth distribution (depth) and the context features that are turned into new_x.

They split the features produced by depthnet into two groups: the first D channels are used to predict the categorical depth distribution, and these are the only features passed through a softmax operation here; the remaining C channels of the output (whose channel dimension has size D+C) are the context features, or simply the image features. The two are later multiplied to form what is called $c_d$ in equation 1.
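
In code, the split looks roughly like this (a simplified sketch, not the exact CamEncode code; D = 41 and C = 64 are assumed default values):

```python
import torch
import torch.nn as nn

D, C = 41, 64                                   # depth bins and context channels (assumed defaults)
depthnet = nn.Conv2d(512, D + C, kernel_size=1) # a single 1x1 conv, 512 -> D + C channels

feats = torch.randn(6, 512, 8, 22)              # fused image features for 6 cameras (sketch shapes)
out = depthnet(feats)                           # (6, D + C, 8, 22)

depth = out[:, :D].softmax(dim=1)               # categorical depth distribution, alpha in Eq. 1
context = out[:, D:D + C]                       # context vector c in Eq. 1 (no softmax applied)
```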

As for new_x, is it a projection of the categorical depth onto the image features?

Using the depth variable (the categorical depth distribution, $\alpha$ in equation 1), the features in x[:, self.D:(self.D + self.C)] (the context vector, $c$ in equation 1) are multiplied by the depth distribution to obtain new_x. This new_x is the scaled context vector $c_d$ from equation 1; the multiplication is not a projection.
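
Continuing the sketch above, that multiplication is just an outer product over the depth and channel dimensions:

```python
# Outer product: every context feature is weighted by the probability of each depth bin.
new_x = depth.unsqueeze(1) * context.unsqueeze(2)   # (6, C, D, 8, 22)
```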

As they describe in the 4th paragraph in subsection 3.1:

(screenshot of the relevant paragraph from subsection 3.1 of the paper)

These new_x features are then projected to 3D in the voxel_pooling function using the predefined depth planes.
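
To give an idea of what that projection involves, here is a simplified sketch that unprojects every feature-map pixel along each depth plane; the intrinsics and shapes are made up, and the repo's get_geometry additionally undoes the image augmentations and maps the points into the ego frame:

```python
import torch

fH, fW = 8, 22                                   # feature-map size (sketch)
ds = torch.arange(4.0, 45.0, 1.0)                # predefined depth planes, D = 41
D = len(ds)

us = torch.linspace(0, 352 - 1, fW)              # pixel u coordinates of the feature grid
vs = torch.linspace(0, 128 - 1, fH)              # pixel v coordinates
vv, uu = torch.meshgrid(vs, us, indexing='ij')

K = torch.tensor([[250.0,   0.0, 176.0],         # made-up camera intrinsics
                  [  0.0, 250.0,  64.0],
                  [  0.0,   0.0,   1.0]])
Kinv = torch.inverse(K)

# homogeneous pixel coordinates scaled by each depth, then unprojected to the camera frame
pix = torch.stack([uu, vv, torch.ones_like(uu)], dim=-1)   # (fH, fW, 3)
pts = ds.view(D, 1, 1, 1) * pix.unsqueeze(0)               # (D, fH, fW, 3)
cam_pts = pts @ Kinv.T                                     # one 3D point per (depth, pixel)
```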

D here is the number of discrete depth values we are considering, as stated in the paper (subsection 3.1, third paragraph):

(screenshot of subsection 3.1, third paragraph of the paper)

Also look at the definition of categorical depth distributions in the Categorical Depth Distribution Network for Monocular 3D Object Detection paper, specifically subsection 1.1:

(screenshot of subsection 1.1 of that paper)

ZiyuXiong commented 1 year ago

Hi there, just a few more questions about the choice of D, as I found no ablation study on it:

  1. Why limit the range of D to [4, 45], considering that the range of both xbound and ybound is [-50, 50]?
  2. Why not keep the resolution of D (1 m) the same as the grid bins (0.5 m), to reduce the depth projection error?

Thanks~

manueldiaz96 commented 1 year ago

@ZiyuXiong, although I am not the author of the paper, the following is my intuition:

Why limit the range of D to [4, 45], considering that the range of both xbound and ybound is [-50, 50]?

I would think that the lower limit corresponds to a safe area around the car (including the car itself), since the reference frame is located at the rear axle. About the upper limit, I am not sure.

Why not keep the resolution of D (1 m) the same as the grid bins (0.5 m), to reduce the depth projection error?

I am not sure it would reduce the depth projection error, or at least not by much for vehicles (given their typical dimensions). Matching the depth step to the grid resolution would increase the processing time (at least doubling the time spent on the projection, and even more if you also increase the input image size), whereas the gaps between the depth planes in the BEV (1 m = 2 pixels) can easily be filled in by the bevencode network instead. bevencode only needs some cues from the images to produce the final segmentation.
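
As a rough back-of-the-envelope check (feature-map size assumed for a 128x352 input at stride 16, purely illustrative):

```python
import torch

fH, fW, n_cams = 8, 22, 6                     # stride-16 features of a 128x352 input, 6 cameras
D_1m  = len(torch.arange(4.0, 45.0, 1.0))     # 41 depth planes at 1.0 m spacing
D_05m = len(torch.arange(4.0, 45.0, 0.5))     # 82 depth planes at 0.5 m spacing

print(D_1m  * fH * fW * n_cams)               # 43296 frustum points to pool
print(D_05m * fH * fW * n_cams)               # 86592 -> roughly double the splat work
```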

So I would guess it all comes down to what compromises you can make to get the best output at the latency you need.