Maybe to add robustness to the system, so it does not depend on all cameras for each segment of the final grid? That way you can also see how much the other cameras' info might contribute to the one that is currently disabled.
@manueldiaz96, I am not sure I understand the robustness you mentioned. It looks like the number of cameras can be anything from 1 to many; I was thinking it should be 1 to 6, since the nuScenes dataset was generated using 6 cameras, but it looks like that is not the case. I have read through the paper as well; they mention something about this, but I have not fully understood it: "We call the layer 'Frustum Pooling' because it handles converting the frustums produced by n images into a fixed dimensional CxHxW tensor independent of the number of cameras n." So let me know if you have any more details on this. Thanks!
Hello @VeeranjaneyuluToka
The idea here is that by using the intrinsics, extrinsics, predicted categorical depth and sum pooling, you could theoretically do the projection of features coming from any number of cameras (N) into the 3D space centred on the robot.
The case of NuScenes limits us to choosing between one and six, but the method theoretically works for any number of cameras (for example, 1000). In practice, this number will be limited by your computational capabilities (GPU memory) and how fast you want the system to run.
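To make this concrete, here is a rough shape-level sketch (not the repo's code; all sizes and variable names are just illustrative) of why the output does not depend on N: once the per-camera frustum features are flattened into one big set of points and sum-pooled into the BEV grid, the camera dimension disappears.

```python
import torch

# Hedged shape-level sketch (not the repo's exact code): after the "lift" step each
# camera contributes a point cloud of features. Pooling them into a shared BEV grid
# removes the camera dimension N, so the output shape is fixed regardless of N.
B, N, D, fH, fW, C = 1, 6, 41, 8, 22, 64      # N could just as well be 1 or 1000
X, Y = 200, 200                                # BEV grid resolution (illustrative)

cam_feats = torch.randn(B, N, D, fH, fW, C)    # per-camera frustum features
flat_feats = cam_feats.view(-1, C)             # all points from all cameras together

# each frustum point falls into some BEV cell; here the assignment is random,
# just to show the mechanics of sum pooling
cell_idx = torch.randint(0, X * Y, (flat_feats.shape[0],))

bev = torch.zeros(X * Y, C)
bev.index_add_(0, cell_idx, flat_feats)        # sum features landing in the same cell
bev = bev.view(X, Y, C)                        # fixed-size BEV tensor, no N anywhere
print(bev.shape)                               # torch.Size([200, 200, 64])
```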
I hope this helps
@manueldiaz96, thanks for your reply and sorry for the late comment again. I was actually debugging the code and analysing it to try to understand more. I still have a couple of questions on the same topic.
What do you mean by predicted categorical depth? If I understand correctly, the depth map/distribution is generated using a certain interval (4.0 to 45.0 with a step of 1.0). Is this just an initial random initialization that is later tuned with an architecture (depthnet in the implementation), and is that what you mean by predicted categorical depth?
I believe the sum pooling is what is being calculated in voxel_pooling, is it not?
I have one more question: if we have the extrinsics, the intrinsics and a 2D map (an image in our case), then a 3D map can be calculated straight away, as mentioned in the link below. https://stackoverflow.com/questions/7836134/get-3d-coordinates-from-2d-image-pixel-if-extrinsic-and-intrinsic-parameters-are
Is it just because the depth generation needs to be added to the learning process that we have an initial depth distribution (4.0, 45.0 with step 1.0) and then tune it as I mentioned above? And I think the projection follows the approach stated in the stackoverflow link that I posted above?
First, let me start with what I (and, I think, the authors) mean by categorical depth. Imagine that you would like to predict what objects are in front of your robot up to 50 meters away using only a camera. There are many approaches in the literature that tackle this specific problem (monocular depth estimation, MDE).
In our case, we do not want to predict the depth for each pixel in the image, but instead the depth for each pixel in the feature map given by the backbone (which has been downsampled). We could approach this by trying to predict the exact depth value, which implies an infinite range of values in [0 m, 50 m]. This can be done following MDE approaches, but it would require proper depth ground truth and a specific depth prediction head architecture.
Instead, Philion and Fidler propose to leverage something from a previous paper (check figure 2) which is to define a set of planes (or depths) from the start and not have to deal with continuous depth prediction. For the case of LSS, they propose a range between [4m, 45m] from the camera, where each plane is separated by 1m.
Seen this way, it is very similar to a classification problem, since we have a proxy for a set of classes: the depth bin the pixel belongs to. This is why it is called categorical depth. We are trying to classify whether a pixel belongs to class 1, 2, 3, ..., which in this case means that the pixel has a depth of 4, 5, 6, ... meters in front of the camera. In practice, the prediction is a vector containing the probability that the pixel belongs to each specific depth, e.g. [0.002, 0.004, 0.89, ...], just like in normal classification with NNs. This is also why pixels can only be placed at 4, 5, 6, ... meters and not at 4.5, 5.78 or 6.21: those values are already defined when creating the frustum.
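To make the classification analogy concrete, here is a minimal sketch (the depth bins match the [4 m, 45 m] / 1 m setup above; the logits are just random placeholders for what a depth head would output):

```python
import torch
import torch.nn.functional as F

# Hedged sketch: categorical depth for one feature-map pixel. The 41 depth bins
# (4, 5, ..., 44 m) play the role of classes; the network outputs logits and a
# softmax turns them into a probability vector like [0.002, 0.004, 0.89, ...].
depth_bins = torch.arange(4.0, 45.0, 1.0)     # 41 predefined depth planes
logits = torch.randn(len(depth_bins))         # placeholder for a depth head's output
alpha = F.softmax(logits, dim=0)              # categorical depth distribution for this pixel

best = alpha.argmax()
print(f"most likely depth: {depth_bins[best].item():.0f} m "
      f"with p={alpha[best].item():.2f}")
```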
This is the distribution that the network needs to learn, which is called alpha in the paper (or depth in the camencode model). Now, imagine that we have a perfect network: it would always predict a one-hot vector for the depth bin the pixel belongs to, [0,0,0, ... ,1,0,0, ... ,0,0,0], which is exactly what they say in the first sentence after equation 1.
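As a small illustration of that sentence (this is a simplified, single-pixel version of the lift, not the exact camencode code; D, C and the chosen bin index are just example values), a one-hot alpha places the pixel's feature at exactly one depth plane:

```python
import torch

# Hedged sketch of the "lift": the per-pixel context feature is weighted by the
# categorical depth distribution (alpha), an outer product over the D and C dims.
# With a perfect one-hot alpha, the feature lands at exactly one depth plane.
D, C = 41, 64
alpha = torch.zeros(D)
alpha[7] = 1.0                       # perfect prediction: the pixel falls in bin 7 (11 m)
context = torch.randn(C)             # the pixel's image feature from the backbone

lifted = alpha[:, None] * context[None, :]   # (D, C): feature weighted per depth plane
print(lifted.abs().sum(dim=1).nonzero())     # only row 7 is non-zero
```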
And since we are learning these alpha/depth values per image, the method is independent of the number of cameras we use to train, because the prediction of these alpha distributions is based only on what is in each image.
So to answer your questions specifically:
What do you mean by predicted categorical depth?
I mean that they define a set of depth planes, positioned from 4 m to 45 m in front of the camera and separated by a distance of 1 m. They are predefined when creating the frustum.
I believe, or if I understand correctly, the depth map/distribution is generated using a certain interval (4.0 to 45.0 with a step of 1.0). Is this just an initial random initialization that is later tuned with an architecture (depthnet in the implementation), and is that what you mean by predicted categorical depth?
No, the depth values (or depth planes) are set up from the start. You can see this in the create_frustum method, which takes self.grid_conf to find the extent of the grid and creates an array with all the depth values, using self.grid_conf['dbound'] as the args for the arange method. This is why it has an asterisk before the variable: it unpacks the [start, stop, step] list into arange's arguments.
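Here is a simplified sketch of just the depth-plane part (variable names mirror the repo, but this is not the full create_frustum, which also builds the u, v pixel coordinates of the frustum; the feature-map size is an example value):

```python
import torch

# Hedged sketch: the depth planes come straight from grid_conf['dbound'].
grid_conf = {'dbound': [4.0, 45.0, 1.0]}
fH, fW = 8, 22                                          # downsampled feature-map size (example)

ds = torch.arange(*grid_conf['dbound'], dtype=torch.float)   # asterisk unpacks [start, stop, step]
print(ds)            # tensor([ 4.,  5., ..., 44.])  -> 41 fixed depth planes
ds = ds.view(-1, 1, 1).expand(-1, fH, fW)                    # one depth value per feature-map cell
print(ds.shape)      # torch.Size([41, 8, 22])
```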
I believe the sum pooling is what is being calculated in voxel_pooling, is it not?
I am not sure I understand your question. The sum pooling is indeed used in the voxel_pooling method; more specifically, it is being applied in line 229.
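If it helps, here is a minimal toy example of what the sum pooling does (the repo itself uses a sorted cumulative-sum trick for efficiency, so this is only meant to show the effect): points that fall into the same BEV cell have their features added together.

```python
import torch

# Hedged, minimal example of sum pooling over voxels (not the repo's implementation).
feats = torch.tensor([[  1.0,   2.0],
                      [ 10.0,  20.0],
                      [100.0, 200.0]])        # 3 lifted points, 2 channels each
cell_idx = torch.tensor([0, 5, 0])            # points 0 and 2 land in the same cell

bev = torch.zeros(9, 2)                       # a tiny 3x3 grid, flattened
bev.index_add_(0, cell_idx, feats)            # accumulate features per cell
print(bev[0])                                 # tensor([101., 202.]) -> summed contributions
```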
I have one more question: if we have the extrinsics, the intrinsics and a 2D map (an image in our case), then a 3D map can be calculated straight away, as mentioned in the link below.
No, that answer is misleading, because they even say: "Add them a z=1, so they are now 3d. Project them as follows." This means they are choosing a specific depth, and that depth is exactly what we are trying to learn, in an approximate manner, from the categorical depth distribution vector. It is not possible to project from 2D back to 3D by only knowing the intrinsics and extrinsics; if it were, tasks such as MDE would not exist. The challenge is that, since images are generated by the projection of light rays onto the camera sensor, we lose information about the 3D world when projecting it to the image plane.
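A quick sketch of what I mean (the intrinsic values here are made up): unprojecting a pixel with only K gives you a ray, and every depth along that ray is consistent with the same pixel.

```python
import numpy as np

# Hedged sketch of why intrinsics/extrinsics alone are not enough: unprojecting a
# pixel only gives a ray. Any depth d along that ray maps back to the same pixel,
# so without (learning) d the 3D point is undetermined. K values are made up.
K = np.array([[1000.0,    0.0, 640.0],
              [   0.0, 1000.0, 360.0],
              [   0.0,    0.0,   1.0]])
u, v = 700.0, 400.0
ray = np.linalg.inv(K) @ np.array([u, v, 1.0])   # direction of the ray through pixel (u, v)

for d in (4.0, 20.0, 45.0):
    print(d * ray)    # three different 3D points, all projecting to the same pixel
```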
Is it just because the depth generation needs to be added to the learning process?
Yes, we are learning the depth estimation (not sure generation would apply here).
Do we have an initial depth distribution (4.0, 45.0 with step 1.0) that is then tuned as I mentioned above?
No, the depth planes are predefined and they are neither changed nor tuned. What we are tuning is the categorical prediction, i.e. which plane each pixel belongs to.
And I think the projection follows the approach stated in the stackoverflow link that I posted above?
No, it does not. As I said, they 'assume' a depth value. We want to estimate/learn this specifically.
I hope this helps!
If you want to better understand how cameras and projective geometry work, I recommend this excellent series of blog posts about understanding the camera matrix; in the last post they have a very useful demo: Post 1, Part 2, Part 3. In brief, the intrinsic matrix tells you how the light interacts with the sensor in the camera, and the extrinsic matrix tells you how the camera is positioned in 3D space. Neither of them gives you information about how far away the objects captured by the pixels in the image are.
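As a last illustration (using the same made-up intrinsics as in the earlier sketch): the forward projection divides by z, which is exactly where the depth information is lost, so two points at different depths along the same ray land on the same pixel.

```python
import numpy as np

# Hedged sketch of the information loss in the forward direction: two 3D points at
# different depths along the same ray project to the same pixel, because the
# projection divides by z. K is the same made-up intrinsic matrix as above.
K = np.array([[1000.0,    0.0, 640.0],
              [   0.0, 1000.0, 360.0],
              [   0.0,    0.0,   1.0]])

for X_cam in (np.array([0.6, 0.4, 10.0]),      # a point 10 m away
              np.array([1.2, 0.8, 20.0])):     # a point on the same ray, 20 m away
    p = K @ X_cam
    print(p[:2] / p[2])    # both print [700. 400.] -> the depth is gone
```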
I want to know: why is Ncams=5 when training? I found: