steph1793 / Voxelnet

:mag_right: :oncoming_automobile: :articulated_lorry: :walking: :bike: 3D objects detections in LIDAR point clouds for autonomous driving
GNU General Public License v3.0

The meaning of several properties in dataset #2

Open kathy-lee opened 4 years ago

kathy-lee commented 4 years ago

Hi, thanks a lot for sharing your nice Voxelnet code! I am not so clear about several properties of the dataset, could you explain them a little bit please? These properties in the generation of the dataset (data.py): "pos_equal_one", "neg_equal_one", "targets", "pos_equal_one_reg", "pos_equal_one_sum", "neg_equal_one_sum". Many thanks in advance!

kathy-lee commented 4 years ago

Hi, could you please explain a little bit about the parameter 'cfg.LIDAR_COORD' in config.py? Is this the coordinate shift from camera to lidar? Many thanks!

steph1793 commented 4 years ago

Hi @kathy-lee First of all, sorry for the delay.

To understand those objects, I will first explain the overall architecture globally (sorry if redundant, I may be explaining some things you already know).

As explained in the paper, the voxelnet is made of three parts: the Feature Learning Network, which takes the voxelized pointcloud as input, the convolutional middle layers, and the Region Proposal Network (RPN), which is the last part.

The RPN outputs two tensors: the regression map and the probability map.

Why are they called maps? To explain it quickly, the voxelnet takes as input a 3D grid where each cell has characteristics stored in a vector. The network then transforms this 3D grid into 2D maps where each cell can be roughly interpreted as the result of a birdview and contains a vector encoding some characteristics.

It is in this sense that, for a given pointcloud, the voxelnet outputs these two maps.

We've already seen why the first two dimensions of the maps are X and Y (birdview). But what about the last dimension, i.e. the dimension of the vector encoded at each cell, for both maps?

Before explaining this, let's recall that anchors are the candidate bounding boxes we parameterize over our pointcloud grid space. Of course the real bounding boxes will not match those anchors perfectly, but well-defined anchors must be chosen so that every real bounding box can find a strong (very close) match among them. We have positive anchors, which are the ones with a strong match with the real bounding boxes, and negative ones, which do not contain the object. A bounding box has 7 features: (x, y, z) the center, w, l, h the width, length, height, and r the rotation around the Z axis.

We have defined w × h anchor positions, where each position carries two orientations (0° and 90° around the Z axis); this was defined by the authors for simplicity and gives a total of w × h × 2 anchors.
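
To give a rough idea, here is a minimal sketch of how such an anchor grid can be built (the grid size, ranges, anchor dimensions and function name are just illustrative, not necessarily the exact values used in this repo):

```python
import numpy as np

# Sketch: build a grid of anchor centers over the birdview area, with two
# orientations (0° and 90° around Z) per position -> w*h*2 anchors in total.
# Grid size, ranges and anchor dimensions below are illustrative (car-like).
def make_anchors(x_range=(0.0, 70.4), y_range=(-40.0, 40.0),
                 w_grid=176, h_grid=200,
                 size_lwh=(3.9, 1.6, 1.56), anchor_z=-1.0):
    xs = np.linspace(x_range[0], x_range[1], w_grid)
    ys = np.linspace(y_range[0], y_range[1], h_grid)
    cx, cy = np.meshgrid(xs, ys)                     # each (h_grid, w_grid)
    cx = np.tile(cx[..., np.newaxis], 2)             # duplicate per orientation
    cy = np.tile(cy[..., np.newaxis], 2)
    cz = np.full_like(cx, anchor_z)
    l = np.full_like(cx, size_lwh[0])
    w = np.full_like(cx, size_lwh[1])
    h = np.full_like(cx, size_lwh[2])
    r = np.zeros_like(cx)
    r[..., 1] = np.pi / 2                            # second orientation: 90°
    # 7 parameters per anchor: (x, y, z, h, w, l, r)
    return np.stack([cx, cy, cz, h, w, l, r], axis=-1)   # (h_grid, w_grid, 2, 7)

anchors = make_anchors()
print(anchors.shape)   # (200, 176, 2, 7)
```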

Well, for the regression map, Voxelnet will output 2 bounding boxes per cell (which explains why the last dimension is 14 = 2 × 7); they are actually the same bounding box at a given cell but with a different orientation (this is why I wanted to recall the anchors defined above).

And for the probability map, we also output 2 probabilities at every cell of the map, each probability representing the chance that there is a bounding box centered at that cell with one of the 2 orientations.
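
So, as a quick shape check (again just a sketch, with made-up feature-map sizes):

```python
import numpy as np

# Sketch: with a birdview feature map of size (H, W) and 2 anchors per cell,
# the RPN head produces one score per anchor and 7 box parameters per anchor.
H, W = 200, 176                       # made-up feature-map size
prob_map = np.zeros((H, W, 2))        # objectness score for each orientation
reg_map = np.zeros((H, W, 14))        # 2 anchors * 7 params (x, y, z, h, w, l, r)

# Reading one cell: reg_map[i, j, :7] is the box regressed for the 0° anchor,
# reg_map[i, j, 7:] the one for the 90° anchor, prob_map[i, j] their two scores.
boxes_per_cell = reg_map.reshape(H, W, 2, 7)
```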

PS: Voxelnet does not really work like object detection networks such as YOLO, which are able to learn to detect multiple object classes. When you build and train a Voxelnet, it is for a single, specific class of object.

Now that all of this has been explained, I will actually answer your question:

Some of the objects that you pointed out can be considered as masks used during training, as in the sketch below.
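
To make that a bit more concrete, here is my reading of how such masks usually enter the VoxelNet loss; the exact definitions live in data.py and the model code of this repo, so take this as a sketch rather than the actual implementation:

```python
import numpy as np

# Typical VoxelNet training masks (shapes for one sample, my reading):
#   pos_equal_one      (H, W, 2)  : 1 where the anchor is a positive match
#   neg_equal_one      (H, W, 2)  : 1 where the anchor is negative (background)
#   targets            (H, W, 14) : regression residuals for the positive anchors
#   pos_equal_one_reg  (H, W, 14) : pos_equal_one repeated over the 7 box params
#   pos_equal_one_sum / neg_equal_one_sum : counts used to normalize each term

def smooth_l1(x, sigma=3.0):
    # smooth-L1 penalty used for the regression term in the paper
    s2 = sigma ** 2
    return np.where(np.abs(x) < 1.0 / s2, 0.5 * s2 * x ** 2, np.abs(x) - 0.5 / s2)

def rpn_loss_sketch(p_map, r_map, pos_equal_one, neg_equal_one, targets,
                    pos_equal_one_reg, pos_equal_one_sum, neg_equal_one_sum,
                    alpha=1.5, beta=1.0, eps=1e-6):
    # classification: positive anchors pushed towards 1, negatives towards 0
    cls_pos = -pos_equal_one * np.log(p_map + eps) / (pos_equal_one_sum + eps)
    cls_neg = -neg_equal_one * np.log(1.0 - p_map + eps) / (neg_equal_one_sum + eps)
    # regression: only positive anchors contribute, through the broadcast mask
    reg = pos_equal_one_reg * smooth_l1(r_map - targets) / (pos_equal_one_sum + eps)
    return alpha * cls_pos.sum() + beta * cls_neg.sum() + reg.sum()
```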

I hope that it answers your question ;).

steph1793 commented 4 years ago

Also, for the LIDAR_COORD, it is a shift applied to place the pointclouds in the coordinate system we will be working in (for a better view of the cars or the pedestrians, I guess); but it is not a shift from the camera to the lidar. You will find the methods that actually do that in the utils script.
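
In other words, the shift is just a constant offset added to every point before voxelization, something along these lines (the offset values here are illustrative, not the actual cfg.LIDAR_COORD):

```python
import numpy as np

# Sketch: apply a fixed shift so the region of interest fits the grid we
# voxelize. The offset values are illustrative, NOT the actual cfg.LIDAR_COORD.
LIDAR_COORD = np.array([0.0, 40.0, 3.0])   # shift along (x, y, z)

def shift_pointcloud(points):
    # points: (N, 4) array of x, y, z, reflectance read from a KITTI .bin file
    shifted = points.copy()
    shifted[:, :3] += LIDAR_COORD
    return shifted
```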

kathy-lee commented 4 years ago

Hi @steph1793, thanks a lot for your very comprehensive reply! I'm trying to apply your library to another dataset. With your explanation and the voxelnet paper, I'm now much clearer about these variables and their functions in your code. One more question about MATRIX_P2, MATRIX_T_VELO_2_CAM and MATRIX_R_RECT_0 in config.py: I am not clear about their use. If I want to use another dataset, can I just use the P2, T_VELO_2_CAM and R_RECT_0 read from each sample's calibration file instead of these three from config.py?
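
Concretely, I mean reading them per sample, something like this (just a sketch, assuming the standard KITTI calib file layout):

```python
import numpy as np

# Sketch: read P2, Tr_velo_to_cam and R0_rect for one sample from a standard
# KITTI calibration file, instead of the fixed matrices in config.py.
def load_kitti_calib(calib_path):
    mats = {}
    with open(calib_path) as f:
        for line in f:
            if ':' not in line:
                continue
            key, vals = line.split(':', 1)
            mats[key.strip()] = np.array([float(v) for v in vals.split()])
    P2 = mats['P2'].reshape(3, 4)
    T_velo_2_cam = mats['Tr_velo_to_cam'].reshape(3, 4)
    R_rect_0 = mats['R0_rect'].reshape(3, 3)
    return P2, T_velo_2_cam, R_rect_0
```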

kathy-lee commented 4 years ago

Hi @steph1793, I have one more question about the rotation angle conversion in the definition of camera_to_lidar_box: `(x, y, z), h, w, l, rz = camera_to_lidar(x, y, z, T_VELO_2_CAM, R_RECT_0), h, w, l, -ry - np.pi / 2`. It seems `rz` in the camera coordinate system is directly converted to the lidar coordinate system by `-ry - np.pi / 2`; shouldn't it go through the cam-to-velo transform matrix? Many thanks!