kathy-lee opened this issue 4 years ago
Hi, could you please explain a little bit about the parameter `cfg.LIDAR_COORD` in config.py? Is this the shift from camera to lidar coordinates? Many thanks!
Hi @kathy-lee, first of all, sorry for the delay.
To understand those objects, I will first explain the overall architecture globally (sorry if redundant, I may be explaining some things you already know).
As explained in the paper, VoxelNet is made of three parts: the Feature Learning Network, which takes the voxelized pointcloud as input, the convolutional middle layers, and the Region Proposal Network (RPN), which is the last part.
The RPN outputs two tensors: the regression map and the probability map.
Why are they called maps? To explain it quickly, the voxelnet takes as input a 3D grid where each cell stores a feature vector. The network then transforms this 3D grid into 2D maps where each cell can roughly be interpreted as a birdview position and contains a vector encoding some characteristics.
It is in this sense that the voxelnet outputs two such maps for a pointcloud: a probability map of shape [X, Y, 2] and a regression map of shape [X, Y, 14].
We've already seen why the first two dimensions of the maps are X and Y (the birdview). But what about the last dimension, i.e. the size of the vector encoded at each cell, for both maps?
Before explaining this, let's recall that anchors are the candidate bounding boxes we can parameterize over our pointcloud grid space. Of course the real bounding boxes will not match these anchors perfectly, but well-defined anchors must be such that every real bounding box finds a strong (very close) match with some anchor. Positive anchors are the ones with a strong match with a real bounding box, and negative anchors are the ones that do not contain the object. A bounding box has 7 features: (x, y, z) the center, (w, l, h) the width, length and height, and r the rotation around the Z axis.
We define X*Y anchor positions (one per birdview cell), and each position carries two orientations (0° and 90° around the Z axis); the authors chose this for simplicity, which gives a total of X*Y*2 anchors.
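To make the anchor grid concrete, here is a minimal sketch of how such a grid could be built; the variable names, ranges and anchor dimensions below are illustrative (KITTI-car-like values), not necessarily the ones used in this repo's config.py:

```python
import numpy as np

# Illustrative birdview resolution and ranges (not the repo's actual config values)
X, Y = 200, 176
x_centers = np.linspace(0.0, 70.4, X)     # forward range in meters
y_centers = np.linspace(-40.0, 40.0, Y)   # lateral range in meters

cx, cy = np.meshgrid(x_centers, y_centers, indexing="ij")   # both of shape [X, Y]
cz = np.full_like(cx, -1.0)               # fixed height of the anchor center
w, l, h = 1.6, 3.9, 1.56                  # fixed anchor size (car-like, illustrative)

# One anchor per cell and per orientation -> X*Y*2 anchors of 7 features each
anchors = np.zeros((X, Y, 2, 7))
for k, rot in enumerate([0.0, np.pi / 2]):        # the 0° and 90° orientations
    anchors[:, :, k] = np.stack(
        [cx, cy, cz,
         np.full_like(cx, w), np.full_like(cx, l), np.full_like(cx, h),
         np.full_like(cx, rot)],
        axis=-1,
    )

flat_anchors = anchors.reshape(-1, 7)     # the X*Y*2 anchors mentioned above
```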
Well, for the regression map, VoxelNet outputs 2 bounding boxes per cell, which explains why the last dimension is 14 (2 boxes × 7 features each); they are actually the same box location at a given cell but with the two different orientations (this is why I wanted to recall the anchors we defined).
And for the probability map, we also output 2 probabilities per cell, one for each bounding box; each probability represents the chance that there is an object bounding box centered at the given cell with one of the 2 orientations.
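As a small illustration of how those two maps are read at a given cell (the variable names are mine, and the values are random placeholders rather than real network outputs):

```python
import numpy as np

X, Y = 200, 176                        # same illustrative birdview resolution as above
prob_map = np.random.rand(X, Y, 2)     # one score per orientation
reg_map = np.random.randn(X, Y, 14)    # 2 orientations * 7 regression values

i, j = 50, 60                          # pick one birdview cell
scores = prob_map[i, j]                # P(object at 0°), P(object at 90°)
deltas = reg_map[i, j].reshape(2, 7)   # the two predicted boxes, as deltas w.r.t. the anchors

best = int(np.argmax(scores))          # e.g. keep the more confident orientation at this cell
print(best, scores[best], deltas[best])
```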
PS: VoxelNet does not really work like object detection networks such as YOLO, which can learn to detect multiple object classes. When you build and train a VoxelNet, it is for a specific SINGLE class of object.
The objects that you pointed out can be considered (for some of them) as masks used during training; a small sketch of how they enter the loss follows the list below.
- `pos_equal_one`: with a shape of [X, Y, 2], it is used to mask the probability map in order to extract the probabilities of the positive anchors and compute the L_cls loss defined in the paper. In this tensor we have ones for positive anchors and zeros for negative anchors.
- `neg_equal_one`: the opposite of the previous tensor: we mask out the probabilities of the positive anchors and keep the probabilities of the negative anchors to compute the negative part of the L_cls loss defined in the paper.
- `pos_equal_one_reg`: this is the mask for the regression map (with a matching shape): ones for the positive anchors and zeros for the negative anchors.
- `pos_equal_one_sum`: for each pointcloud, the total number of positive anchors, used in the loss computation (as a normalizer).
- `neg_equal_one_sum`: the total number of negative anchors in a pointcloud.
- `targets`: note that during the loss computation, the loss is not computed directly between the real bounding boxes and the predicted bounding boxes. VoxelNet does not even output bounding boxes directly; instead it outputs a delta (a difference) between the real bounding boxes and the parameterized anchors (and since we know those anchors, we can recover the real boxes from the output of the network). So the loss is computed between the predicted deltas and the real deltas; you can find more details in section 2.2 of the paper. Here, `targets` are the real deltas between the ground-truth boxes and the positive anchors.
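To show how these tensors typically plug together, here is a simplified sketch of the loss described in the paper; it uses a plain L2 term instead of the paper's Smooth-L1, treats the two sums as scalars, and the exact shapes and weights in this repo may differ:

```python
import numpy as np

def voxelnet_loss(prob_map, reg_map, pos_equal_one, neg_equal_one,
                  targets, pos_equal_one_reg, pos_equal_one_sum, neg_equal_one_sum,
                  alpha=1.5, beta=1.0, eps=1e-6):
    """Simplified sketch of L = alpha * L_cls(pos) + beta * L_cls(neg) + L_reg."""
    # Classification loss on positive anchors: their probability should be close to 1.
    cls_pos = -pos_equal_one * np.log(prob_map + eps)
    cls_pos = cls_pos.sum() / (pos_equal_one_sum + eps)

    # Classification loss on negative anchors: their probability should be close to 0.
    cls_neg = -neg_equal_one * np.log(1.0 - prob_map + eps)
    cls_neg = cls_neg.sum() / (neg_equal_one_sum + eps)

    # Regression loss: only positive anchors contribute, and it is computed on the
    # deltas (targets), not directly on the boxes.  (Plain L2 here just to keep
    # the sketch short; the paper uses Smooth-L1.)
    reg = pos_equal_one_reg * (reg_map - targets) ** 2
    reg = reg.sum() / (pos_equal_one_sum + eps)

    return alpha * cls_pos + beta * cls_neg + reg
```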
I hope that it answers your question ;).
Also, for `LIDAR_COORD`, it is a shift used to place the pointclouds in the coordinate system we work in (for a better view of the cars or the pedestrians, I guess); it is not a shift from the camera to the lidar. You will find the methods that actually do the camera-to-lidar transform in the utils script.
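If it helps, such a shift is usually just a translation added to every point before cropping/voxelizing; the value below is purely illustrative, and the actual sign and axes used by `cfg.LIDAR_COORD` in this repo may differ:

```python
import numpy as np

# Hypothetical value: a translation added to every point before voxelization
LIDAR_COORD = np.array([0.0, 40.0, 3.0])   # illustrative, not the repo's actual value

def shift_to_working_frame(points):
    """points: [N, 4] array of (x, y, z, reflectance) in lidar coordinates."""
    shifted = points.copy()
    shifted[:, :3] += LIDAR_COORD           # move the cloud into the working grid's frame
    return shifted
```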
Hi @steph1793 ,
thanks a lot for your very comprehensive reply! I'm trying to apply your library to another dataset. With your explanation and the voxelnet paper, I'm now much clearer about these variables and their roles in your code. One more question about `MATRIX_P2`, `MATRIX_T_VELO_2_CAM` and `MATRIX_R_RECT_0` in config.py: I am not sure what they are used for. If I want to use another dataset, can I just use the `P2`, `T_VELO_2_CAM` and `R_RECT_0` read from each calibration file instead of these three from config.py?
Hi @steph1793 ,
I have one more question about the rotation angle conversion in the definition of `camera_to_lidar_box`:

```python
(x, y, z), h, w, l, rz = camera_to_lidar(x, y, z, T_VELO_2_CAM, R_RECT_0), h, w, l, -ry - np.pi / 2
```

It seems the rotation `ry` in the camera coordinate system is directly converted to `rz` in the lidar coordinate system by `-ry - np.pi / 2`; shouldn't it go through the cam-to-velo transform matrix? Many thanks!
Hi, thanks a lot for sharing your nice Voxelnet code! I am not so clear about several properties of the dataset, could you explain them a little bit please? These 6 properties in the dataset generation (data.py): `pos_equal_one`, `neg_equal_one`, `targets`, `pos_equal_one_reg`, `pos_equal_one_sum`, `neg_equal_one_sum`. Many thanks in advance!