kathy-lee opened this issue 4 years ago
Hi, could you please explain a little bit about the parameter `cfg.LIDAR_COORD` in config.py? Is this the shift from camera to lidar coordinates? Many thanks!
Hi @kathy-lee, first of all, sorry for the delay.
To understand those objects, I will first explain the overall architecture globally (sorry if redundant, I may be explaining some things you already know).
As explained in the paper, VoxelNet is made of three parts: the Feature Learning Network, which takes the voxelized pointcloud as input, the convolutional middle layers, and the Region Proposal Network (RPN), which is the last part.
The RPN outputs two tensors: the regression map and the probability map.
Why are they called maps? To explain it quickly, the voxelnet takes as input a 3D grid where each cell stores a feature vector. The network then transforms this 3D grid into 2D maps where each cell can roughly be interpreted as a birdview position and contains a vector encoding some characteristics.
It is in this sense that the voxelnet outputs two such maps for a pointcloud: a probability map of shape [X, Y, 2] and a regression map of shape [X, Y, 14].
We've already seen why the first two dimensions of the maps are X and Y (the birdview). But what about the last dimension, i.e. the size of the vector encoded at each cell, for both maps?
Before explaining this, let's recall that anchors are the candidate bounding boxes we can parameterize over our pointcloud grid space. Of course the real bounding boxes will not match these anchors perfectly, but well-defined anchors must be such that every real bounding box finds a strong (very close) match with some anchor. Positive anchors are the ones with a strong match with a real bounding box, and negative anchors are the ones that do not contain the object. A bounding box has 7 features: (x, y, z) the center, (w, l, h) the width, length and height, and r the rotation around the Z axis.
We define X*Y anchor positions (one per birdview cell), and each position carries two orientations (0° and 90° around the Z axis); the authors chose this for simplicity, which gives a total of X*Y*2 anchors.
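To make the anchor grid concrete, here is a minimal sketch of how such a grid could be built; the variable names, ranges and anchor dimensions below are illustrative (KITTI-car-like values), not necessarily the ones used in this repo's config.py:

```python
import numpy as np

# Illustrative birdview resolution and ranges (not the repo's actual config values)
X, Y = 200, 176
x_centers = np.linspace(0.0, 70.4, X)     # forward range in meters
y_centers = np.linspace(-40.0, 40.0, Y)   # lateral range in meters

cx, cy = np.meshgrid(x_centers, y_centers, indexing="ij")   # both of shape [X, Y]
cz = np.full_like(cx, -1.0)               # fixed height of the anchor center
w, l, h = 1.6, 3.9, 1.56                  # fixed anchor size (car-like, illustrative)

# One anchor per cell and per orientation -> X*Y*2 anchors of 7 features each
anchors = np.zeros((X, Y, 2, 7))
for k, rot in enumerate([0.0, np.pi / 2]):        # the 0° and 90° orientations
    anchors[:, :, k] = np.stack(
        [cx, cy, cz,
         np.full_like(cx, w), np.full_like(cx, l), np.full_like(cx, h),
         np.full_like(cx, rot)],
        axis=-1,
    )

flat_anchors = anchors.reshape(-1, 7)     # the X*Y*2 anchors mentioned above
```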
Well, for the regression map, VoxelNet outputs 2 bounding boxes per cell, which explains why the last dimension is 14 (2 boxes × 7 features each); they are actually the same box location at a given cell but with the two different orientations (this is why I wanted to recall the anchors we defined).
And for the probability map, we also output 2 probabilities per cell, one for each bounding box; each probability represents the chance that there is an object bounding box centered at the given cell with one of the 2 orientations.
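As a small illustration of how those two maps are read at a given cell (the variable names are mine, and the values are random placeholders rather than real network outputs):

```python
import numpy as np

X, Y = 200, 176                        # same illustrative birdview resolution as above
prob_map = np.random.rand(X, Y, 2)     # one score per orientation
reg_map = np.random.randn(X, Y, 14)    # 2 orientations * 7 regression values

i, j = 50, 60                          # pick one birdview cell
scores = prob_map[i, j]                # P(object at 0°), P(object at 90°)
deltas = reg_map[i, j].reshape(2, 7)   # the two predicted boxes, as deltas w.r.t. the anchors

best = int(np.argmax(scores))          # e.g. keep the more confident orientation at this cell
print(best, scores[best], deltas[best])
```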
PS: VoxelNet does not really work like object detection networks such as YOLO, which can learn to detect multiple object classes. When you build and train a VoxelNet, it is for a specific SINGLE class of object.
The objects that you pointed out can be considered (for some of them) as masks used during training; a small sketch of how they enter the loss follows the list below.
- `pos_equal_one`: with a shape of [X, Y, 2], it is used to mask the probability map in order to extract the probabilities of the positive anchors and compute the L_cls loss defined in the paper. In this tensor we have ones for positive anchors and zeros for negative anchors.
- `neg_equal_one`: the opposite of the previous tensor: we mask out the probabilities of the positive anchors and keep the probabilities of the negative anchors to compute the negative part of the L_cls loss defined in the paper.
- `pos_equal_one_reg`: this is the mask for the regression map (with a matching shape): ones for the positive anchors and zeros for the negative anchors.
- `pos_equal_one_sum`: for each pointcloud, the total number of positive anchors, used in the loss computation (as a normalizer).
- `neg_equal_one_sum`: the total number of negative anchors in a pointcloud.
- `targets`: note that during the loss computation, the loss is not computed directly between the real bounding boxes and the predicted bounding boxes. VoxelNet does not even output bounding boxes directly; instead it outputs a delta (a difference) between the real bounding boxes and the parameterized anchors (and since we know those anchors, we can recover the real boxes from the output of the network). So the loss is computed between the predicted deltas and the real deltas; you can find more details in section 2.2 of the paper. Here, `targets` are the real deltas between the ground-truth boxes and the positive anchors.
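To show how these tensors typically plug together, here is a simplified sketch of the loss described in the paper; it uses a plain L2 term instead of the paper's Smooth-L1, treats the two sums as scalars, and the exact shapes and weights in this repo may differ:

```python
import numpy as np

def voxelnet_loss(prob_map, reg_map, pos_equal_one, neg_equal_one,
                  targets, pos_equal_one_reg, pos_equal_one_sum, neg_equal_one_sum,
                  alpha=1.5, beta=1.0, eps=1e-6):
    """Simplified sketch of L = alpha * L_cls(pos) + beta * L_cls(neg) + L_reg."""
    # Classification loss on positive anchors: their probability should be close to 1.
    cls_pos = -pos_equal_one * np.log(prob_map + eps)
    cls_pos = cls_pos.sum() / (pos_equal_one_sum + eps)

    # Classification loss on negative anchors: their probability should be close to 0.
    cls_neg = -neg_equal_one * np.log(1.0 - prob_map + eps)
    cls_neg = cls_neg.sum() / (neg_equal_one_sum + eps)

    # Regression loss: only positive anchors contribute, and it is computed on the
    # deltas (targets), not directly on the boxes.  (Plain L2 here just to keep
    # the sketch short; the paper uses Smooth-L1.)
    reg = pos_equal_one_reg * (reg_map - targets) ** 2
    reg = reg.sum() / (pos_equal_one_sum + eps)

    return alpha * cls_pos + beta * cls_neg + reg
```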
I hope that it answers your question ;).
Also, for `LIDAR_COORD`, it is a shift used to place the pointclouds in the coordinate system we work in (for a better view of the cars or the pedestrians, I guess); it is not a shift from the camera to the lidar. You will find the methods that actually do the camera-to-lidar transform in the utils script.
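If it helps, such a shift is usually just a translation added to every point before cropping/voxelizing; the value below is purely illustrative, and the actual sign and axes used by `cfg.LIDAR_COORD` in this repo may differ:

```python
import numpy as np

# Hypothetical value: a translation added to every point before voxelization
LIDAR_COORD = np.array([0.0, 40.0, 3.0])   # illustrative, not the repo's actual value

def shift_to_working_frame(points):
    """points: [N, 4] array of (x, y, z, reflectance) in lidar coordinates."""
    shifted = points.copy()
    shifted[:, :3] += LIDAR_COORD           # move the cloud into the working grid's frame
    return shifted
```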
Hi @steph1793 ,
thanks a lot for your very comprehensive reply! I'm trying to apply your library to another dataset. With your explanation and the voxelnet paper, I'm now much clearer about these variables and their roles in your code. One more question about `MATRIX_P2`, `MATRIX_T_VELO_2_CAM` and `MATRIX_R_RECT_0` in config.py: I am not sure what they are used for. If I want to use another dataset, can I just use the `P2`, `T_VELO_2_CAM` and `R_RECT_0` read from each calibration file instead of these three from config.py?
Hi @steph1793 ,
I have one more question about the rotation angle conversion in the definition of `camera_to_lidar_box`:

```python
(x, y, z), h, w, l, rz = camera_to_lidar(x, y, z, T_VELO_2_CAM, R_RECT_0), h, w, l, -ry - np.pi / 2
```

It seems the rotation `ry` in the camera coordinate system is directly converted to `rz` in the lidar coordinate system by `-ry - np.pi / 2`; shouldn't it go through the cam-to-velo transform matrix? Many thanks!
Hi, thanks a lot for sharing your nice Voxelnet code! I am not so clear about several properties of the dataset, could you explain them a little bit please? These 6 properties in the dataset generation (data.py): `pos_equal_one`, `neg_equal_one`, `targets`, `pos_equal_one_reg`, `pos_equal_one_sum`, `neg_equal_one_sum`. Many thanks in advance!