pxiangwu / MotionNet

CVPR 2020, "MotionNet: Joint Perception and Motion Prediction for Autonomous Driving Based on Bird's Eye View Maps"

Some questions about the MotionNet post-processing #3

Closed muzi2045 closed 4 years ago

muzi2045 commented 4 years ago

First, thanks for your work! There are a few places in the post-processing code that confuse me:

# We only show the cells having one-hot category vectors
max_prob = np.amax(pixel_cat_map_gt, axis=-1)
filter_mask = max_prob == 1.0
pixel_cat_map = np.argmax(pixel_cat_map_gt, axis=-1) + 1  # category starts from 1 (background), etc.
pixel_cat_map = (pixel_cat_map * non_empty_map * filter_mask).astype(int)

cat_pred = np.argmax(cat_pred, axis=0) + 1
cat_pred = (cat_pred * non_empty_map * filter_mask).astype(int)

The tensor cat_pred output by MotionNet looks like the per-pixel category of the lidar BEV map, but what does the filter_mask mean in this part? cat_pred = (cat_pred * non_empty_map).astype(int) also outputs a normal-looking result.

The filter_mask tensor depends on the pixel_cat_map_gt values, but if I test MotionNet on my own lidar data, I have no GT-box annotations, so filter_mask may not be computable. I am not sure whether my understanding is correct. Hoping for a reply! @pxiangwu

pxiangwu commented 4 years ago

Hi, the filter_mask is used to retrieve the pixels whose category vectors are one-hot, like [0, 0, 0, 0, 1]. That is, we only want the one-hot category vectors. The reason is that, as a result of the data preprocessing, not all cells have one-hot category vectors. Specifically, it can happen that points from different object categories fall within the same grid cell. In this case, the category vector associated with that cell will look like [0.1, 0, 0, 0.5, 0.4]. So in the experiments I did not consider such cells and simply ignored them.
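
As a toy illustration (hypothetical arrays, just to show the masking behavior):

import numpy as np

# Hypothetical 2x2 BEV grid with 5 category channels on the last axis
pixel_cat_map_gt = np.array([
    [[0, 0, 0, 0, 1], [1, 0, 0, 0, 0]],       # two one-hot cells
    [[0.1, 0, 0, 0.5, 0.4], [0, 1, 0, 0, 0]]  # one mixed cell, one one-hot cell
])

max_prob = np.amax(pixel_cat_map_gt, axis=-1)
filter_mask = max_prob == 1.0  # True only where the category vector is one-hot
print(filter_mask)
# [[ True  True]
#  [False  True]]  -> the mixed cell at (1, 0) is ignored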

To generate filter_mask, yes, we need GT-box annotations. This comes naturally: if we want to know the displacement of a cell into the future, we need to know which object the cell belongs to and the future position of that object. That is why we need the box annotations. Actually, most recent public datasets, such as Argoverse, nuScenes, Lyft, etc., come with such box annotations. However, for your own lidar data, if box annotations are unavailable but you still have a way to obtain the future displacement of cells, you may not need the filter mask or the pre-processing code, and could directly extract the cell displacements.

Hope this helps. Feel free to contact me if you have further questions.

muzi2045 commented 4 years ago

Thanks for the reply! That means if I want to run inference on raw lidar data, I can only classify cells based on the predicted category vectors. I tried it on my own data bag: there are some noise pixels, and it requires really high-quality localization between the lidar and the map to align the past 5 frames to the newest sweep. For example (screenshot attached): the noise pixels produce abnormal velocity values when there are no GT-box annotations to filter them out.
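
For reference, a minimal sketch of that alignment step (assuming each frame comes with a 4x4 lidar-to-world pose; poses and clouds below are placeholder names):

import numpy as np

def align_to_latest(clouds, poses):
    # Transform each past point cloud into the coordinate frame of the newest sweep
    world_to_latest = np.linalg.inv(poses[-1])
    aligned = []
    for cloud, pose in zip(clouds, poses):
        pts = np.hstack([cloud, np.ones((cloud.shape[0], 1))])  # (N, 3) -> (N, 4) homogeneous
        pts = pts @ (world_to_latest @ pose).T  # past frame -> world -> latest frame
        aligned.append(pts[:, :3])
    return aligned

Any error in the poses shows up directly as fake motion in the BEV map, which is exactly the kind of noise described above.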

pxiangwu commented 4 years ago

Hi, actually the GT-box annotations are used in two places:

  1. During training, to generate the ground-truth displacement vectors for each cell and to provide the cell category information;
  2. During performance evaluation, to provide the GT displacements and cell categories.

Thus, if you do not conduct performance evaluation (e.g., compute the mean error) and just run the pretrained model, you may not need the GT annotations.
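
As a rough sketch (not the exact repo code), such GT-free post-processing could simply mask the network outputs by the non-empty BEV cells:

import numpy as np

def postprocess_no_gt(cat_pred, motion_pred, non_empty_map):
    # cat_pred: (C, H, W) class scores; motion_pred: (H, W, 2) displacements;
    # non_empty_map: (H, W) binary occupancy of the current sweep
    cat_map = np.argmax(cat_pred, axis=0) + 1        # categories start from 1
    cat_map = (cat_map * non_empty_map).astype(int)  # keep occupied cells only
    motion_map = motion_pred * non_empty_map[..., None]
    return cat_map, motion_map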

There are several possible reasons for the noise pixels in the above picture:

  1. There is a domain gap between your own data and the nuScenes data. For example, the nuScenes lidar is 32-beam (very sparse!). One natural approach to closing this gap is to train a new model on your own data.
  2. A lack of high-quality localization to align the input frames well. Good ego-motion compensation is very important. Actually, one piece of future work is: can we still achieve good performance under flawed/weak ego-compensation?
  3. MotionNet still has room for improvement. This work is a first attempt at the problem, and it could be made better to further reduce the prediction error. This might be why you observed the noise pixels.
  4. MotionNet was trained on 500 scenes, which is not a very large dataset. In industry, to achieve very good performance, models are typically trained on extremely large datasets (e.g., 10,000 scenes).

Hope this helps.