ouenal / scribblekitti

Scribble-Supervised LiDAR Semantic Segmentation, CVPR 2022 (ORAL)
https://ouenal.github.io/scribblekitti/

Segmentation Performance with Partially Annotated Data #1

Closed ldkong1205 closed 2 years ago

ldkong1205 commented 2 years ago

Thank you for open-sourcing your annotated data and code!

Regarding Table 2 in your paper, I have a question about the segmentation performance of Cylinder3D and SparseConv-UNet (Ref [18] in your reference).

The results under the 10% frame split are 46.8% for Cylinder3D and 43.9% for SparseConv-UNet. I have recently run experiments on Cylinder3D with the same number of labeled training frames (1913 out of 19130) and obtained much higher results (55%+). I am using the latest version of Cylinder3D from here. I use the exact configurations provided by the authors, except for init_size, which I changed from 32 to 16. I would like to know how exactly you implemented this and what the potential cause of such a huge performance difference might be. Thanks!

ouenal commented 2 years ago

That is a huge difference indeed. How did you select the 10% of labeled frames? What we've done is train on the first 10% of each sequence. There should be a big difference if you instead sample 10% uniformly from the entire dataset (e.g. selecting the first of every 10 consecutive frames). Remember, the idea is to simulate equal annotation times. Labeling is done on concatenated point clouds, which means annotating an entire sequence isn't all that different from annotating the same sequence captured by a sensor with 1/10th of the frequency.

Also, to help with some of your future experiments: the Cylinder3D in this repository does lag behind the original implementation for some reason. I haven't been able to figure this out, since the model is taken directly from the original repository. Even when fully trained it converges to roughly 1% behind the original.
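
For concreteness, a minimal sketch of one way to form such a first-10%-per-sequence split (a hypothetical helper, not code from this repository):

```python
# Minimal sketch, not from the repository: select the first 10% of frames
# of every SemanticKITTI training sequence.
import os

TRAIN_SEQUENCES = ["00", "01", "02", "03", "04", "05", "06", "07", "09", "10"]

def first_fraction_split(dataset_root, fraction=0.1):
    """Return (sequence, scan_file) pairs covering the first `fraction` of each sequence."""
    selected = []
    for seq in TRAIN_SEQUENCES:
        scan_dir = os.path.join(dataset_root, "sequences", seq, "velodyne")
        scans = sorted(os.listdir(scan_dir))
        n_keep = int(len(scans) * fraction)
        selected.extend((seq, scan) for scan in scans[:n_keep])
    return selected
```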

ldkong1205 commented 2 years ago

> That is a huge difference indeed. How did you select the 10% of labeled frames? What we've done is train on the first 10% of each sequence. There should be a big difference if you instead sample 10% uniformly from the entire dataset (e.g. selecting the first of every 10 consecutive frames). Remember, the idea is to simulate equal annotation times. Labeling is done on concatenated point clouds, which means annotating an entire sequence isn't all that different from annotating the same sequence captured by a sensor with 1/10th of the frequency.
>
> Also, to help with some of your future experiments: the Cylinder3D in this repository does lag behind the original implementation for some reason. I haven't been able to figure this out, since the model is taken directly from the original repository. Even when fully trained it converges to roughly 1% behind the original.

Thank you for your reply. Now it makes sense, because I uniformly sampled 10% of the training data instead of just selecting the first 10%. I need to point out that selecting the first x% of the data might hurt diversity, since those frames are likely to come from the same or subsequent scenes. As for the sampling strategy, in my opinion, uniformly selecting samples from the whole training set seems more in line with the practical situation.
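
For reference, a minimal sketch of one way to form such a uniform split (a hypothetical helper, not code from either repository), keeping every 10th frame of each training sequence:

```python
# Minimal sketch, not from either codebase: keep every 10th frame of each
# SemanticKITTI training sequence (roughly 1913 of the 19130 training frames).
import os

TRAIN_SEQUENCES = ["00", "01", "02", "03", "04", "05", "06", "07", "09", "10"]

def uniform_split(dataset_root, keep_every=10):
    """Return (sequence, scan_file) pairs for every `keep_every`-th frame."""
    selected = []
    for seq in TRAIN_SEQUENCES:
        scan_dir = os.path.join(dataset_root, "sequences", seq, "velodyne")
        scans = sorted(os.listdir(scan_dir))
        selected.extend((seq, scan) for i, scan in enumerate(scans) if i % keep_every == 0)
    return selected
```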

I would also like to point out that since you are annotating the whole dataset via scribble labels, diversity is maintained. That is why your results are much higher than those of the first-x% strategy. But is this comparison really fair? Looking forward to your opinion. Thanks!

ouenal commented 2 years ago

While asking these questions, I think you've made the exact point we were trying to convey in our paper. Weak labels with lots of diversity (i.e. more data) are much better than full labels with less. I would highly suggest checking Xu et al.'s CVPR 2020 paper here, where they run experiments on this very subject and mathematically show why uniform sampling is expected to converge to much closer performance.

As for my opinion, I think selecting 10% of the labels as we've done is justified. Remember, it all boils down to fixing the labeling budget (check existing work on semi-supervised PCSS here, where the same strategy was chosen). Labeling an entire sequence with 10% uniformly sampled frames is not that different from labeling the entire sequence, since when labeling you concatenate all point clouds. You end up doing the same work, just on less dense tiles (see our supplementary material for the LiDAR point cloud labeling interface). In fact, losing density may cost even more time during object recognition, i.e. determining which label to choose.

ldkong1205 commented 2 years ago

> While asking these questions, I think you've made the exact point we were trying to convey in our paper. Weak labels with lots of diversity (i.e. more data) are much better than full labels with less. I would highly suggest checking Xu et al.'s CVPR 2020 paper here, where they run experiments on this very subject and mathematically show why uniform sampling is expected to converge to much closer performance.
>
> As for my opinion, I think selecting 10% of the labels as we've done is justified. Remember, it all boils down to fixing the labeling budget (check existing work on semi-supervised PCSS here, where the same strategy was chosen). Labeling an entire sequence with 10% uniformly sampled frames is not that different from labeling the entire sequence, since when labeling you concatenate all point clouds. You end up doing the same work, just on less dense tiles (see our supplementary material for the LiDAR point cloud labeling interface). In fact, losing density may cost even more time during object recognition, i.e. determining which label to choose.

Thank you for your detailed clarification. I totally agree with your statement that scribble-annotating more samples is a better option than densely annotating only a subset of the scans.

My point here is about comparison fairness. I think it might be better to report both results, i.e., the results from the first 10% and those from the uniformly sampled 10% of the data, and add some discussion. The current conclusion in your paper, quoted as "As seen, both models perform significantly better using scribble annotations compared to having full annotations on 10% of the train-set by up to +10.2% and +11.1% mIoU", could thus become more accurate. Judging from the results on the uniformly sampled 10% of the data, the performance gap is not that huge.

I am not familiar with the exact point cloud annotation process, but from my personal point of view, obtaining LiDAR scans from multiple scenes is a relatively easy operation. Thus, picking certain samples from each collected scene seems reasonable to me. That's why I am using the uniform selection strategy.

Thanks again for your reply and your idea!