nickgkan / butd_detr

Code for the ECCV22 paper "Bottom Up Top Down Detection Transformers for Language Grounding in Images and Point Clouds"

Discussion about experimental evaluation in Table 1 #9

Closed Hiusam closed 1 year ago

Hiusam commented 1 year ago

Hi there,

Firstly, I want to say that your work is great. Good job!

I noticed in Table 1 of your paper that you evaluated your method against previous works using both ReferIt3D and ScanRefer benchmarks. I have a couple of thoughts on this.

Regarding the ReferIt3D benchmark, I don't think it's necessary to re-train the previous works with the Group-Free 3D detector. Instead, using ground truth (GT) boxes for all methods is sufficient to demonstrate the superiority of your approach.

As for the ScanRefer benchmark, it seems that you directly used the reported results from previous works, which may not be entirely fair since you used additional bounding boxes obtained by the Group-Free 3D detector. It might be better to retrain the prior methods with these additional bounding boxes to make the comparison more equitable.

What do you think about these points?

Thanks and best regards.

ayushjain1144 commented 1 year ago

Hi Hiusam,

Regarding ReferIt3D: While the comparison with GT boxes could have been sufficient, we purposefully evaluated in the Det setup, which does not assume access to ground-truth bounding boxes. This is a more realistic setup, since we usually don't have access to GT boxes in the real world. Additionally, approaches that just score bounding boxes are box-bottlenecked: if their object detector fails to detect an object in the first place, the whole pipeline is bound to fail. This limitation naturally wouldn't reveal itself in a setup that assumes ground-truth boxes.

ScanRefer: For SAT2D, we do use Group-Free boxes, as they only provide results with ground-truth proposals (the + sign is missing from the table). 3DVG-Transformer predicts the boxes as part of its model and, I think, cannot use an external object detector (at least not trivially). InstanceRefer uses panoptic segmentation. For FFL-3DOG, you are right that we could have used Group-Free boxes instead of the VoteNet boxes they use. However, from Table 2 you would notice that BUTD-DETR without the box stream (i.e., without any external detector) is only 1% worse than the full model that uses the box stream. Hence, I would expect our model to significantly outperform FFL-3DOG irrespective of the detector.

Let us know if you have any questions!