nickgkan / butd_detr

Code for the ECCV22 paper "Bottom Up Top Down Detection Transformers for Language Grounding in Images and Point Clouds"

How to implement referit3d on scanrefer benchmark? #45

Closed RunsenXu closed 9 months ago

RunsenXu commented 9 months ago

Dear Authors,

Thank you for your great work!

I noticed that you reported a result for ReferIt3DNet on ScanRefer, but ReferIt3DNet takes object point clouds (GT boxes) as input, and there are no GT boxes available for the ScanRefer benchmark. How did you handle this?

My conjecture is that for the ScanRefer benchmark, you used the Group-Free detector to detect boxes first and then fed those boxes to ReferIt3DNet without the scene point cloud. Is that correct?

Best, Runsen

nickgkan commented 9 months ago

Hi Runsen, thanks for your interest in our work! ReferIt3D reports results on ScanRefer, so we copied the numbers from their paper.

RunsenXu commented 9 months ago

Dear author,

Thank you so much for your quick reply. Do you mean Table 4 in the ReferIt3D paper? But I think those results are for the ScanRefer network on the ScanRefer dataset, with and without Sr3D data.


And the ReferIt3D paper ("This demonstrates the contribution of adding a synthetically generated dataset to a human one. We get a similar outcome when combining Sr3D to the ScanRefer [18] data (see Table 4). We performed this experiment following the implementation in [17].") further confirms that it is the ScanRefer network, not the ReferIt3DNet network.

Am I missing something?

Best, Runsen

nickgkan commented 9 months ago

Hi Runsen,

To be honest, it was not clear to me from their text which model they use. What you're saying makes sense, but the results they report were not consistent with ScanRefer's. After checking ScanRefer's previous versions on arXiv, it seems that the ScanRefer variant "(xyz+rgb+lobjcls)" from their very first submission (https://arxiv.org/pdf/1912.08830v1.pdf) is what ReferIt3D reports. The ScanRefer authors appear to have updated their arXiv results more than once, which is how this confusion arose.

In any case, if we were to reproduce the ScanRefer results of ReferIt3D, we would use the same pipeline we used for the detected-box results on Sr3D and Nr3D: run Group-Free, save the boxes in the format ReferIt3D expects, and then use those boxes with their code. That said, I would expect the results to be very poor.
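For readers who want to try this pipeline themselves, the middle step (turning detected boxes into the per-object point clouds a listener like ReferIt3DNet consumes) can be sketched roughly as below. This is a minimal illustration, not the repo's actual code: the function names are made up, axis-aligned boxes are assumed (as Group-Free predicts on ScanNet), and ReferIt3D's real on-disk format differs in its details.

```python
import numpy as np

def crop_points_in_box(points, center, size):
    """Return the scene points that fall inside an axis-aligned box.

    points: (N, 3) array; center, size: length-3 sequences giving the
    box center and full extents. Axis-aligned boxes are an assumption
    made for this sketch.
    """
    half = np.asarray(size, dtype=float) / 2.0
    lo = np.asarray(center, dtype=float) - half
    hi = np.asarray(center, dtype=float) + half
    mask = np.all((points >= lo) & (points <= hi), axis=1)
    return points[mask]

def boxes_to_object_clouds(points, boxes, n_samples=1024, seed=0):
    """Convert detected boxes into fixed-size per-object point clouds,
    replacing the GT object clouds ReferIt3DNet normally receives.
    Sampling with replacement pads crops smaller than n_samples."""
    rng = np.random.default_rng(seed)
    clouds = []
    for center, size in boxes:
        crop = crop_points_in_box(points, center, size)
        if len(crop) == 0:  # empty detection: fall back to the box center
            crop = np.asarray(center, dtype=float)[None, :]
        idx = rng.choice(len(crop), size=n_samples, replace=True)
        clouds.append(crop[idx])
    return np.stack(clouds)  # (num_boxes, n_samples, 3)
```

The point of the fixed `n_samples` is that listener networks expect a constant number of points per object, regardless of how many scene points each detected box actually contains.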

RunsenXu commented 9 months ago

Oh, I see. And I think the ScanRefer results in Table 1 are somewhat unfair as a comparison: you also copied TGNN's and SAT's results, but TGNN (and ScanRefer) did not use a pre-trained detector to obtain boxes, while SAT did. What do you think? I am not doubting the superiority of your method, of course.

ayushjain1144 commented 9 months ago

I think TGNN starts with segmentation masks, not bounding boxes, so it is non-trivial to supply bounding boxes to it. Also, the critical point is that the "pretrained" detector is trained only on ScanNet and does not use any additional data, so using it versus not using it does not really introduce much unfairness, I think. (SAT uses 2D pre-trained object detectors; we do not use any 2D data or pre-training.)

RunsenXu commented 9 months ago

Yeah, that makes sense.