nickgkan / butd_detr

Code for the ECCV22 paper "Bottom Up Top Down Detection Transformers for Language Grounding in Images and Point Clouds"

Training on 3D referential datasets, evaluating on 3D detection datasets #42

Closed: ZCMax closed this issue 11 months ago

ZCMax commented 11 months ago

In Table 10, are methods such as "BUTD-DETR trained on SR3D/NR3D/ScanRefer" trained only on referential data, or on referential data plus detection prompts? If the latter: if we train BUTD-DETR only on referential data, without detection prompts, can the trained model be evaluated directly on a 3D detection benchmark by replacing the visual grounding utterance with a detection prompt?

ZCMax commented 11 months ago

Another question: suppose we want to detect the tv in a scene. I find that the extracted text feature of "tv" differs between these two utterances: "tv. cabinet. monitor." and "tv. book. table." Should the detection prompt be the same during training and evaluation?

ayushjain1144 commented 11 months ago

In Table 10, are methods such as "BUTD-DETR trained on SR3D/NR3D/ScanRefer" trained only on referential data, or on referential data plus detection prompts?: The latter.

If we train BUTD-DETR only on referential data, without detection prompts, can the trained model be evaluated directly on a 3D detection benchmark by replacing the visual grounding utterance with a detection prompt?: I think a detection prompt would be quite out of distribution for a model trained only on visual grounding, so I don't think it would work.

The extracted text feature of "tv" differs between "tv. cabinet. monitor." and "tv. book. table."; should the detection prompt be the same during training and evaluation?: During training, I think we randomly shuffle the detection prompts. So even if the "tv" feature is different, I would expect it to still work, since the queries can cross-attend to the text and adapt to it.
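
To make the shuffling idea concrete, here is a minimal sketch of how a detection prompt in the "tv. cabinet. monitor." format could be built with a random class order. It is not taken from the repository: the function name `build_detection_prompt`, its arguments, and the returned character spans are hypothetical, chosen only for illustration.

```python
import random


def build_detection_prompt(class_names, rng=None):
    """Hypothetical helper: join class names in random order as "a. b. c."

    Returns the prompt string and a dict mapping each class name to its
    (start, end) character span inside the prompt.
    """
    rng = rng or random.Random()
    names = list(class_names)  # copy so the caller's list is not reordered
    rng.shuffle(names)

    prompt = ""
    spans = {}
    for name in names:
        start = len(prompt)
        prompt += name
        spans[name] = (start, start + len(name))
        prompt += ". "
    # Drop the trailing space; the spans are unaffected.
    return prompt.strip(), spans


# Each call can place "tv" at a different position in the prompt.
prompt, spans = build_detection_prompt(
    ["tv", "cabinet", "monitor"], random.Random(0)
)
start, end = spans["tv"]
print(prompt, "->", prompt[start:end])
```

Because the order changes from sample to sample, the contextual text feature of a given class name also changes, which is the variation the model would see during training under this assumption.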

ZCMax commented 11 months ago

Thanks so much for your reply~