Closed · ZCMax closed this issue 11 months ago
In Table 10, I want to know whether a method like "BUTD-DETR trained on SR3D/NR3D/ScanRefer" is trained only on referential data, or on referential data plus detection prompts?
The latter.
If the latter, I wonder: if we train BUTD-DETR only on referential data, without detection prompts, can the trained model be directly evaluated on a 3D detection benchmark by replacing the visual grounding utterance with detection prompts?
I think the detection prompt would be quite out of distribution if you only train for visual grounding, so I don't think it would work.
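For clarity, by "detection prompt" I mean the benchmark class names joined into a single utterance. A rough sketch of the idea (the class list below is only an illustrative subset, not the actual benchmark list or the repo code):

```python
# Rough sketch: a detection prompt is just the benchmark class names
# concatenated into one utterance.
benchmark_classes = ["tv", "cabinet", "monitor", "book", "table"]  # illustrative subset

detection_prompt = ". ".join(benchmark_classes) + "."
print(detection_prompt)  # "tv. cabinet. monitor. book. table."

# At evaluation, this prompt would replace the grounding utterance, and every
# class name mentioned in it is treated as a target to localize.
```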
Another question: if we want to detect the TV in a scene, I find that the extracted text feature of "tv" differs between these two utterances, "tv. cabinet. monitor." and "tv. book. table.". Should the detection prompt used during evaluation be the same as the one used during training?
During training, I think we train by randomly shuffling the detection prompts. So even if the "tv" feature is different, I would expect it to still work, since the queries can cross-attend to the text and adapt based on it.
Thanks so much for your reply~