hhaAndroid opened 7 months ago
Thanks for your interest in our work.
I think the main reason is that we construct some negative queries when building the visual grounding training data, as described in the last paragraph of Sec. 3.2.
We also construct Image-centric Grounding samples (Sec. 3.3), so the model learns all objects in an image described by sentences simultaneously, which could also improve performance.
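To make the negative-query idea concrete, here is a minimal, hypothetical sketch (the phrase lists and separator format are illustrative, not the paper's actual pipeline): phrases describing objects present in the image are mixed with sampled phrases describing absent objects, and all are concatenated into one detection-style text prompt.

```python
import random

# Hypothetical data: phrases for objects actually in the image (positives)
# and a pool of candidate phrases for objects NOT in the image (negatives).
positive_phrases = ["a red car", "a dog on the grass"]
phrase_pool = ["a blue umbrella", "a cat", "a traffic light"]

random.seed(0)
negative_phrases = random.sample(phrase_pool, k=2)

# Each phrase acts as one "class" in the prompt; the model should output
# boxes only for positive phrases and low scores for negative ones.
prompt = " . ".join(positive_phrases + negative_phrases) + " ."
```

In this framing, the negatives are not noise to be averaged away: they are extra classes whose ground-truth box set is empty, which is exactly the supervision signal the replies below describe.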
By mentioning "reject" and "negative", do you mean that techniques like contrastive learning are used?
If not, then I am a bit confused. Intuitively, concatenating the positive language queries (describing objects in the image) with negative ones (describing objects that don't exist) and then letting them interact with the visual features seems like introducing noise into the features, right?
Without a contrastive loss or other explicit mechanism, how could the model learn to reject irrelevant prompts and achieve higher performance? Please correct me if I am misunderstanding.
We believe the model learns to denoise because we use the noisy tokens for fusion while supervising the outputs with ground-truth signals.
Since we formulate grounding as detection, all prompts can be treated as object classes. When the model is trained in the detection manner, it learns to predict low scores for the negative classes.
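The point above can be sketched as a tiny NumPy example. All shapes and the use of binary cross-entropy are simplifying assumptions (DINO-style detectors typically use a focal loss); the key idea is that negative prompts are simply classes whose targets are always zero, so the standard detection loss already pushes their scores down:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bce_with_logits(logits, targets):
    # Per-class binary cross-entropy, as in detection-style classification.
    p = sigmoid(logits)
    return -(targets * np.log(p) + (1 - targets) * np.log(1 - p)).mean()

# Hypothetical setup: 2 predicted queries, 4 prompt "classes".
# Prompts 0-1 are positive phrases present in the image; prompts 2-3 are
# negative phrases describing objects that are not in the image.
rng = np.random.default_rng(0)
logits = rng.normal(size=(2, 4))  # query-to-prompt similarity scores

# Ground-truth matching assigns query 0 to prompt 0 and query 1 to prompt 1.
# Negative prompts are never matched, so their targets remain 0 everywhere.
targets = np.zeros((2, 4))
targets[0, 0] = 1.0
targets[1, 1] = 1.0

loss = bce_with_logits(logits, targets)
```

Minimizing this loss drives the sigmoid scores for the negative-prompt columns toward 0, which is the "learn to reject" behavior, without any separate contrastive objective.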
This is fantastic work, and I have a question: why has performance on the d3 dataset improved so much? The improvements on the other datasets seem relatively reasonable. I look forward to your response.