hhaAndroid opened 7 months ago
Thanks for your interest in our work.
I think the main reason is that we construct some negative queries when building the visual grounding training data, as described in the last paragraph of Sec. 3.2.
We also construct Image-centric Grounding samples (Sec. 3.3), so the model learns all objects in an image described by sentences simultaneously, which could also improve performance.
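To make the negative-query idea concrete, here is a minimal, hypothetical sketch (the phrase lists and separator format are illustrative, not the paper's actual pipeline): phrases describing objects present in the image are mixed with sampled phrases describing absent objects, and all are concatenated into one detection-style text prompt.

```python
import random

# Hypothetical data: phrases for objects actually in the image (positives)
# and a pool of candidate phrases for objects NOT in the image (negatives).
positive_phrases = ["a red car", "a dog on the grass"]
phrase_pool = ["a blue umbrella", "a cat", "a traffic light"]

random.seed(0)
negative_phrases = random.sample(phrase_pool, k=2)

# Each phrase acts as one "class" in the prompt; the model should output
# boxes only for positive phrases and low scores for negative ones.
prompt = " . ".join(positive_phrases + negative_phrases) + " ."
```

In this framing, the negatives are not noise to be averaged away: they are extra classes whose ground-truth box set is empty, which is exactly the supervision signal the replies below describe.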
By mentioning "reject" and "negative", do you mean that techniques like contrastive learning are used?
If not, then I am a bit confused. Intuitively, concatenating the positive language queries (describing objects in the image) with negative ones (describing objects that don't exist) and then letting them interact with the visual features seems like introducing noise into the features, right?
Without a contrastive loss or other explicit mechanism, how could the model learn to reject irrelevant prompts and achieve higher performance? Please correct me if I am misunderstanding.
We believe the model learns to denoise because we use the noisy tokens for fusion while supervising the outputs with ground-truth signals.
Since we formulate grounding as detection, all prompts can be treated as object classes. When the model is trained in the detection manner, it learns to predict low scores for the negative classes.
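The point above can be sketched as a tiny NumPy example. All shapes and the use of binary cross-entropy are simplifying assumptions (DINO-style detectors typically use a focal loss); the key idea is that negative prompts are simply classes whose targets are always zero, so the standard detection loss already pushes their scores down:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bce_with_logits(logits, targets):
    # Per-class binary cross-entropy, as in detection-style classification.
    p = sigmoid(logits)
    return -(targets * np.log(p) + (1 - targets) * np.log(1 - p)).mean()

# Hypothetical setup: 2 predicted queries, 4 prompt "classes".
# Prompts 0-1 are positive phrases present in the image; prompts 2-3 are
# negative phrases describing objects that are not in the image.
rng = np.random.default_rng(0)
logits = rng.normal(size=(2, 4))  # query-to-prompt similarity scores

# Ground-truth matching assigns query 0 to prompt 0 and query 1 to prompt 1.
# Negative prompts are never matched, so their targets remain 0 everywhere.
targets = np.zeros((2, 4))
targets[0, 0] = 1.0
targets[1, 1] = 1.0

loss = bce_with_logits(logits, targets)
```

Minimizing this loss drives the sigmoid scores for the negative-prompt columns toward 0, which is the "learn to reject" behavior, without any separate contrastive objective.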
This is fantastic work, and I have a question: why has performance on the d3 dataset improved so much? The improvements on the other datasets seem relatively reasonable. I look forward to your response.