shikras / d-cube

A detection/segmentation dataset with labels characterized by intricate and flexible expressions. "Described Object Detection: Liberating Object Detection with Flexible Expressions" (NeurIPS 2023).
https://arxiv.org/abs/2307.12813

GLIP #12

Open twangnh opened 5 months ago

twangnh commented 5 months ago

Thanks for sharing the wonderful work. The paper differentiates GLIP from GroundingDINO and FIBER: the former is classified as open-vocabulary object detection, while the latter two are called bi-functional models (detection and referring expression comprehension). Since GLIP can also be used for DOD (e.g., in the OmniLabel paper), could you please give more discussion on this?

Charles-Xie commented 4 months ago

Hi,

Thanks for your interest in our work. The problem you raised is insightful and well worth discussing. In our paper, we mainly discuss detection, REC (a representative grounding task) and their conjunction, DOD. We group existing methods by the tasks they are evaluated on in their papers. As GLIP is tested on detection and Phrase Grounding (also a grounding task), we do not include it as a bi-functional method, for the sake of academic rigor.

However, from the broader view of the conjunction between detection and grounding, GLIP is surely one of the representative and pioneering works that pave the way for the DOD task. If we look beyond strict task forms and take DOD as the union of general detection and grounding, I think methods like GLIP, MDETR, FIBER and G-DINO are aimed at the same goal and all have the potential for DOD. GLIP has also been evaluated on D3, and its performance (19+ intra-full-mAP) is very close to more recent works like G-DINO, even with its limited model size and data resources. We would be very happy to discuss this further in a new version of the paper.
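To make the "potential for DOD" point concrete, here is a minimal Python sketch of repurposing a GLIP-like grounding model for a DOD-style query. The `model.predict(image, text)` interface is a hypothetical stand-in, not GLIP's actual API; the key behavioral difference from REC is that the model must be allowed to return no boxes for a negative description:

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def detect_described_objects(model, image, description: str,
                             score_threshold: float = 0.3) -> List[Box]:
    """Apply a grounding-style model to one free-form description.

    `model.predict` is assumed (hypothetically) to return candidate
    boxes with confidence scores for the given text prompt.
    """
    boxes, scores = model.predict(image, description)
    # Unlike REC, a DOD description may match zero instances, so an
    # empty result (all scores below threshold) is a valid answer.
    return [box for box, score in zip(boxes, scores)
            if score >= score_threshold]
```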

Best, Chi

twangnh commented 4 months ago

@Charles-Xie Hi Chi, thanks for the reply. I'm wondering what you think about the similarities and differences between D3 and the OmniLabel dataset ("OmniLabel: A Challenging Benchmark for Language-Based Object Detection")?

Charles-Xie commented 4 months ago

@twangnh Thanks for the interesting question.

OmniLabel is a great work, and I'm happy to see two works with similar motivations appear within a short time, which suggests this direction is promising and acknowledged by at least some researchers in the community.

I will try to answer this below as a discussion; the following is only my personal opinion. If I understand OmniLabel correctly, both datasets can be regarded as datasets for Described Object Detection (or, language-based object detection) in some sense. They both provide images with positive descriptions (associated with boxes in an image) and negative descriptions (associated with no boxes in an image).
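To make that shared structure concrete, here is a hypothetical annotation record; the field names are illustrative only, not the actual schema of either dataset:

```python
# One image paired with descriptions; a description with an empty box
# list is a negative for this image. Field names are hypothetical.
sample = {
    "image": "000123.jpg",
    "descriptions": [
        {"text": "a dog chasing a frisbee",
         "boxes": [[48.0, 240.0, 195.0, 371.0]]},  # positive
        {"text": "a dog wearing a red collar",
         "boxes": []},                             # negative
    ],
}
```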

The differences are also significant. For OmniLabel, the annotators design positive and negative descriptions based on each image. The annotation style is closer to REC datasets, but with negative descriptions added. As a result, the description categories in this dataset outnumber those in $D^3$ and are more diverse. However, a description may or may not apply to another image, so the annotation is only complete at the image level, not at the dataset level. This also results in fewer annotated negative instances.

For $D^3$, we design the descriptions for the whole dataset first, and then the annotators label them on all images as positive or negative. The annotation style is closer to detection datasets. The description categories are not as numerous as in OmniLabel, but each category is completely labeled across the whole dataset, like a standard detection dataset. This gives a model more negative instances to distinguish, which can be challenging. Other differences may also exist, but I think the above covers the most important ones.
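The image-level vs. dataset-level completeness difference can be sketched as follows; this is a toy example under assumed data structures, not either dataset's real format:

```python
# OmniLabel-style annotation: descriptions are authored per image, so
# each description is only verified on the image it was written for.
image_level = {
    "img_a.jpg": {"a man in a blue shirt": [[10, 20, 110, 220]]},
    "img_b.jpg": {"a man holding an umbrella": [[30, 40, 140, 260]]},
}

# D3-style annotation: a fixed description set is labeled on *every*
# image, so each (image, description) pair gets an explicit answer and
# empty lists are verified negatives the model must learn to reject.
description_set = ["a man in a blue shirt", "a man holding an umbrella"]
dataset_level = {
    "img_a.jpg": {
        "a man in a blue shirt": [[10, 20, 110, 220]],
        "a man holding an umbrella": [],  # checked, confirmed absent
    },
    "img_b.jpg": {
        "a man in a blue shirt": [],      # checked, confirmed absent
        "a man holding an umbrella": [[30, 40, 140, 260]],
    },
}
```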

Again, this is only my personal opinion. Thanks for asking. We hope to see more methods and datasets in this direction.