Closed · ljj7975 closed this 4 weeks ago
@jianzongwu @xushilin1 See this question.
Thank you for raising this important question. We appreciate the opportunity to clarify.
You are correct that GLIP and GroundingDINO are not included in OVOD (Table 1) due to fundamental differences in task formulation. GLIP and GroundingDINO focus on detecting specific objects referred to by a text query, such as "the red car" or "the largest apple," which requires understanding descriptive attributes provided at inference time. In contrast, OVOD assumes that a set of class names (both base and novel) is provided before inference, and the model searches for objects that belong to any of these predefined classes.
Regarding your point on the claim made in the Visual Grounding Tasks section (Section 3.2), we recognize that the phrase "without the given text information, such as class names" may have been misleading. What we intended to convey is that open-vocabulary object detection tasks do not require user-provided input queries in the form of descriptive text at inference time, as is common in visual grounding tasks. Instead, OVOD utilizes class names, including potential novel classes, in an implicit manner for recognition.
To clarify: visual grounding tasks provide a specific textual phrase that directly describes the object of interest, including both class and unique attributes. OVOD, on the other hand, uses class names without focusing on specific instances described by additional attributes.
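The input-contract difference described above can be sketched in a few lines of Python. This is a purely illustrative sketch, not the API of GLIP, GroundingDINO, or any OVOD model: the function names, signatures, and placeholder return values are all hypothetical, chosen only to contrast the two task formulations.

```python
# Hypothetical sketch of the two task formulations discussed above.
# Nothing here is a real model API; names and return shapes are illustrative.

def visual_grounding(image, query: str) -> dict:
    """Grounding-style inference (GLIP / GroundingDINO):
    the user supplies ONE free-form phrase at inference time that
    describes a specific object, attributes included
    (e.g. "the red car", "the largest apple")."""
    # A real model would return boxes for the referred instance(s);
    # here we just echo the contract with an empty placeholder.
    return {"query": query, "boxes": []}

def open_vocab_detection(image, class_names: list) -> dict:
    """OVOD-style inference: a set of class names (base + novel) is
    fixed BEFORE inference; the model searches the image for ALL
    instances belonging to ANY of these classes, with no descriptive
    attributes attached to a particular instance."""
    # Placeholder: one (empty) box list per candidate class.
    return {name: [] for name in class_names}

# The distinction is the input contract, not the backbone:
grounding_out = visual_grounding(None, "the largest apple")
ovod_out = open_vocab_detection(None, ["car", "apple", "zebra"])
```

In other words, grounding consumes a per-query description of one instance, while OVOD consumes a class vocabulary shared across all images, which is why the two lines of work are tabulated separately.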
Thank you for highlighting this confusion. We will revise the wording in our next draft to ensure it is clearer, particularly by removing the phrase "such as class names" and replacing "without the given text information" with "without user-provided query text" to accurately reflect the difference in input requirements.
What is the reasoning behind GLIP and GroundingDINO not being present in OVOD (Table 1)?
Is it because GLIP/GroundingDINO detect a specific object referred to by the text, while OVOD assumes that the user provides a set of class names (base + novel) prior to inference, and the model searches for the correct one among the candidates?
I think I got even more confused as a reader because of your claim in the Visual Grounding Tasks paragraph of Section 3.2: "Open vocabulary learning tasks require the model to automatically detect, segment and recognize new objects WITHOUT THE GIVEN TEXT INFORMATION, SUCH AS CLASS NAMES (?????), which is more challenging." If your answer to the question above is yes, then this claim is not correct, because OVOD indirectly utilizes class names.
Can you clarify? Thanks.