Open dongxiaolong opened 1 year ago
Position representations? KOSMOS-2 uses additional location tokens while this work does not.
Thank you all for your attention to Shikra. KOSMOS-2 presents impressive work. However, Shikra had already been submitted for peer review before the public release of KOSMOS-2. Additionally, both were uploaded to arXiv nearly on the same day, which is why the comparison was not included in the manuscript. There are numerous differences between the two approaches; in my view, the most crucial are as follows:
Position Representation: The two models differ in how they represent positions. KOSMOS-2 adopts a position representation similar to Pix2seq, OFA, and Unified-IO, incorporating an additional location-token vocabulary. In contrast, Shikra expresses coordinates directly in natural language.
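To make the contrast concrete, here is a minimal sketch (not code from either paper) of how the same normalized bounding box might be serialized under the two schemes. The `<loc_N>` token format and the bin count are assumptions for illustration; the key point is that one scheme needs new vocabulary entries while the other reuses ordinary number text.

```python
def to_location_tokens(box, num_bins=1000):
    """KOSMOS-2/Pix2seq-style sketch: quantize each normalized coordinate
    into one of `num_bins` bins and emit a special token per coordinate.
    The <loc_N> token naming is an assumption for illustration."""
    return "".join(f"<loc_{round(v * (num_bins - 1))}>" for v in box)

def to_natural_language(box, precision=3):
    """Shikra-style sketch: write the coordinates as plain-text numbers,
    which the existing tokenizer can handle without extra vocabulary."""
    return "[" + ",".join(f"{v:.{precision}f}" for v in box) + "]"

# A normalized (x1, y1, x2, y2) box.
box = (0.120, 0.345, 0.678, 0.901)
print(to_location_tokens(box))   # <loc_120><loc_345><loc_677><loc_900>
print(to_natural_language(box))  # [0.120,0.345,0.678,0.901]
```

The practical difference: the token scheme fixes coordinate precision at training time via the bin count, while the natural-language form keeps positions readable and reuses the model's existing numeracy.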
Training Data: KOSMOS-2 leverages spaCy and GLIP to create an extensive pseudo-labeled dataset called GRIT (130M+ bounding boxes). In contrast, Shikra utilizes human-annotated position-related data to learn input-output position understanding, supplemented by a small yet high-quality dataset (Shikra-RD).
Furthermore, the two models differ significantly in model structure, model size, training strategy, instruction format, and so on.
Empowering MLLMs with referring and grounding capabilities holds promising potential. We also look forward to more researchers delving into this new area, bringing forth intriguing work and in-depth analyses.
Could you tell me how your work differs from KOSMOS-2? I haven't seen a comparison with KOSMOS-2, but I think the two works are quite similar.