
What is the difference between your work and kosmos-2 ? #7

Open dongxiaolong opened 1 year ago

dongxiaolong commented 1 year ago

Could you tell me how your work differs from KOSMOS-2? I haven't seen a comparison with KOSMOS-2, but the two works seem quite similar.

jwh97nn commented 11 months ago

Position representation? KOSMOS-2 uses additional location tokens, while this work does not.

zzhanghub commented 10 months ago

Thank you all for your attention to Shikra.

KOSMOS-2 is impressive work. However, Shikra had already been submitted for peer review before the public release of KOSMOS-2, and both were uploaded to arXiv on nearly the same day, which is why a comparison was not included in the manuscript. There are numerous differences between the two approaches; from my perspective, the most crucial ones are as follows:

- **Position Representation:** The two models differ in how they represent positions. KOSMOS-2 adopts a position representation similar to Pix2seq, OFA, and Unified-IO, incorporating an additional vocabulary of location tokens. Shikra, on the other hand, expresses coordinates in natural-language numerals (see the sketch after this list).
- **Training Data:** KOSMOS-2 leverages spaCy and GLIP to create an extensive pseudo-labeled dataset called GRIT (with 130M+ bounding boxes). In contrast, Shikra utilizes human-annotated position-related data to grasp input-output position understanding, supplemented by a small yet high-quality dataset (Shikra-RD).
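To make the position-representation point concrete, here is a minimal sketch (not code from either repository) of how the same box could be serialized under each scheme. The 32-bin grid and the `<loc_i>` token naming are illustrative simplifications of KOSMOS-2's discrete location vocabulary, and the three-decimal normalized format follows Shikra's plain-text style; consult each paper for the exact formats.

```python
def shikra_style(box, img_w, img_h):
    """Render a pixel-space box [x1, y1, x2, y2] as plain-text numerals
    normalized to [0, 1], so it is handled by the ordinary text vocabulary."""
    x1, y1, x2, y2 = box
    norm = (x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h)
    return "[{:.3f},{:.3f},{:.3f},{:.3f}]".format(*norm)


def kosmos2_style(box, img_w, img_h, bins=32):
    """Quantize each corner into a bins x bins grid and emit dedicated
    location tokens drawn from an extra vocabulary of bins**2 entries."""
    x1, y1, x2, y2 = box

    def to_index(x, y):
        col = min(int(x / img_w * bins), bins - 1)
        row = min(int(y / img_h * bins), bins - 1)
        return row * bins + col

    return f"<box><loc_{to_index(x1, y1)}><loc_{to_index(x2, y2)}></box>"


box = (120, 80, 420, 360)  # pixel box in a 640x480 image
print(shikra_style(box, 640, 480))   # -> [0.188,0.167,0.656,0.750]
print(kosmos2_style(box, 640, 480))  # -> <box><loc_166><loc_789></box>
```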

Beyond these two points, the models also differ significantly in model structure, model size, training strategy, and instruction format.

Empowering MLLMs with referring and grounding capabilities holds great promise. We look forward to more researchers delving into this new area and bringing forth intriguing work and in-depth analyses.