microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI
MIT License

Kosmos-1 and Kosmos-2: Difficulty Replicating Zero-Shot Image Classification Performance of Kosmos-1 with Kosmos-2 #1358

Open ShunsukeOnoo opened 1 year ago

ShunsukeOnoo commented 1 year ago

Description

I am engaged in research with Kosmos-2, aiming to replicate the Zero-Shot Image Classification with Descriptions task as detailed in Section 4.7 of the Kosmos-1 paper (figure). Unfortunately, I'm encountering challenges in matching the performance outcomes reported for Kosmos-1. The absence of published performance data for Kosmos-2 on this task leaves me uncertain whether the observed discrepancies stem from model variations or my implementation approach.

Inquiries

  1. Has there been an evaluation of Kosmos-2 on the Zero-Shot Image Classification task, and if so, may I inquire about the results?
  2. Would it be possible to access the evaluation scripts or datasets used for Kosmos-1? They would help with benchmarking.
  3. Are there any intentions to make the weights of Kosmos-1 available to the public in the near future?

Experimentation Details

For the replication study, I've created a dataset analogous to the one described in the Kosmos-1 paper, using the CUB dataset from Hugging Face. My evaluation focuses on the woodpecker and sparrow pairs, adopting the descriptions from Table 11 of the Kosmos-1 paper. The penguin pair was excluded because it is absent from the dataset. I count a generation as correct when the first species name it contains matches the ground-truth name or an acceptable variation of it, allowing for differences in punctuation. The accuracy was 71.7% without descriptions and 61% with descriptions, which contrasts with the trend reported for Kosmos-1, where adding descriptions improved accuracy.
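To make the scoring rule concrete, here is a minimal sketch of how such a criterion could be implemented. It is one possible reading of the rule, not the script actually used for the numbers above. The label strings, the `ACCEPTED` alias table, and the function names are all hypothetical and would need to match the real class names and variations in the dataset.

```python
import re

# Hypothetical alias table: each gold label maps to the surface forms
# accepted as a match. The real list of variations is an assumption here.
ACCEPTED = {
    "three toed woodpecker": ["three toed woodpecker", "american three toed woodpecker"],
    "downy woodpecker": ["downy woodpecker"],
    "black throated sparrow": ["black throated sparrow"],
    "fox sparrow": ["fox sparrow"],
}


def normalize(text):
    """Lowercase and turn every run of non-alphanumeric characters into one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def first_mentioned_label(generation, accepted=ACCEPTED):
    """Return the label whose accepted form appears earliest in the text, or None."""
    norm = normalize(generation)
    best_label, best_pos = None, None
    for label, forms in accepted.items():
        for form in forms:
            match = re.search(rf"\b{re.escape(normalize(form))}\b", norm)
            if match and (best_pos is None or match.start() < best_pos):
                best_label, best_pos = label, match.start()
    return best_label


def accuracy(generations, gold_labels, accepted=ACCEPTED):
    """Fraction of generations whose first mentioned species equals the gold label."""
    correct = sum(
        first_mentioned_label(g, accepted) == y
        for g, y in zip(generations, gold_labels)
    )
    return correct / len(gold_labels)
```

Normalizing punctuation before matching means that "three-toed" and "Three_toed" both count as "three toed". Scoring only the earliest mention penalizes outputs that name the wrong species first and the right one later, which is a stricter choice than checking whether the gold name appears anywhere in the output.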

I would be grateful for any support you can provide and eagerly await your guidance.

Thedatababbler commented 9 months ago

I'm also testing the zero-shot reasoning capability of Kosmos-2, and it's not as promising as what I read in the Kosmos-1 paper. Would you mind sharing your code for this evaluation on the CUB dataset so I can replicate more zero-shot experiments? Thank you very much.