Difficulty Replicating Kosmos-1 Zero-Shot Image Classification Performance with Kosmos-2

Description

I am using Kosmos-2 to replicate the Zero-Shot Image Classification with Descriptions task from Section 4.7 of the Kosmos-1 paper, but I have not been able to match the results reported for Kosmos-1. Because no Kosmos-2 results have been published for this task, I cannot tell whether the discrepancy comes from differences between the two models or from my implementation.

Inquiries

Has Kosmos-2 been evaluated on the Zero-Shot Image Classification task, and if so, could you share the results?
Could you share the evaluation scripts or datasets used for Kosmos-1, to help with benchmarking?
Are there plans to release the Kosmos-1 weights publicly in the near future?

Experimentation Details

For the replication, I built a dataset analogous to the one described in the Kosmos-1 paper, using the CUB dataset from Hugging Face. I evaluate on the woodpecker and sparrow pairs, using the descriptions from Table 11 of the Kosmos-1 paper. I excluded the penguin pair because those species are not in the dataset. A generation counts as correct when the first species name in the generated text matches the true name or an accepted variant of it, with punctuation taken into account. Accuracy was 71.7% without descriptions and 61% with descriptions. Adding descriptions therefore lowered accuracy, which contrasts with the trend reported for Kosmos-1.
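To make the scoring rule concrete, here is a minimal sketch of one way to read it, not the exact script I ran. It assumes "the first species name" means the earliest class name mentioned anywhere in the generation, and that punctuation is treated as whitespace. The function names (`normalize`, `first_mentioned_label`, `accuracy`), the label keys, and the name variants in the usage example are all hypothetical placeholders.

```python
import string

# Map every ASCII punctuation character to a space (assumption: punctuation
# should not affect matching, so "Three-toed" matches "three toed").
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def normalize(text):
    """Lowercase, turn punctuation into spaces, and collapse whitespace."""
    return " ".join(text.lower().translate(_PUNCT_TO_SPACE).split())


def first_mentioned_label(generated, variants_by_label):
    """Return the label whose name variant appears earliest in the text.

    Matching is done on whole words. Returns None if no variant is found.
    """
    padded = f" {normalize(generated)} "
    best_label, best_pos = None, len(padded)
    for label, variants in variants_by_label.items():
        for variant in variants:
            norm = normalize(variant)
            if not norm:
                continue  # skip empty variants so they never match
            pos = padded.find(f" {norm} ")
            if 0 <= pos < best_pos:
                best_label, best_pos = label, pos
    return best_label


def accuracy(generations, true_labels, variants_by_label):
    """Fraction of generations whose first mentioned label is the true one."""
    if not generations:
        return 0.0
    hits = sum(
        first_mentioned_label(g, variants_by_label) == t
        for g, t in zip(generations, true_labels)
    )
    return hits / len(generations)


if __name__ == "__main__":
    # Hypothetical label keys and variants, for illustration only.
    variants = {
        "three_toed": ["three-toed woodpecker"],
        "downy": ["downy woodpecker"],
    }
    outputs = ["Three-toed Woodpecker.", "a downy woodpecker"]
    print(accuracy(outputs, ["three_toed", "three_toed"], variants))
```

A stricter reading would require the generation to begin with the species name. Which reading is used can change the measured accuracy, so it would help to know which rule the Kosmos-1 evaluation applied.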

I would be grateful for any support you can provide and eagerly await your guidance.
