mertyg / vision-language-models-are-bows

Experiments and data for the paper "When and why vision-language models behave like bags-of-words, and what to do about it?" Oral @ ICLR 2023
MIT License

Will the CoCo-order and Flickr-order dataset be released? #6

Closed linzhiqiu closed 1 year ago

linzhiqiu commented 1 year ago

I noticed that the dataset creation process is random, which makes it hard to compare against the numbers in your paper. Are you planning to release the exact shuffled captions for these two datasets? Alternatively, do you have any recommendations for how to report the performance?

vinid commented 1 year ago

Not sure if we plan to release this, but my suggestion right now is to do the experiments multiple times (we did 5, but you could do more) and then compute the averages. I do not expect huge variation.
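A minimal sketch of that suggestion, assuming you wrap the dataset construction and evaluation in a seed-parameterized function (the `evaluate_once` name and its body are placeholders, not the repo's actual API), then report the mean and spread over several runs:

```python
import numpy as np

def evaluate_once(seed):
    # Placeholder: rebuild the order dataset with this seed's caption
    # shuffles and return Precision@1; swap in the repo's real pipeline.
    rng = np.random.default_rng(seed)
    return rng.uniform(0.5, 0.6)  # stand-in for a real metric value

# Repeat the randomized evaluation (the authors used 5 runs; more is fine)
# and report mean and standard deviation across seeds.
runs = [evaluate_once(seed) for seed in range(5)]
mean, std = np.mean(runs), np.std(runs)
print(f"Precision@1: {mean:.3f} +/- {std:.3f}")
```

Reporting the standard deviation alongside the mean also lets readers judge whether the shuffle randomness actually matters for the comparison.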

linzhiqiu commented 1 year ago

Hey @vinid, thank you for the response. I have another question regarding the evaluate_scores() function for COCO and Flickr.

def evaluate_scores(self, scores):
    if isinstance(scores, tuple):
        scores_i2t = scores[0]
        scores_t2i = scores[1].T  # Make it N_ims x N_text
    else:
        scores_t2i = scores
        scores_i2t = scores

    preds = np.argmax(np.squeeze(scores_i2t, axis=1), axis=-1)
    correct_mask = (preds == 0)
    result_records = [{"Precision@1": np.mean(correct_mask)}]
    return result_records

This does not align with the VG-Attribution and VG-Relation implementations: here scores_i2t is taken from scores[0], whereas there it comes from scores[1]. Is this a bug, or is it intended?