mertyg / vision-language-models-are-bows

Experiments and data for the paper "When and why vision-language models behave like bags-of-words, and what to do about it?" Oral @ ICLR 2023
MIT License
222 stars, 14 forks

Code for generating negatives for training NegCLIP #13

Closed: HarmanDotpy closed this issue 1 year ago

HarmanDotpy commented 1 year ago

I was wondering if you could also release the code used to create the dataset (with text and image negatives) that was then used to train the NegCLIP model.

In particular, I am looking for the few functions that may have been used to swap elements in the sentence, as shown in the image:

[Image: Screenshot 2023-03-30 at 1 48 27 PM]

Thanks!

mertyg commented 1 year ago

Ah, sorry for missing this, and thanks for the issue @HarmanDotpy! Here's a gist containing that piece: https://gist.github.com/mertyg/03e638fef99cbd8c3a108e8bacd16a6c We'll clean it up and push it sometime soon, but feel free to use this in the meantime.

HarmanDotpy commented 1 year ago

Awesome, thanks for such a quick response!
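
For readers who want a feel for the kind of sentence-level swap asked about above, here is a minimal sketch of building a text negative by exchanging two words of the same category in a caption. It is not the code from the gist and should not be read as the authors' method: the names `TOY_NOUNS`, `swap_tokens`, and `make_text_negative` are hypothetical, and the tiny hand-written noun list is an assumed stand-in for a real part-of-speech tagger.

```python
import random

# Assumption: a tiny hand-written noun list stands in for a real POS tagger.
TOY_NOUNS = {"horse", "grass", "man", "dog", "table", "cat"}


def swap_tokens(tokens, i, j):
    """Return a copy of `tokens` with positions i and j exchanged."""
    swapped = list(tokens)
    swapped[i], swapped[j] = swapped[j], swapped[i]
    return swapped


def make_text_negative(caption, nouns=TOY_NOUNS, rng=None):
    """Swap two different nouns in `caption`.

    Returns the perturbed caption, or None when the caption does not
    contain two distinct nouns from the (hypothetical) noun list.
    """
    if rng is None:
        rng = random.Random(0)  # fixed seed so the sketch is reproducible
    tokens = caption.split()  # whitespace tokenization; punctuation is not handled
    positions = [k for k, tok in enumerate(tokens) if tok.lower() in nouns]
    # Keep only pairs whose words differ, so the swap changes the sentence.
    pairs = [
        (a, b)
        for idx, a in enumerate(positions)
        for b in positions[idx + 1:]
        if tokens[a].lower() != tokens[b].lower()
    ]
    if not pairs:
        return None
    i, j = rng.choice(pairs)
    return " ".join(swap_tokens(tokens, i, j))


print(make_text_negative("the horse is eating the grass"))
# the grass is eating the horse
```

The negative keeps exactly the same words as the original caption and changes only their order, which is why a model that treats text as a bag of words cannot tell the two apart. A real pipeline would replace the toy noun list with a proper tagger and handle punctuation.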