mertyg / vision-language-models-are-bows

Experiments and data for the paper "When and why vision-language models behave like bags-of-words, and what to do about it?" Oral @ ICLR 2023
MIT License
222 stars 14 forks

Can you release the dataset used for the finetuning of NegCLIP? Thanks! #3

Closed mu-cai closed 1 year ago

HarmanDotpy commented 1 year ago

Same question, I was about to open an issue. In particular, the mscoco_with_negatives_training.csv file is required for this.
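For reference, a minimal sketch of inspecting such a training CSV with Python's standard `csv` module. The column names below (`filepath`, `title`, `neg_caption`) are hypothetical placeholders, not the confirmed schema of mscoco_with_negatives_training.csv; check the released file for the actual columns.

```python
import csv
import io

# Hypothetical miniature of the training CSV; the real
# mscoco_with_negatives_training.csv schema may differ.
csv_text = """filepath,title,neg_caption
img1.jpg,a dog chasing a ball,a ball chasing a dog
img2.jpg,a man riding a horse,a horse riding a man
"""

# In practice you would open the downloaded file instead of a string.
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(sorted(rows[0].keys()))  # available columns
print(len(rows))               # number of training rows
```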

thanks!

vinid commented 1 year ago

Temporarily added the data here. We will probably have to move it somewhere else, but this is the fastest option to give you access right now.

HarmanDotpy commented 1 year ago

Thank you! Just confirming: the negative images/text in the validation set are not actually used, right? (for training or for anything else)

vinid commented 1 year ago

Both models (CLIP and NegCLIP) are trained for 5 epochs, but I think the training breaks if the file is missing.

Note that we still pick the best model based on retrieval performance on the MSCOCO validation set (see the appendix in https://arxiv.org/pdf/2210.01936.pdf*)

*(camera-ready version should come in a week or so)
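To illustrate the selection criterion described above, here is a hedged sketch of picking the checkpoint with the best validation retrieval recall@1 using cosine similarity. This is not the repo's actual selection code; the embeddings, checkpoint names, and metric choice are illustrative assumptions.

```python
import numpy as np

def recall_at_1(img_emb, txt_emb):
    """Fraction of images whose top-1 retrieved caption is the matching one.
    Assumes row i of txt_emb is the caption embedding for row i of img_emb."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    sims = img @ txt.T                   # cosine similarity matrix
    return float(np.mean(sims.argmax(axis=1) == np.arange(len(img))))

# Toy "checkpoints": pick whichever scores best on validation retrieval.
rng = np.random.default_rng(0)
imgs = rng.normal(size=(8, 16))
checkpoints = {
    "epoch_1": rng.normal(size=(8, 16)),                # unrelated captions
    "epoch_2": imgs + 0.01 * rng.normal(size=(8, 16)),  # well-aligned captions
}
best = max(checkpoints, key=lambda k: recall_at_1(imgs, checkpoints[k]))
print(best)
```

The same loop applies unchanged whether the text side comes from CLIP or NegCLIP; only the checkpoint embeddings differ.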

HarmanDotpy commented 1 year ago

I see, just to clarify a couple more things