mertyg / vision-language-models-are-bows

Experiments and data for the paper "When and why vision-language models behave like bags-of-words, and what to do about it?" Oral @ ICLR 2023
MIT License
222 stars 14 forks

Can you release the dataset used for the finetuning of NegCLIP? Thanks! #3

Closed mu-cai closed 1 year ago

HarmanDotpy commented 1 year ago

Same question, I was about to open an issue. In particular, the mscoco_with_negatives_training.csv file is required for this.
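For reference, a minimal sketch of inspecting such a training CSV with Python's standard `csv` module. The column names below (`filepath`, `title`, `neg_caption`) are hypothetical placeholders, not the confirmed schema of mscoco_with_negatives_training.csv; check the released file for the actual columns.

```python
import csv
import io

# Hypothetical miniature of the training CSV; the real
# mscoco_with_negatives_training.csv schema may differ.
csv_text = """filepath,title,neg_caption
img1.jpg,a dog chasing a ball,a ball chasing a dog
img2.jpg,a man riding a horse,a horse riding a man
"""

# In practice you would open the downloaded file instead of a string.
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(sorted(rows[0].keys()))  # available columns
print(len(rows))               # number of training rows
```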

thanks!

vinid commented 1 year ago

Temporarily added the data here. We will probably have to move it somewhere else, but this is the fastest option to give you access right now.

HarmanDotpy commented 1 year ago

Thank you! Just confirming: the negative images/text in the validation set are not actually used, right? (for training or for anything else)

vinid commented 1 year ago

Both models (CLIP and NegCLIP) are trained for 5 epochs, but I think the training breaks if the file is missing.

Note that we still pick the best model based on retrieval performance on the MSCOCO validation set (see the appendix in https://arxiv.org/pdf/2210.01936.pdf*)

*(camera-ready version should come in a week or so)
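To illustrate the selection criterion described above, here is a hedged sketch of picking the checkpoint with the best validation retrieval recall@1 using cosine similarity. This is not the repo's actual selection code; the embeddings, checkpoint names, and metric choice are illustrative assumptions.

```python
import numpy as np

def recall_at_1(img_emb, txt_emb):
    """Fraction of images whose top-1 retrieved caption is the matching one.
    Assumes row i of txt_emb is the caption embedding for row i of img_emb."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    sims = img @ txt.T                   # cosine similarity matrix
    return float(np.mean(sims.argmax(axis=1) == np.arange(len(img))))

# Toy "checkpoints": pick whichever scores best on validation retrieval.
rng = np.random.default_rng(0)
imgs = rng.normal(size=(8, 16))
checkpoints = {
    "epoch_1": rng.normal(size=(8, 16)),                # unrelated captions
    "epoch_2": imgs + 0.01 * rng.normal(size=(8, 16)),  # well-aligned captions
}
best = max(checkpoints, key=lambda k: recall_at_1(imgs, checkpoints[k]))
print(best)
```

The same loop applies unchanged whether the text side comes from CLIP or NegCLIP; only the checkpoint embeddings differ.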

HarmanDotpy commented 1 year ago

I see, just to clarify a couple more things