mertyg / vision-language-models-are-bows

Experiments and data for the paper "When and why vision-language models behave like bags-of-words, and what to do about it?" Oral @ ICLR 2023
MIT License

mismatching results on compositional task #9

Closed lezhang7 closed 1 year ago

lezhang7 commented 1 year ago

Hi, when I try to reproduce the results on ARO, I can't get the reported scores. The code I run is:

```bash
for dataset in VG_Relation VG_Attribution
do
    for resume in scratch/open_clip/src/Outputs/negclip/checkpoints/epoch_0.pt
    do
        python3 main_aro.py --dataset=$dataset --model-name=$model --resume=$resume --batch-size=$bs --device=cuda --download
    done
done
```

and I just got VG_Relation 68.11 and VG_Attribution 42.16, instead of the 81 and 71 reported in Table 6.

lezhang7 commented 1 year ago

I was also confused: in Table 2 you report a CLIP score of 59 on VG_Relation, while in Table 6 it becomes 63. Am I missing something?

vinid commented 1 year ago

Hello!!

Which model is this negclip/checkpoints/epoch_0.pt? How was it trained, and with which parameters?

0.63 is for CLIP-FT (the one fine-tuned on MS-COCO); the CLIP score is still 0.59.

(screenshot: results table from the paper)

lezhang7 commented 1 year ago

Thanks for the quick response. I just renamed the checkpoint downloaded by your gdown script. However, I also tried

```bash
for dataset in VG_Relation VG_Attribution
do
    python3 main_aro.py --dataset=$dataset --model-name=NegCLIP --device=cuda --batch-size=$bs
done
```

and the results are still the same: 68.11 and 42.17.

lezhang7 commented 1 year ago

By the way, I calculate the accuracy with `acc = (df['Accuracy'] * df['Count']).mean()`. I don't know if it should be computed this way, but I got 59 on VG-R for CLIP, which matches the reported score, so I guess this is correct.

lezhang7 commented 1 year ago

> Hello!!
>
> Which model is this negclip/checkpoints/epoch_0.pt? How was it trained, and with which parameters?
>
> 0.63 is for CLIP-FT (the one fine-tuned on MS-COCO); the CLIP score is still 0.59.
>
> (screenshot: results table from the paper)

But when you take a look at this one, it doesn't match:

(screenshot: the other table, which shows a different number)

vinid commented 1 year ago

Oh, you are right, that is a typo from the arXiv version; we should update it. This is the camera-ready version of the paper: https://openreview.net/forum?id=KRLUvxh8uaX

(preparing an answer to the reproducibility question soon)

lezhang7 commented 1 year ago

I see, you compute macro accuracy instead of plain accuracy. Could you share the code for computing the macro accuracy?

vinid commented 1 year ago

You should find all you need in the reproducibility notebook described here.

Macro accuracy on the relation dataset is just the average of the accuracy of each relation. If you use our evaluation wrapper it should be just `df["Accuracy"].mean()`.

Quick colab version
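
For reference, a minimal, self-contained sketch of the macro aggregation versus a count-weighted alternative; the per-relation records and numbers below are made up for illustration, the real ones come from the evaluation wrapper:

```python
import pandas as pd

# Toy per-relation records shaped like the ARO evaluation output
# (one row per relation, with its accuracy and number of test cases);
# the values are invented for illustration.
vgr_records = [
    {"Relation": "on", "Accuracy": 0.62, "Count": 120},
    {"Relation": "holding", "Accuracy": 0.55, "Count": 40},
    {"Relation": "behind", "Accuracy": 0.48, "Count": 80},
]
df = pd.DataFrame(vgr_records)

# Macro accuracy (what the paper reports): every relation counts equally.
macro_acc = df["Accuracy"].mean()

# Count-weighted accuracy, for comparison: frequent relations dominate.
weighted_acc = (df["Accuracy"] * df["Count"]).sum() / df["Count"].sum()

print(f"Macro accuracy:    {macro_acc:.4f}")
print(f"Weighted accuracy: {weighted_acc:.4f}")
```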

lezhang7 commented 1 year ago

Thank you. Still, when I test OpenAI's CLIP ViT-B/32, the macro accuracy on VG-Relation is 63, which matches the preprint instead of the camera-ready version. That suggests 59 is calculated with `acc = (df['Accuracy'] * df['Count']).mean()` while 63 is `acc = df['Accuracy'].mean()`, so I guess you should report 63 instead of 59.

vinid commented 1 year ago

I just ran the colab with OpenAI's CLIP and got 0.59; could you try to see what's missing, starting from that?

lezhang7 commented 1 year ago

> I just ran the colab with OpenAI's CLIP and got 0.59; could you try to see what's missing, starting from that?

Why use

```python
df = pd.DataFrame(vgr_records)
df = df[~df.Relation.isin(symmetric)]
print(f"VG-Relation Macro Accuracy: {df.Accuracy.mean()}")
```

instead of directly taking the mean?

vinid commented 1 year ago

Note that we don't use symmetric relations.

The problem is that if a relation is symmetric you have that r(X,Y) = r(Y,X).

For example, given an image of a cat close to a dog, both "close(Cat, Dog)" and "close(Dog, Cat)" are true. A model would just pick one of the captions at random, so it is not a very informative relation to study (except maybe for some bias analysis). Hence, we drop symmetric relations.
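
A self-contained sketch of that filtering step; the `symmetric` list and the toy dataframe below are made up for illustration, the repo's notebook defines the actual list:

```python
import pandas as pd

# Hypothetical subset of symmetric relations; the notebook defines the full list.
symmetric = ["near", "next to", "touching"]

# Toy per-relation results, shaped like the colab's dataframe.
df = pd.DataFrame({
    "Relation": ["on", "near", "holding", "next to"],
    "Accuracy": [0.62, 0.51, 0.55, 0.49],
    "Count":    [120, 80, 40, 30],
})

# Drop symmetric relations: since r(X, Y) = r(Y, X), both caption orders are
# correct and the model's pick between them is essentially a coin flip.
df = df[~df["Relation"].isin(symmetric)]
print(f"VG-Relation Macro Accuracy: {df['Accuracy'].mean():.4f}")
```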

vinid commented 1 year ago

Closing this for now, let me know if you have other questions!

HarmanDotpy commented 1 year ago

Hi @vinid, I was trying out the colab you shared above, but I changed the model to NegCLIP. In particular, I changed one line to

```python
model, preprocess = get_model(model_name="NegCLIP", device="cuda", root_dir=root_dir)
```

I am getting

VG-Relation Macro Accuracy: 0.8021811864440539
VG-Attribution Macro Accuracy: 0.7055937135374111

Just wanted to confirm if this is correct, especially the Relation accuracy.

vinid commented 1 year ago

Hello!

Before computing the scores, did you also apply

```python
df = df[df["Count"] > 9]
```

It is a commented-out instruction in the notebook.
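
Putting the two filters together, a compact self-contained version of the aggregation (toy numbers and a hypothetical `symmetric` list again, not the repo's actual data):

```python
import pandas as pd

# Toy per-relation results with Accuracy and Count columns, as in the colab.
df = pd.DataFrame({
    "Relation": ["on", "near", "holding", "next to"],
    "Accuracy": [0.62, 0.51, 0.55, 0.49],
    "Count":    [120, 80, 40, 9],
})
symmetric = ["near", "next to"]  # hypothetical subset of symmetric relations

# Keep only relations with at least 10 test cases, drop symmetric relations,
# then take the unweighted mean over what remains.
df = df[(df["Count"] > 9) & (~df["Relation"].isin(symmetric))]
print(f"VG-Relation Macro Accuracy: {df['Accuracy'].mean():.4f}")
```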

HarmanDotpy commented 1 year ago

I didn't try that earlier. Uncommenting it gives:

VG-Relation Macro Accuracy: 0.8038109510723876

vinid commented 1 year ago

the difference is small but let me look into this

HarmanDotpy commented 1 year ago

yup, not a big issue, but just wanted to confirm if this is the correct number.

Thanks for looking into it

vinid commented 1 year ago

yea makes total sense, thanks for pointing this out

DianeBouchacourt commented 1 year ago

Hey, I ran the same and, with `df = df[df["Count"] > 9]`, got:

VG-Relation Macro Accuracy: 0.803816692506885

Commenting it out gives:

VG-Relation Macro Accuracy: 0.8021892603363159

DianeBouchacourt commented 1 year ago

Also, on OpenAI CLIP, with `df = df[df["Count"] > 9]` I get:

VG-Relation Macro Accuracy: 0.5923217479726929

Commenting it out gives:

VG-Relation Macro Accuracy: 0.5946534905762514

vinid commented 1 year ago

Hi! thanks! The CLIP one matches the one in the paper

DianeBouchacourt commented 1 year ago

Also, funnily, if you use torch tensors (and not numpy) on cuda to compute the scores and then the accuracy, you get a VG-Relation Macro Accuracy for CLIP of 0.599128631916311 with `df = df[df["Count"] > 9]`.

mertyg commented 1 year ago

Thank you all! I think, somehow, most of the comments above are correct. 😄

  1. First of all, in the paper, Table 2 reports 0.80 for VG-R while Table 3 reports 0.81. This is an honest mistake; we are sorry about it.

  2. As for the reason, I think @DianeBouchacourt is spot on here. This appears to be due to non-determinism in cuda (e.g., see here or the official pytorch docs). If you do the computations on cuda you get that ~0.002-0.005 difference in performance, and our legacy code (before cleaning and releasing) did exactly that.

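For anyone who wants to rule that source of variance out, a sketch of the standard PyTorch determinism switches (not something the repo itself does, just the usual knobs; determinism can cost speed and some ops will raise if they have no deterministic implementation):

```python
import os
import torch

# Required for deterministic cuBLAS matmuls on CUDA >= 10.2
# (must be set before the first CUDA call).
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

torch.manual_seed(0)
torch.use_deterministic_algorithms(True)   # error out on nondeterministic ops
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

# Alternatively, move the score computation to CPU / float64, where the
# small ~0.002-0.005 cuda-related drift reported above should disappear.
```
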
mertyg commented 1 year ago

I think overall nondeterminism can get pretty tricky in the context of the VG-R dataset. The embeddings can be pretty close, and minor differences in the embeddings due to nondeterminism can lead to differences of around ~0.002-0.005 in the final accuracy.
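
As a toy illustration of why that happens (made-up numbers, not from the dataset): when the true-caption and swapped-caption similarities are nearly tied, a perturbation on the order of float32/cuda noise is enough to flip the retrieval decision.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up similarity scores for (true caption, swapped caption) pairs that are
# nearly tied, as can happen on VG-Relation.
true_scores = np.array([0.3051, 0.2710, 0.2988])
swap_scores = np.array([0.3049, 0.2695, 0.2990])

# Tiny perturbation standing in for cuda / float32 nondeterminism.
noise = rng.normal(scale=1e-3, size=true_scores.shape)

acc_clean = (true_scores > swap_scores).mean()
acc_noisy = (true_scores + noise > swap_scores).mean()
print(f"accuracy without noise: {acc_clean:.3f}, with noise: {acc_noisy:.3f}")
```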