mertyg / vision-language-models-are-bows

Experiments and data for the paper "When and why vision-language models behave like bags-of-words, and what to do about it?" Oral @ ICLR 2023
MIT License

Call model.eval() when computing scores, otherwise results are non-deterministic (torch.no_grad() is not enough) #17

Closed. DianeBouchacourt closed this issue 1 year ago.

DianeBouchacourt commented 1 year ago

It seems you do not call model.eval() before computing the scores (at least in the notebook example). Indeed, torch.no_grad() does not disable modules like Dropout. This causes problems, as the scores are non-deterministic.

It might also cause problems for other models, but I haven't checked. Shouldn't we add a call to model.eval()?
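
For reference, a minimal sketch (using a hypothetical toy scorer, not this repo's model code) of why torch.no_grad() alone is not enough: Dropout keeps sampling masks until the module is put in eval mode.

```python
import torch
import torch.nn as nn

# Toy scorer with Dropout, standing in for a VLM's scoring path
# (a hypothetical example, not this repo's model code).
torch.manual_seed(0)
scorer = nn.Sequential(nn.Linear(8, 8), nn.Dropout(p=0.5), nn.Linear(8, 1))
x = torch.randn(1, 8)

# no_grad() only disables gradient tracking; Dropout still samples random
# masks, so repeated calls give different scores.
with torch.no_grad():
    print(scorer(x).item(), scorer(x).item())  # two different values

# eval() switches Dropout (and BatchNorm, etc.) to inference behavior,
# so the same input now always gives the same score.
scorer.eval()
with torch.no_grad():
    print(scorer(x).item(), scorer(x).item())  # identical values
```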

mertyg commented 1 year ago

Yes, I think you are right, thanks for bringing this up. We definitely should have added `.eval()`. Just pushed an update adding it. Sorry for missing this.
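
For anyone hitting this, the fix is a one-liner before scoring. A minimal sketch, assuming the OpenAI `clip` package (illustrative only, not the exact notebook code; the other backbones are handled the same way):

```python
import torch
import clip  # assumption: OpenAI CLIP package as the example backbone

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
model.eval()  # the added call: put Dropout/BatchNorm into inference mode

texts = clip.tokenize(["a photo of a dog", "a photo of a cat"]).to(device)
with torch.no_grad():  # still useful to avoid building autograd graphs
    text_features = model.encode_text(texts)  # scores from these are now deterministic
```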

I quickly re-ran NegCLIP, CLIP, BLIP, and XVLM after the change, getting 0.804, 0.590, 0.585, and 0.736 on VG-R, respectively.