mertyg / vision-language-models-are-bows

Experiments and data for the paper "When and why vision-language models behave like bags-of-words, and what to do about it?" Oral @ ICLR 2023
MIT License

slow evaluation for xvlm #23

Closed lezhang7 closed 1 year ago

lezhang7 commented 1 year ago

Hi, when I try to reproduce the XVLM results with the following script:

model=xvlm-coco # Choose the model you want to test

for dataset in VG_Relation VG_Attribution 
do
    python3 main_aro.py --dataset=$dataset --model-name=$model --device=cuda --batch-size 768
done

the evaluation is quite slow, taking around 15 minutes for a single dataset. Is there anything wrong with my script, given that CLIP takes only around 2 minutes for the same dataset?

mertyg commented 1 year ago

XVLM/BLIP models are quite a bit more computationally expensive than CLIP, so this makes sense to me.
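
To check whether a gap like this comes from the model's forward pass rather than from the script, one option is to time the scoring step per batch. The following is a minimal sketch, not code from this repository: `time_batches` and the stand-in workloads `cheap` and `expensive` are hypothetical names, and you would pass in whatever callable scores one batch in your own setup.

```python
import time
from typing import Callable, Dict, Iterable


def time_batches(score_batch: Callable[[object], object],
                 batches: Iterable[object]) -> Dict[str, float]:
    """Call score_batch on every batch and collect wall-clock timings.

    Note: CUDA kernels run asynchronously, so when timing a GPU model,
    score_batch should call torch.cuda.synchronize() before returning;
    otherwise the measured time can be too small.
    """
    per_batch = []
    for batch in batches:
        start = time.perf_counter()
        score_batch(batch)
        per_batch.append(time.perf_counter() - start)

    total = sum(per_batch)
    return {
        "num_batches": len(per_batch),
        "total_seconds": total,
        "mean_seconds_per_batch": total / len(per_batch) if per_batch else 0.0,
    }


if __name__ == "__main__":
    # Hypothetical stand-ins for a cheaper and a more expensive scorer.
    def cheap(batch):
        return sum(batch)

    def expensive(batch):
        return sum(x * x for x in batch for _ in range(50))

    data = [list(range(1000)) for _ in range(20)]
    for name, fn in [("cheap", cheap), ("expensive", expensive)]:
        stats = time_batches(fn, data)
        print(name, stats["num_batches"], f"{stats['total_seconds']:.4f}s")
```

If the mean time per batch is consistently higher for one model at the same batch size, the slowdown is in the model itself and not in the loop or the data loading.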