mertyg / vision-language-models-are-bows

Experiments and data for the paper "When and why vision-language models behave like bags-of-words, and what to do about it?" Oral @ ICLR 2023
MIT License

Concrete benchmark results for attribute understanding #10

Closed Yangyi-Chen closed 1 year ago

Yangyi-Chen commented 1 year ago

Hi, thanks for your great work!

Could you also provide concrete results for attribute understanding? Currently, the paper only shows figures, plus some concrete benchmark results for relation understanding in the Appendix.

Great thanks!

vinid commented 1 year ago

Hello!!! thank you!!

could you elaborate a bit more on what you'd like to see?

vinid commented 1 year ago

closing this for now but feel free to reopen it!

Yangyi-Chen commented 1 year ago

Hi! Sorry for the late reply.

I would like to see fine-grained results on the Visual Genome Attribution dataset, just as Table 2 shows them for the Visual Genome Relation dataset.

If collecting the fine-grained results is too tedious, could you just provide each model's overall performance on the attribute dataset?

vinid commented 1 year ago

Hello!

The problem with that table is that it's very long and kind of difficult to read. I'll search for the CSV with the attribute results, but it might be easier and faster to just run the notebook in Colab if you want to collect them yourself!
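For reference, once per-example predictions are exported from the notebook, the fine-grained table can be aggregated with a few lines of pandas. This is only a sketch: the column names (`attribute`, `correct`) and the toy data below are assumptions, not the repo's actual output schema.

```python
import pandas as pd

# Hypothetical per-example results; in practice, load this from the
# notebook's exported CSV (column names here are assumptions).
df = pd.DataFrame({
    "attribute": ["red", "red", "wooden", "wooden", "wooden", "large"],
    "correct":   [1, 0, 1, 1, 0, 1],
})

# Accuracy per attribute, as in the paper's fine-grained tables.
per_attribute = df.groupby("attribute")["correct"].mean()
print(per_attribute)

# Macro average over attributes (each attribute weighted equally).
macro_acc = per_attribute.mean()
print(f"Macro accuracy: {macro_acc:.3f}")
```

Macro-averaging keeps rare attributes from being swamped by frequent ones; a plain `df["correct"].mean()` would give the micro (per-example) accuracy instead.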

Yangyi-Chen commented 1 year ago

great thanks!