mertyg / vision-language-models-are-bows

Experiments and data for the paper "When and why vision-language models behave like bags-of-words, and what to do about it?" Oral @ ICLR 2023
MIT License

Flava image preprocessing #24

Closed DianeBouchacourt closed 1 year ago

DianeBouchacourt commented 1 year ago

Hello,

Since the FLAVA model has no image_preprocess (it is None), calls to __getitem__ return PIL.Image objects, which can't be batched. Thus, running model.get_retrieval_scores_batched with FLAVA on, for example, VGR raises the error: default_collate: batch must contain tensors, numpy arrays, numbers, dicts or lists; found <class 'PIL.Image.Image'>

How did you run your experiments?

Btw, unfortunately, I think turning the PIL.Image into a tensor with torchvision.transforms.ToTensor() messes up the preprocessing; see:

from PIL import Image
import requests
from transformers import FlavaProcessor, FlavaModel, FlavaForPreTraining
import torchvision.transforms as torch_transforms
model = FlavaForPreTraining.from_pretrained("facebook/flava-full")
processor = FlavaProcessor.from_pretrained("facebook/flava-full")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# First call: pass the PIL image directly to the processor
inputs = processor(
  text=["a photo of a cat", "a photo of a dog"], 
  images=[image, image], 
  return_tensors="pt", 
  padding="max_length", 
  max_length=77,
  return_codebook_pixels=True,
  return_image_mask=True,
  # Other things such as mlm_labels, itm_labels can be passed here. See docs
)

inputs.bool_masked_pos.zero_()
print("Sum of pixel when passing an image", inputs['pixel_values'].sum())

# ToTensor converts the PIL image to a float tensor scaled to [0, 1],
# so the processor now receives different input values
image = torch_transforms.ToTensor()(image)
inputs = processor(
  text=["a photo of a cat", "a photo of a dog"], 
  images=[image, image], 
  return_tensors="pt", 
  padding="max_length", 
  max_length=77,
  return_codebook_pixels=True,
  return_image_mask=True,
  # Other things such as mlm_labels, itm_labels can be passed here. See docs
)

inputs.bool_masked_pos.zero_()
print("Sum of pixel when passing a tensor", inputs['pixel_values'].sum())

The returned pixel-value sums are, respectively:

Sum of pixel when passing an image tensor(75645.8438)
Sum of pixel when passing a tensor tensor(-501952.5625)
mertyg commented 1 year ago

See here. The notebook was just a quick demonstration; the experiments were run with the main scripts, which I recommend checking out. These use the collate functions we rely on to fix the PIL Image issue; see here.
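For context, a minimal sketch of what such a collate function can look like when the model has no image preprocessor: it keeps the PIL images in a plain Python list instead of stacking them, so the processor can batch them later. The item keys ("image", "caption_options") and the dataset name are assumptions for illustration, not necessarily the repo's actual field names.

from torch.utils.data import DataLoader

def pil_collate(batch):
    # default_collate raises on PIL.Image objects, so keep them in a plain
    # list and let the model's processor handle batching later.
    images = [item["image"] for item in batch]
    captions = [item["caption_options"] for item in batch]
    return {"image": images, "caption_options": captions}

# Hypothetical usage, assuming `vgr_dataset` yields dicts with these keys:
# loader = DataLoader(vgr_dataset, batch_size=32, collate_fn=pil_collate)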

DianeBouchacourt commented 1 year ago

Thanks! Otherwise, I just found out about

torch_transforms.PILToTensor() (https://discuss.pytorch.org/t/getting-typeerror-default-collate-batch-must-contain-tensors-numpy-arrays-numbers-dicts-or-lists-found-class-pil-image-image/161703), and this seems to return the same pixel values!

DianeBouchacourt commented 1 year ago

This way you don't have to play with collate_fn; it looks like it doesn't scale the values (unlike ToTensor()): https://pytorch.org/vision/main/generated/torchvision.transforms.PILToTensor.html

https://pytorch.org/vision/main/generated/torchvision.transforms.ToTensor.html?highlight=totensor#torchvision.transforms.ToTensor
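For reference, a quick standalone check of the difference between the two transforms, reusing the same COCO image as above; the printed values are illustrative, but the point is that PILToTensor keeps the raw uint8 values while ToTensor rescales to [0, 1].

from PIL import Image
import requests
import torchvision.transforms as torch_transforms

# Re-load the same COCO image used earlier in the thread
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

as_uint8 = torch_transforms.PILToTensor()(image)  # uint8 tensor, values in [0, 255]
as_float = torch_transforms.ToTensor()(image)     # float32 tensor, scaled to [0.0, 1.0]
print(as_uint8.dtype, int(as_uint8.max()))        # torch.uint8 255
print(as_float.dtype, float(as_float.max()))      # torch.float32 1.0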

DianeBouchacourt commented 1 year ago

Never mind, it looks like the images are still of different sizes :( so I will follow your collate_fn trick, thanks

DianeBouchacourt commented 1 year ago

Would you have an example command to launch FLAVA on VGA with the main scripts?

mertyg commented 1 year ago

I think something like the script here should work, e.g.:

model=flava
for dataset in VG_Relation VG_Attribution COCO_Order Flickr30k_order
do
    python3 main_aro.py --dataset=$dataset --model-name=$model --device=cuda
done
DianeBouchacourt commented 1 year ago

Thanks!

Btw, do you plan on updating the numbers for all datasets now that model.eval() has been fixed? Also, can you confirm that you filter for VGA attributes with more than 25 examples? If that's the case, it looks like only 7,099 examples are left out of the 28,748 reported in the paper.

mertyg commented 1 year ago
  1. Yes, I will update. I can tell you now that the numbers either change marginally or remain the same; there are no significant changes. In any case, an update will come.
  2. The dataset we provide does include those samples; only the reported macro accuracy is computed over a filtered set of attribute pairs. I believe the choice of evaluation metric we used in the paper does not change the number of test cases we released; people can use different metrics over the entire set or can use the tail test cases for any other purpose.
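For what it's worth, a rough sketch of how a macro accuracy over attribute pairs with more than 25 test cases could be computed; the column names ("attribute_pair", "correct") are assumptions for illustration, not the repo's actual fields.

import pandas as pd

def macro_accuracy(df: pd.DataFrame, min_count: int = 25) -> float:
    # Per-attribute-pair accuracy, keeping only pairs with more than
    # `min_count` test cases, then averaged across the kept pairs.
    stats = df.groupby("attribute_pair")["correct"].agg(["mean", "count"])
    kept = stats[stats["count"] > min_count]
    return kept["mean"].mean()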