Hi,
If you want to run inference on individual frames, you'll need to use a model that expects individual frames, not videos.
Here you're loading microsoft/git-base-vatex, so it expects pixel_values of shape (batch_size, num_frames, num_channels, height, width).
To run inference on a batch of images, you can use models which are trained on image captioning datasets, like microsoft/git-base, microsoft/git-base-coco, microsoft/git-base-textcaps (as well as any of the large variants).
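For concreteness, a rough sketch of the two expected shapes with dummy tensors (the 224x224 resolution and the 6 frames are assumptions based on the docs example, not values stated here):
import torch

# image-captioning checkpoints (git-base, git-base-coco, git-base-textcaps):
# pixel_values has shape (batch_size, num_channels, height, width)
image_pixel_values = torch.randn(2, 3, 224, 224)

# video-captioning checkpoints (e.g. git-base-vatex):
# pixel_values has shape (batch_size, num_frames, num_channels, height, width)
video_pixel_values = torch.randn(2, 6, 3, 224, 224)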
Edit: after investigating, it still seems like there's an error. Looking into this.
Sorry, I should have also shown that it doesn't work on captioning models either (I have tested it on my side with both git-base-coco and git-large-coco). My bad!
I appreciate that you're looking into this 🙏
For some reason GitHub isn't automatically linking the PR that should fix it: #21071
Update: it seems the PR above doesn't fix it, so the issue remains open.
Ok, figured this out! The problem is that you're not passing input_ids with the same batch size. By default, the generate method will just use a single start token ID (which for GIT equals model.config.bos_token_id = 101). However, when sending a batch of images through the model, we also need to prepare a batch of start tokens.
The following works:
from transformers import AutoProcessor, AutoModelForCausalLM
import requests
from PIL import Image
import torch

processor = AutoProcessor.from_pretrained("microsoft/git-base-coco")
model = AutoModelForCausalLM.from_pretrained("microsoft/git-base-coco")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# preprocess a single image and duplicate it to fake a batch of 2
pixel_values = processor(images=image, return_tensors="pt").pixel_values
pixel_values = torch.stack([pixel_values, pixel_values], dim=0).squeeze()  # (2, 3, 224, 224)

# prepare one start (BOS) token per image in the batch
start_token_id = model.config.bos_token_id
generated_ids = model.generate(
    pixel_values=pixel_values,
    input_ids=torch.tensor([[start_token_id], [start_token_id]]),
    max_length=50,
)

generated_captions = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_captions)
I'll add a corresponding test to make sure this use case stays covered.
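As a variant (a sketch continuing from the snippet above, assuming the processor accepts a list of PIL images), you can batch different images directly and build one start token per image with torch.full instead of writing the batch out by hand:
# reuse processor, model and image from the snippet above
images = [image, image]  # in practice these would be different images
pixel_values = processor(images=images, return_tensors="pt").pixel_values  # (2, 3, 224, 224)
input_ids = torch.full((pixel_values.shape[0], 1), model.config.bos_token_id, dtype=torch.long)
generated_ids = model.generate(pixel_values=pixel_values, input_ids=input_ids, max_length=50)
print(processor.batch_decode(generated_ids, skip_special_tokens=True))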
System Info
Who can help?
@NielsRogge
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
I'm trying to run image captioning in batches. The easiest way to try that was to change the example for video captioning here. According to the source code, pixel_values must be in the shape of (batch_size, num_frames, num_channels, height, width) or (batch_size, num_channels, height, width). But reshaping the pixel_values from the example to turn video captioning into batch image captioning throws the following error:
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 6 but got size 1 for tensor number 1 in the list.
during
hidden_states = torch.cat((projected_visual_features, embedding_output), dim=1)
(line 1268 of modeling_git.py). Here is the code for reproducibility:
Expected behavior
I expected it to generate one sequence of ids per image in the batch.