Hi,
If you want to run inference on individual frames, you'll need to use a model that expects individual frames, not videos.
Here you're loading microsoft/git-base-vatex, so it expects pixel_values of shape (batch_size, num_frames, num_channels, height, width).
To run inference on a batch of images, you can use models which are trained on image captioning datasets, like microsoft/git-base, microsoft/git-base-coco, microsoft/git-base-textcaps (as well as any of the large variants).
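For concreteness, a rough sketch of the two expected shapes with dummy tensors (the 224x224 resolution and the 6 frames are assumptions based on the docs example, not values stated here):
import torch

# image-captioning checkpoints (git-base, git-base-coco, git-base-textcaps):
# pixel_values has shape (batch_size, num_channels, height, width)
image_pixel_values = torch.randn(2, 3, 224, 224)

# video-captioning checkpoints (e.g. git-base-vatex):
# pixel_values has shape (batch_size, num_frames, num_channels, height, width)
video_pixel_values = torch.randn(2, 6, 3, 224, 224)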
Edit: after investigating, it still seems like there's an error. Looking into this.
Sorry, I should have also shown that it doesn't work on captioning models either (I have tested it on my side with both git-base-coco and git-large-coco). My bad!
I appreciate that you're looking into this 🙏
For some reason GitHub isn't automatically linking the PR that should fix it: #21071
Update: it seems the PR above doesn't fix it, so the issue remains open.
Ok, figured this out! The problem is that you're not passing input_ids with the same batch size. By default, the generate method will just use a single start token ID (which for GIT equals model.config.bos_token_id = 101). However, when sending a batch of images through the model, we also need to prepare a batch of start tokens.
The following works:
from transformers import AutoProcessor, AutoModelForCausalLM
import requests
from PIL import Image
import torch

processor = AutoProcessor.from_pretrained("microsoft/git-base-coco")
model = AutoModelForCausalLM.from_pretrained("microsoft/git-base-coco")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# preprocess a single image and duplicate it to fake a batch of 2
pixel_values = processor(images=image, return_tensors="pt").pixel_values
pixel_values = torch.stack([pixel_values, pixel_values], dim=0).squeeze()  # (2, 3, 224, 224)

# prepare one start (BOS) token per image in the batch
start_token_id = model.config.bos_token_id
generated_ids = model.generate(
    pixel_values=pixel_values,
    input_ids=torch.tensor([[start_token_id], [start_token_id]]),
    max_length=50,
)

generated_captions = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_captions)
I'll add a corresponding test to make sure this use case stays covered.
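As a variant (a sketch continuing from the snippet above, assuming the processor accepts a list of PIL images), you can batch different images directly and build one start token per image with torch.full instead of writing the batch out by hand:
# reuse processor, model and image from the snippet above
images = [image, image]  # in practice these would be different images
pixel_values = processor(images=images, return_tensors="pt").pixel_values  # (2, 3, 224, 224)
input_ids = torch.full((pixel_values.shape[0], 1), model.config.bos_token_id, dtype=torch.long)
generated_ids = model.generate(pixel_values=pixel_values, input_ids=input_ids, max_length=50)
print(processor.batch_decode(generated_ids, skip_special_tokens=True))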
System Info
Who can help?
@NielsRogge
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
I'm trying to run image captioning in batches. The easiest way to try that was to change the example for video captioning here. According to the source code, pixel_values must be in the shape of (batch_size, num_frames, num_channels, height, width) or (batch_size, num_channels, height, width). But reshaping the pixel_values from the example to turn video captioning into batch image captioning throws the following error:
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 6 but got size 1 for tensor number 1 in the list.
during
hidden_states = torch.cat((projected_visual_features, embedding_output), dim=1)
(line 1268 of modeling_git.py). Here is the code for reproducibility:
Expected behavior
I expected it to generate one sequence of ids per image in the batch.