huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Model outputs are impacted by the aspect ratios of other images in a batch #23218

Closed rstebbing closed 1 year ago

rstebbing commented 1 year ago

Who can help?

@amyeroberts @NielsRogge

Reproduction

I have been experimenting with DetrForObjectDetection and discovered an issue where the model output for a given image depends on the aspect ratio of the other images in the batch.

A reproducible example is given below:

import io

import requests
import torch
from PIL import Image
from transformers import DetrForObjectDetection, DetrImageProcessor

def main():
    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    print(f"{url = }")

    with requests.Session() as session:
        image_bytes = session.get(url).content

    image = Image.open(io.BytesIO(image_bytes))
    print(f"{image.size = }")

    pretrained_model_name = "facebook/detr-resnet-50"
    print(f"{pretrained_model_name = }")

    image_processor = DetrImageProcessor.from_pretrained(pretrained_model_name)
    assert isinstance(image_processor, DetrImageProcessor)
    model = DetrForObjectDetection.from_pretrained(pretrained_model_name)
    assert isinstance(model, DetrForObjectDetection)

    for images_expr, images in [
        (
            "[image]",
            [image],
        ),
        (
            "[image, image]",
            [image, image],
        ),
        (
            "[image, image.resize((image.width, image.height * 2))]",
            [image, image.resize((image.width, image.height * 2))],
        ),
    ]:
        print(f"images = {images_expr}")

        inputs = image_processor(images=images, return_tensors="pt")
        assert sorted(inputs) == ["pixel_mask", "pixel_values"]
        pixel_mask, pixel_values = inputs["pixel_mask"], inputs["pixel_values"]
        print(f"  {pixel_mask.shape = }, {pixel_values.shape = }")

        with torch.no_grad():
            outputs = model(
                pixel_mask=pixel_mask,
                pixel_values=pixel_values,
            )

        print(f"  {outputs.encoder_last_hidden_state.shape = }")
        print(f"  {outputs.encoder_last_hidden_state[0, 0, :8] = }")

if __name__ == "__main__":
    main()

Output:

url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image.size = (640, 480)
pretrained_model_name = 'facebook/detr-resnet-50'
images = [image]
  pixel_mask.shape = torch.Size([1, 800, 1066]), pixel_values.shape = torch.Size([1, 3, 800, 1066])
  outputs.encoder_last_hidden_state.shape = torch.Size([1, 850, 256])
  outputs.encoder_last_hidden_state[0, 0, :8] = tensor([-0.0544, -0.0425, -0.0307, -0.0107,  0.0201, -0.1194,  0.0373,  0.0250])
images = [image, image]
  pixel_mask.shape = torch.Size([2, 800, 1066]), pixel_values.shape = torch.Size([2, 3, 800, 1066])
  outputs.encoder_last_hidden_state.shape = torch.Size([2, 850, 256])
  outputs.encoder_last_hidden_state[0, 0, :8] = tensor([-0.0544, -0.0425, -0.0307, -0.0107,  0.0201, -0.1194,  0.0373,  0.0250])
images = [image, image.resize((image.width, image.height * 2))]
  pixel_mask.shape = torch.Size([2, 1200, 1066]), pixel_values.shape = torch.Size([2, 3, 1200, 1066])
  outputs.encoder_last_hidden_state.shape = torch.Size([2, 1292, 256])
  outputs.encoder_last_hidden_state[0, 0, :8] = tensor([-0.0399, -0.0472, -0.0268, -0.0136,  0.0196, -0.1215,  0.0678,  0.0230])

The issue is the last line: the encoder's last hidden state for the first image differs from the previous two runs, even though the first image itself is unchanged.
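
The shapes in the log above can be reproduced with a little arithmetic. The sketch below assumes that the processor resizes the shorter edge to 800 (capped at 1333 on the longer edge), pads every image to the largest height and width in the batch, and that the backbone reduces each spatial dimension by a factor of 32, rounding up. These values are inferred from the printed shapes rather than stated in this thread, and the helper names are hypothetical.

```python
import math

SHORTEST_EDGE = 800   # assumed resize target for the shorter side
LONGEST_EDGE = 1333   # assumed cap for the longer side
STRIDE = 32           # assumed total downsampling factor of the backbone


def resized_hw(width, height):
    """Return (height, width) after an aspect-preserving resize (hypothetical helper)."""
    short, long = min(width, height), max(width, height)
    target = SHORTEST_EDGE
    # Shrink the target if the longer side would otherwise exceed the cap.
    if long / short * target > LONGEST_EDGE:
        target = int(round(LONGEST_EDGE * short / long))
    if width <= height:
        return int(target * height / width), target
    return target, int(target * width / height)


def batch_shapes(sizes):
    """sizes: list of (width, height) pairs, as reported by PIL's image.size."""
    resized = [resized_hw(w, h) for w, h in sizes]
    # Every image is padded to the largest height and width in the batch.
    pad_h = max(h for h, _ in resized)
    pad_w = max(w for _, w in resized)
    # Encoder sequence length = number of positions in the downsampled feature map.
    tokens = math.ceil(pad_h / STRIDE) * math.ceil(pad_w / STRIDE)
    return resized, (pad_h, pad_w), tokens


for sizes in ([(640, 480)], [(640, 480), (640, 960)]):
    resized, padded, tokens = batch_shapes(sizes)
    print(f"{resized = }, {padded = }, {tokens = }")
```

Under these assumptions the single-image batch gives a padded shape of (800, 1066) and 850 encoder positions, while the mixed batch gives (1200, 1066) and 1292 positions, matching the log. In the mixed batch the first image fills only 800 of the 1200 rows, so a third of its input is padding.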

Here is my understanding so far of how the issue arises:

Expected behavior

If two images are included in a single batch, the model output for each image should be identical to the output obtained when the two images are evaluated in separate batches of size one.
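
This expectation can be phrased as a check: compare each item's output from a batched call with its output from a batch of size one. The sketch below is a toy stand-in, not DETR. `pad_batch`, `toy_model`, and `max_batch_gap` are hypothetical names, and the toy model deliberately ignores padding so that the check has something to catch.

```python
def pad_batch(items):
    """Zero-pad variable-length sequences to the longest one in the batch."""
    width = max(len(item) for item in items)
    return [list(item) + [0.0] * (width - len(item)) for item in items]


def toy_model(items):
    """Toy stand-in for a model: averages each row over its *padded* length.

    Because padding is not masked out, the output for one item depends on
    the longest item in the batch.
    """
    return [sum(row) / len(row) for row in pad_batch(items)]


def max_batch_gap(model, items):
    """Largest absolute difference between batched and one-at-a-time outputs."""
    batched = model(items)
    single = [model([item])[0] for item in items]
    return max(abs(b - s) for b, s in zip(batched, single))


same_length = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
mixed_length = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0, 7.0, 8.0, 9.0]]
print(max_batch_gap(toy_model, same_length))   # 0.0
print(max_batch_gap(toy_model, mixed_length))  # 1.0
```

A gap of zero means the batch composition has no effect. For a real model the comparison would need a floating-point tolerance and would have to be restricted to the positions that belong to the unpadded image.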

amyeroberts commented 1 year ago

Hi @rstebbing,

Indeed, this is a pretty tricky issue. Your understanding of the image processor and model matches mine :)

It seems the authors were aware that batch size affects the results: https://github.com/facebookresearch/detr#evaluation, although they don't specify the cause (e.g. the influence of layer norm).

cc @rafaelpadilla, who has also been investigating the influence of batch size on object detection metrics and came across the same issue.

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

rstebbing commented 1 year ago

I'm surprised to see this closed, but also appreciate the resolution isn't super straightforward.