Model outputs are impacted by the aspect ratios of other images in a batch

rstebbing commented 1 year ago

System Info

transformers version: 4.27.4
Platform: macOS-13.3.1-arm64-arm-64bit
Python version: 3.10.11
Huggingface_hub version: 0.13.3
PyTorch version (GPU?): 1.13.1 (False)
Tensorflow version (GPU?): not installed (NA)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed
Using GPU in script?: no
Using distributed or parallel set-up in script?: no

Who can help?

@amyeroberts @NielsRogge

Information

[ ] The official example scripts
[X] My own modified scripts

Tasks

[X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
[ ] My own task or dataset (give details below)

Reproduction

I have been experimenting with DetrForObjectDetection and discovered an issue where the model output for a given image depends on the aspect ratio of the other images in the batch.

A reproducible example is given below:

import io

import requests
import torch
from PIL import Image
from transformers import DetrForObjectDetection, DetrImageProcessor

def main():
    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    print(f"{url = }")

    with requests.Session() as session:
        image_bytes = session.get(url).content

    image = Image.open(io.BytesIO(image_bytes))
    print(f"{image.size = }")

    pretrained_model_name = "facebook/detr-resnet-50"
    print(f"{pretrained_model_name = }")

    image_processor = DetrImageProcessor.from_pretrained(pretrained_model_name)
    assert isinstance(image_processor, DetrImageProcessor)
    model = DetrForObjectDetection.from_pretrained(pretrained_model_name)
    assert isinstance(model, DetrForObjectDetection)

    for images_expr, images in [
        (
            "[image]",
            [image],
        ),
        (
            "[image, image]",
            [image, image],
        ),
        (
            "[image, image.resize((image.width, image.height * 2))]",
            [image, image.resize((image.width, image.height * 2))],
        ),
    ]:
        print(f"images = {images_expr}")

        inputs = image_processor(images=images, return_tensors="pt")
        assert sorted(inputs) == ["pixel_mask", "pixel_values"]
        pixel_mask, pixel_values = inputs["pixel_mask"], inputs["pixel_values"]
        print(f"  {pixel_mask.shape = }, {pixel_values.shape = }")

        with torch.no_grad():
            outputs = model(
                pixel_mask=pixel_mask,
                pixel_values=pixel_values,
            )

        print(f"  {outputs.encoder_last_hidden_state.shape = }")
        print(f"  {outputs.encoder_last_hidden_state[0, 0, :8] = }")

if __name__ == "__main__":
    main()

url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image.size = (640, 480)
pretrained_model_name = 'facebook/detr-resnet-50'
images = [image]
  pixel_mask.shape = torch.Size([1, 800, 1066]), pixel_values.shape = torch.Size([1, 3, 800, 1066])
  outputs.encoder_last_hidden_state.shape = torch.Size([1, 850, 256])
  outputs.encoder_last_hidden_state[0, 0, :8] = tensor([-0.0544, -0.0425, -0.0307, -0.0107,  0.0201, -0.1194,  0.0373,  0.0250])
images = [image, image]
  pixel_mask.shape = torch.Size([2, 800, 1066]), pixel_values.shape = torch.Size([2, 3, 800, 1066])
  outputs.encoder_last_hidden_state.shape = torch.Size([2, 850, 256])
  outputs.encoder_last_hidden_state[0, 0, :8] = tensor([-0.0544, -0.0425, -0.0307, -0.0107,  0.0201, -0.1194,  0.0373,  0.0250])
images = [image, image.resize((image.width, image.height * 2))]
  pixel_mask.shape = torch.Size([2, 1200, 1066]), pixel_values.shape = torch.Size([2, 3, 1200, 1066])
  outputs.encoder_last_hidden_state.shape = torch.Size([2, 1292, 256])
  outputs.encoder_last_hidden_state[0, 0, :8] = tensor([-0.0399, -0.0472, -0.0268, -0.0136,  0.0196, -0.1215,  0.0678,  0.0230])

The issue is the last line: the output of the last encoder layer for the first image changes when the second image in the batch has a different aspect ratio, even though the first image itself is unchanged.
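
The padded shapes and sequence lengths printed above can be reproduced by hand. The sketch below is only an approximation: it assumes a shortest-edge 800 / longest-edge 1333 resize rule with integer truncation, bottom/right padding to the largest height and width in the batch, and a backbone with total stride 32 and ceiling rounding. The helpers `resized_hw`, `padded_batch_hw` and `encoder_tokens` are hypothetical names for illustration, not part of the transformers API.

```python
import math


def resized_hw(height, width, shortest_edge=800, longest_edge=1333):
    """Hypothetical helper: scale so the short side equals `shortest_edge`,
    lowering the target if the long side would exceed `longest_edge`."""
    short, long = min(height, width), max(height, width)
    target = shortest_edge
    if long / short * target > longest_edge:
        target = int(round(longest_edge * short / long))
    if height <= width:
        return target, int(target * width / height)
    return int(target * height / width), target


def padded_batch_hw(sizes):
    """Pad bottom/right up to the largest height and width in the batch."""
    return max(h for h, _ in sizes), max(w for _, w in sizes)


def encoder_tokens(height, width, stride=32):
    # Assumption: total backbone stride of 32 with ceiling rounding.
    return math.ceil(height / stride) * math.ceil(width / stride)


image_hw = (480, 640)      # PIL reports size as (width, height) = (640, 480)
stretched_hw = (960, 640)  # image.resize((image.width, image.height * 2))

same = padded_batch_hw([resized_hw(*image_hw), resized_hw(*image_hw)])
mixed = padded_batch_hw([resized_hw(*image_hw), resized_hw(*stretched_hw)])

print(same, encoder_tokens(*same))    # (800, 1066) 850
print(mixed, encoder_tokens(*mixed))  # (1200, 1066) 1292
```

Under these assumptions, the first image gains 400 rows of padding in the mixed batch, which is what exposes it to the effect described in this issue.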

Here is my understanding so far of how the issue arises:

The image_processor resizes all images to be as large as possible, subject to the shortest edge being less than or equal to 800 and the longest edge being less than or equal to 1333.
To combine images of different aspect ratios in the same batch, images are padded with zeros at the bottom and right.
The pixel values and pixel mask are forwarded through DetrForObjectDetection and all the way to the DetrEncoder, which then forwards only the pixel values to the backbone (see here).
If an image is padded with zeros, omitting the pixel mask is fine as long as the layers preserve zeros (e.g. a bias-free Conv2D layer). In this case, however, the backbone has batch normalization layers, which also add an offset. As a result, the padding pixels get non-zero values, which then influence downstream convolutions.
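
The last step can be illustrated with a toy 1D sketch. It is not the DETR backbone: it assumes only that inference-mode batch normalization acts as a per-channel scale and shift, and that it is followed by a bias-free convolution with implicit zero padding. The names `affine`, `conv1d_same` and `border_gap` are made up for this illustration.

```python
def conv1d_same(x, kernel):
    """Bias-free 1D convolution with implicit zero padding (same length out)."""
    half = len(kernel) // 2
    out = []
    for i in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + k - half
            if 0 <= j < len(x):
                acc += w * x[j]
        out.append(acc)
    return out


def affine(x, scale, shift):
    """Stand-in for inference-mode batch norm: a fixed scale and shift."""
    return [scale * v + shift for v in x]


def border_gap(shift, pad=4):
    """Largest difference on the real samples between running the signal
    alone and running it with `pad` zeros appended on the right."""
    kernel = [0.25, 0.5, 0.25]
    signal = [1.0, 2.0, 3.0, 4.0]
    padded = signal + [0.0] * pad

    alone = conv1d_same(affine(signal, 1.0, shift), kernel)
    batched = conv1d_same(affine(padded, 1.0, shift), kernel)[: len(signal)]
    return max(abs(a - b) for a, b in zip(alone, batched))


for shift in (0.0, 0.5):
    print(shift, border_gap(shift))
# 0.0 0.0
# 0.5 0.125
```

With a zero shift the padded and unpadded results agree on the real samples. With a non-zero shift the padding positions take the value of the shift, and the convolution pulls that value into the last real sample. This sketch isolates the shift effect only; a deeper network compounds it over a growing receptive field.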

Expected behavior

If two images are included in a single batch, the model output for each image should be identical to the output obtained when the two images are evaluated in separate batches of size one.

amyeroberts commented 1 year ago

Hi @rstebbing,

Indeed, this is a pretty tricky issue. Your understanding of the image processor and model matches mine :)

It seems that the effect of batch size is something the authors were aware of: https://github.com/facebookresearch/detr#evaluation, although they don't specify why (e.g. the influence of layer norm).

cc @rafaelpadilla, who has also been investigating some of the influences of batch size on object detection metrics and came across the same issue.

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

rstebbing commented 1 year ago

I'm surprised to see this closed, but also appreciate the resolution isn't super straightforward.
