Closed rstebbing closed 1 year ago
Hi @rstebbing,
Indeed, this is a pretty tricky issue. Your understanding of the image processor and model matches mine :)
It seems that the effect of batch size is something the authors were aware of: https://github.com/facebookresearch/detr#evaluation, although they don't specify why (e.g. the influence of layer norm).
cc @rafaelpadilla, who has also been investigating some of the influences of batch size on object detection metrics and came across the same issue.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
I'm surprised to see this closed, but also appreciate the resolution isn't super straightforward.
System Info
transformers version: 4.27.4
Who can help?
@amyeroberts @NielsRogge
Reproduction
I have been experimenting with DetrForObjectDetection and discovered an issue where the model output for a given image depends on the aspect ratio of the other images in the batch.
A reproducible example is given below:
The issue is the last line: the output of the last layer of the encoder is different for the first image in the batch.
Here is my understanding so far of how the issue arises:
image_processor resizes all images to be as large as possible, subject to the shortest edge being less than or equal to 800 and the longest edge being less than or equal to 1333.
[...] DetrForObjectDetection and all the way to the DetrEncoder, which then forwards only the pixel values to the backbone (see here).
[...] Conv2D layer). However, in this case, the backbone has batch normalization layers that add values too. The result of this is that the padding pixels get non-zero values which then influence downstream convolutions.
Expected behavior
If two images are included in a single batch, the model output for each image should be identical to the output obtained when the two images are evaluated in separate batches of size one.
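The mechanism described under Reproduction, and the batch-independence property expected here, can be illustrated with a small dependency-free sketch. This is a 1-D toy and not the DETR backbone: a fixed scale and shift stands in for an inference-mode batch normalization layer, and every name, kernel weight, and value below is hypothetical and chosen only for illustration.

```python
def conv1d_same(x, kernel):
    """1-D cross-correlation that treats out-of-range inputs as zero."""
    radius = len(kernel) // 2
    out = []
    for i in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + k - radius
            if 0 <= j < len(x):
                acc += w * x[j]
        out.append(acc)
    return out


def affine(x, scale, shift):
    """Stand-in for batch norm at inference: a fixed scale and shift."""
    return [scale * v + shift for v in x]


def toy_layer(x, shift):
    """Normalization stand-in followed by a zero-padded convolution."""
    return conv1d_same(affine(x, scale=1.0, shift=shift), [0.25, 0.5, 0.25])


image = [1.0, 2.0, 3.0, 4.0]
# The same image, zero-padded as if it shared a batch with a wider image.
padded = image + [0.0, 0.0, 0.0]
valid = len(image)

# With no shift, the padding stays at zero and the valid region is unchanged.
no_shift_alone = toy_layer(image, shift=0.0)
no_shift_batched = toy_layer(padded, shift=0.0)[:valid]

# With a shift, the padding becomes non-zero and leaks into the valid region.
alone = toy_layer(image, shift=0.5)
batched = toy_layer(padded, shift=0.5)[:valid]

print("no shift, identical:", no_shift_alone == no_shift_batched)
print("with shift, alone:  ", alone)
print("with shift, batched:", batched)
```

In this toy only the output position next to the padding changes, because there is a single convolution with a kernel of width three. In a deep backbone the affected region would be expected to grow with the receptive field.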