pytorch / vision

Datasets, Transforms and Models specific to Computer Vision
https://pytorch.org/vision
BSD 3-Clause "New" or "Revised" License

Inconsistent representation of box in torchvision.ops #4745

Open npmhung opened 2 years ago

npmhung commented 2 years ago

Feature suggestion

When checking how the new operations in torchvision 0.11 work, I found an inconsistency between different functions.

import torchvision.ops as tops
import torch
import torch.nn as nn

a = torch.zeros(1, 1, 8, 8)
a[..., :4, 4:] = 1
bbox = tops.masks_to_boxes((a==1)[0].float()) 
area = tops.box_area(bbox) # the returned area is 9 which should be 16

new_box = tops.box_convert(bbox, 'xyxy', 'xywh') # which returns [4, 4, 3, 3] => expected output [4, 4, 4, 4]

Suggest a potential alternative/fix

This is not actually an error but could cause confusion.

I suggest making the box representations across different functions more consistent.

NicolasHug commented 2 years ago

Thanks for the report @npmhung, I agree this is a bit confusing.

The mask a is equal to

tensor([[[[0., 0., 0., 0., 1., 1., 1., 1.],
          [0., 0., 0., 0., 1., 1., 1., 1.],
          [0., 0., 0., 0., 1., 1., 1., 1.],
          [0., 0., 0., 0., 1., 1., 1., 1.],
          [0., 0., 0., 0., 0., 0., 0., 0.],
          [0., 0., 0., 0., 0., 0., 0., 0.],
          [0., 0., 0., 0., 0., 0., 0., 0.],
          [0., 0., 0., 0., 0., 0., 0., 0.]]]])

and masks_to_boxes returns tensor([[4., 0., 7., 3.]]), which is correct (or is it??).

box_area returns 9 instead of 16 because the area is simply computed as

 (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])

When the box coordinates are indices, the correct way to compute the area would be to add +1 to both dimensions, but the problem is that this would make the computation incorrect when the box coordinates are real numbers (which they often are).
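To make the trade-off concrete, a quick sketch (reusing the box from above):

import torch
from torchvision.ops import box_area

# box returned by masks_to_boxes above; the coordinates are pixel indices
idx_box = torch.tensor([[4., 0., 7., 3.]])
print(box_area(idx_box))                                   # tensor([9.])  = (7 - 4) * (3 - 0)
print(box_area(idx_box + torch.tensor([0., 0., 1., 1.])))  # tensor([16.]) after the +1

# for a box with real-valued coordinates, the same +1 would over-count
float_box = torch.tensor([[0.0, 0.0, 2.5, 2.0]])
print(box_area(float_box))                                 # tensor([5.]) is already correct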

The issue is similar with box_convert.

I'm honestly not sure how to best solve this. Should we decide that when a box is represented as indices, the upper values (x2 and y2) should be offset by 1? I.e. masks_to_boxes should return 4, 0, 8, 4 instead? This would be consistent with Python indexing, i.e. we could do something like the_box_img = orig_img[x1:x2, y1:y2], which is incorrect right now.
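A quick sketch of what the exclusive upper bound would buy us (reusing the mask above; today a manual +1 is needed to slice the box out):

import torch
from torchvision.ops import masks_to_boxes

a = torch.zeros(1, 1, 8, 8)
a[..., :4, 4:] = 1
mask = (a == 1)[0].float()

x1, y1, x2, y2 = masks_to_boxes(mask)[0].long().tolist()  # currently (4, 0, 7, 3)

# note that rows come first when slicing, so y before x
crop = mask[0, y1:y2 + 1, x1:x2 + 1]   # today: +1 needed to recover the full 4x4 block
# crop = mask[0, y1:y2, x1:x2]         # with the exclusive upper bound this would suffice
print(crop.shape)                      # torch.Size([4, 4])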

Thoughts @fmassa @datumbox?

npmhung commented 2 years ago

I agree that masks_to_boxes should return 4, 0, 8, 4 instead. This would be more consistent with the roi_align op when the aligned flag is True.
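A rough way to double-check this (not asserting the exact output, just the check I have in mind):

import torch
from torchvision.ops import roi_align

# same mask as in the report: ones in rows 0-3, columns 4-7
a = torch.zeros(1, 1, 8, 8)
a[..., :4, 4:] = 1

# candidate box with an exclusive upper bound: (x1, y1, x2, y2) = (4, 0, 8, 4)
boxes = [torch.tensor([[4., 0., 8., 4.]])]

# if the conventions line up, pooling the mask itself should give (roughly) all ones
pooled = roi_align(a, boxes, output_size=(4, 4), spatial_scale=1.0, aligned=True)
print(pooled.squeeze())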

datumbox commented 2 years ago

Thanks for the ping.

which returns [4, 4, 3, 3] => expected output [4, 4, 4, 4]

I believe there is a typo in this comment. It should have been:

which returns [4, 0, 3, 3] => expected output [4, 0, 4, 4]

I agree that masks_to_boxes should return 4, 0, 8, 4 instead.

The caveat to doing this is that the coordinates of the right element might be outside of the image.

NicolasHug commented 2 years ago

The caveat to doing this is that the coordinates of the right element might be outside of the image.

Do you think it's common for users to directly index with the coordinates? I would assume that they would only use patterns with ranges, like the_box_img = orig_img[x1:x2, y1:y2], in which case this isn't really a problem.

datumbox commented 2 years ago

I can't speak for all use-cases but I think it's a reasonable thing to do. In several of the models we even have methods to ensure that the predictions are within bounds to avoid issues. Look at clip_boxes_to_image() for example.
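For reference, a minimal sketch of that clamping (if I remember correctly, the coordinates are clamped to [0, width] / [0, height]):

import torch
from torchvision.ops import clip_boxes_to_image

# a box whose right/bottom coordinates spill outside an 8x8 image
boxes = torch.tensor([[4., 0., 10., 9.]])
print(clip_boxes_to_image(boxes, size=(8, 8)))  # expected tensor([[4., 0., 8., 8.]])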

oke-aditya commented 2 years ago

My 2 cents, from what I noticed while writing the utility and the tutorial.

I would like to point out another case. The PennFudan dataset assumes that xmin, ymin as well as xmax, ymax are shifted by 1.

E.g. we convert the boxes to (even in the official tutorial):

tensor([[ 96., 134., 181., 417.],
        [286., 113., 357., 331.],
        [363., 120., 436., 328.]])

While the dataset txt files have

Bounding box for object 1 "PASpersonWalking" (Xmin, Ymin) - (Xmax, Ymax) : (97, 135) - (182, 418)

Bounding box for object 2 "PASpersonWalking" (Xmin, Ymin) - (Xmax, Ymax) : (287, 114) - (358, 332)

Bounding box for object 3 "PASpersonWalking" (Xmin, Ymin) - (Xmax, Ymax) : (364, 121) - (437, 329)

I think it might be better to introduce a parameter called offset and allow users to pass a Tuple[float, float, float, float]. Since the current behaviour of this utility is the more popular one, it might not be nice to change its default.

NicolasHug commented 2 years ago

The PennFudan dataset assumes that xmin, ymin as well as xmax, ymax are shifted by 1. E.g. we convert the boxes to (even in the official tutorial)

Do you mind pointing to the place where we do such offsetting? I don't see this being done in https://pytorch.org/tutorials/intermediate/torchvision_tutorial.html

oke-aditya commented 2 years ago

My point is that we don't, but the annotations from the original dataset do.

FudanPed00054.txt

NicolasHug commented 2 years ago

So how do we end up with

tensor([[ 96., 134., 181., 417.],
        [286., 113., 357., 331.],
        [363., 120., 436., 328.]])

?

oke-aditya commented 2 years ago

Sorry for the confusion @NicolasHug. Let me elaborate.

PennFudan vs Our Calculations

Let's say we have a bounding box on the 0th pixel (just a dot).

We could write it as [1.0, 1.0, 1.0, 1.0], meaning that we are at the 1st element of the image, or we could write it as [0.0, 0.0, 0.0, 0.0], meaning we are at the 0th element of the image.

I think it is a matter of 0-indexing vs 1-indexing, and hence the offset of 1 between the PennFudan annotations and our masks_to_boxes.

PennFudan assumed 1-indexed boxes and I think we return 0-indexed boxes from the masks_to_boxes function. Hence we are 1 less in every dimension.

Using indices for area?

As @NicolasHug pointed out, using indices for the area calculation is wrong! (We should bump both length and breadth by 1.)

Again, let's take an example: a rectangle with length 3 and breadth 4 has an actual area of 12 (3 * 4).

By indexing, the length difference is only 2 and the breadth difference only 3, so the indexed area = 6 (2 * 3).

So clearly using indices to calculate the area won't work!
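A minimal reproduction of that arithmetic with the current ops:

import torch
from torchvision.ops import box_area, masks_to_boxes

# a filled rectangle spanning 4 rows and 3 columns -> 12 pixels
mask = torch.zeros(1, 8, 8)
mask[:, 0:4, 0:3] = 1

box = masks_to_boxes(mask)   # tensor([[0., 0., 2., 3.]]) -- index-style corners
print(box_area(box))         # tensor([6.]) = (2 - 0) * (3 - 0), not 12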

Linking back to the above problem.

We currently return tensor([[4., 0., 7., 3.]]), which is correct according to our assumptions!

If we are calculating the width / area based on this 0-based indexing, of course the answer will be wrong :) (as pointed out above). One can't calculate width / area with indices. To make box_convert calculate correctly, we need +1 for xmax and ymax.

So is our assumption valid?

Our assumption is a valid one: we assume that we return 0-indexed boxes.

Should we return 1-indexed boxes, we would match PennFudan, but the area problem would still remain! The area calculation doesn't change anyway, because the length and breadth stay the same (with 1-based indexing the difference becomes ((xmax + 1) - (xmin + 1)) * ((ymax + 1) - (ymin + 1))).

If the user wants to calculate the area, they should offset the max coordinates by 1 (extend xmax and ymax, which would in turn increase the length and breadth).

Should they prefer 1-based indexing, they should increase all the coordinates (xmin, ymin, xmax, ymax) by 1.

A simple solution

Hence I propose an offset parameter for masks_to_boxes which will accommodate both cases (again, naming is NP-hard, suggest better names).

Or alternatively a new op

offset_boxes() which would separately handle this indexing issue.
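Roughly what I have in mind (offset_boxes() is not an existing op, just a sketch of the idea):

from typing import Tuple

import torch
from torch import Tensor

def offset_boxes(boxes: Tensor, offset: Tuple[float, float, float, float]) -> Tensor:
    # shift (x1, y1, x2, y2) by a per-coordinate offset:
    # (0, 0, 1, 1) turns index-style boxes into exclusive-upper-bound ones,
    # (1, 1, 1, 1) produces 1-indexed boxes
    return boxes + torch.as_tensor(offset, dtype=boxes.dtype, device=boxes.device)

boxes = torch.tensor([[4., 0., 7., 3.]])
print(offset_boxes(boxes, (0., 0., 1., 1.)))  # tensor([[4., 0., 8., 4.]])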

P.S. I might possibly be wrong too. It's kinda confusing.

cc @NicolasHug @datumbox @npmhung

NicolasHug commented 2 years ago

@oke-aditya if the PennFudan dataset uses 1-based indexing, does that mean that the box coordinates in https://pytorch.org/tutorials/intermediate/torchvision_tutorial.html are incorrect?

oke-aditya commented 2 years ago

Yes, I can confirm that from the colab notebook: we are doing 0-based indexing and PennFudan used 1-based.

For image_id 0, we calculated

'boxes': tensor([[159., 181., 301., 430.],
          [419., 170., 534., 485.]]), 'image_id': tensor([0]), 

Annotation FudanPed00001.txt:

Bounding box for object 1 "PASpersonWalking" (Xmin, Ymin) - (Xmax, Ymax) : (160, 182) - (302, 431)
Bounding box for object 2 "PASpersonWalking" (Xmin, Ymin) - (Xmax, Ymax) : (420, 171) - (535, 486)

(Even for the image_id they used 1-based indexing and we used 0 :sweat_smile:)

NicolasHug commented 2 years ago

Sorry, I still don't see where we convert 160 to 159?

oke-aditya commented 2 years ago

I don't get you; we don't convert 160 to 159 (I agree). The box co-ordinates are mismatched but not incorrect, as 0-based indexing still works fine for calculating things.

Edit: Am I doing something incorrectly?

NicolasHug commented 2 years ago

I'm really confused - where did you get the content of the Annotation FudanPed00001.txt file from?

oke-aditya commented 2 years ago

From the dataset itself. https://www.cis.upenn.edu/~jshi/ped_html/PennFudanPed.zip

NicolasHug commented 2 years ago

Are we using these annotations at all as ground truth targets?

oke-aditya commented 2 years ago

Nope, we are not using them as ground truth. Edit: sorry if PennFudan is confusing here; I just wanted to point out another use case.

NicolasHug commented 2 years ago

Thanks.

Then I'm not sure this 0 vs 1 indexing discussion is relevant to what we're concerned with here. Python is 0-based and we're following that too.

The problem we have here is the relation between x1 - x0 and y1 - y0, and how it plays out in the area calculation, and other conversions. Those concerns are valid whether one uses 0 or 1-based indexing, but I don't think we should worry about supporting 1-based indexing.

oke-aditya commented 2 years ago

Yes, indexing shouldn't matter and the area issue will exist in both. (Hopefully users don't want 1-indexed boxes :crossed_fingers:)

What do you think we are doing wrong (if anything)? Should we be offsetting the values of xmax, ymax by 1?

vadimkantorov commented 2 years ago

Yep, box_convert should be extremely clear about how it handles width, float coordinates, and integer coordinates, and maybe provide options / recommendations in the docs.

This blog post talks about this at length: https://ppwwyyxx.com/blog/2021/Where-are-Pixels/#Choices-of-Sampling-Grid. I assume the author of the blog post is around and can illuminate this discussion :)

fmassa commented 2 years ago

We should return area / sizes as x2 - x1, and not x2 - x1 + 1 as it was previously done in early versions of other libraries (like Detectron).

This was a deliberate decision, and as @rbgirshick pointed out in the past

Anyone interested in this issue should read http://alvyray.com/Memos/CG/Microsoft/6_pixel.pdf. I believe that this provides the correct framework for thinking about continuous geometric entities (e.g., boxes, polygons, etc) and discrete rasterized entities (e.g., bit masks, feature maps, etc.).

vadimkantorov commented 2 years ago

To remove confusion, it would be nice to have a push to completely merge all these bbox-format-related transforms and conversions with the detectron2 code and move it all to torchvision.