Open npmhung opened 2 years ago
Thanks for the report @npmhung , I agree this is a bit confusing.
The mask a is equal to
tensor([[[[0., 0., 0., 0., 1., 1., 1., 1.],
[0., 0., 0., 0., 1., 1., 1., 1.],
[0., 0., 0., 0., 1., 1., 1., 1.],
[0., 0., 0., 0., 1., 1., 1., 1.],
[0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0.]]]])
and masks_to_boxes
returns tensor([[4., 0., 7., 3.]])
, which is correct (or is it??).
box_area
returns 9 instead of 16 because the area computation is simply computed as
(boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
When the the box coordinates are indices, the correct way to compute the area would be to add + 1
to both dimensions, but the problem is that this would make the computation incorrect in when the box coordinates are real numbers (which they often are).
The issue is similar with box_convert
.
I'm honestly not sure how to best solve this. Should we decide that when a box is represented as indices, the upper values (x2 and y2) should be offset by 1? I.e. mask_to_boxes
should return 4, 0, 8, 4
instead? This would be consistent with Python indexing, i.e. we could do something like the_box_img = orig_img[x1:x2, y1:y2]
, which is incorrect right now.
Thoughts @fmassa @datumbox?
I agree with that the mask_to_boxes
should return 4, 0, 8, 4
instead. This return is more consistent with the roi_align
ops when the align
flag is True.
Thanks for the ping.
which returns [4, 4, 3, 3] => expected output [4, 4, 4, 4]
I believe there is a typo on this comment. This should have been:
which returns [4, 0, 3, 3] => expected output [4, 0, 4, 4]
I agree with that the mask_to_boxes should return 4, 0, 8, 4 instead.
The caveat to doing this is that the coordinates of the right element might be outside of the image.
The caveat to doing this is that the coordinates of the right element might be outside of the image.
Do you think it's common for users to directly index with the coordinates? I would assume that they would only use patterns with ranges, like the_box_img = orig_img[x1:x2, y1:y2]
, in which case this isn't really a problem
I can't speak for all use-cases but I think it's a reasonable thing to do. On several over the models we even have methods to ensure that the predictions are within the bounds to avoid issues. Look at clip_boxes_to_image()
for example.
My 2 cents, from what I noticed while writing the utlity and tutorial.
I would like to point out another case. PenFudan Dataset assumes even the xmin, ymin along with xmax, ymax to be shifted by 1.
E.g. we convert the boxes to. (Even in the official tutorial)
tensor([[ 96., 134., 181., 417.],
[286., 113., 357., 331.],
[363., 120., 436., 328.]])
While the dataset txt files have
Bounding box for object 1 "PASpersonWalking" (Xmin, Ymin) - (Xmax, Ymax) : (97, 135) - (182, 418)
Bounding box for object 2 "PASpersonWalking" (Xmin, Ymin) - (Xmax, Ymax) : (287, 114) - (358, 332)
Bounding box for object 3 "PASpersonWalking" (Xmin, Ymin) - (Xmax, Ymax) : (364, 121) - (437, 329)
I think it might be better to introduce a parameter called offset and allow users to pass a Tuple[float, float, float, float]
?
Since it is more popular to use this utility as we have it currently, it might not be nice to modify it.
PenFudan Dataset assumes even the xmin, ymin along with xmax, ymax to be shifted by 1. E.g. we convert the boxes to. (Even in the official tutorial)
Do you mind pointing to the place where we do such offsetting? I don't see this being done in https://pytorch.org/tutorials/intermediate/torchvision_tutorial.html
My point is we don't do but the annotations from original dataset do.
So how do we end up with
tensor([[ 96., 134., 181., 417.],
[286., 113., 357., 331.],
[363., 120., 436., 328.]])
?
Sorry for the confusion @NicolasHug. Let me elaborate.
Let's say we have a bounding box on the 0th pixel (just a dot)
We could write it as [1.0, 1.0. 1.0, 1.0]
Meaning that we are at the 1st element of the image.
Or
We could write it as [0.0, 0.0, 0.0, 0.0]
Meaning we are 0th element of the image.
I think it is matter of 0 indexing or 1 indexing, and hence the offset of 1 between PenFudan Annotations and our masks_to_boxes
.
PenFudan assumed 1 indexed boxes and I think we return 0 indexed boxes from the masks_to_boxes
function.
Hence we are 1 less in every dimension.
As @NicolasHug pointed out using indexes for area calculation is wrong! (We should bump length and breadth by y 1)
Again let's take an example A rectangle as length 3, It has breadth 4 Hence actual area = 12 (3 * 4)
By indexing it has index = 2. By indexing it's index = 3. Indexing area = 6 (2 * 3))
So clearly using indexed to calculate area won't work!
We currently return tensor([[4., 0., 7., 3.]])
which is correct! According to our assumptions.
If we are calculating the width / area based on this 0 based indexing, of course the answer will be wrong :) (As pointed from above) One can't calculate width
/ area
with indices. To make box_convert calculate correctly, we need +1 for xmax and ymax.
Our assumption is a valid one, we assume that we return boxes with 0 indexes.
Should we return with 1 index, we would match PenFudan. The area problem would still remain!
The area calculation don't anyways, coz length an breadth would still remain same. (By 1 based indexing the difference becomes (xmax + 1 - (xmin + 1)) * ((ymax + 1) - (ymax - 1))
If the user wants to calculate the area. He should offset either dimensions by 1 (extend xmax and ymax, which would in turn increase the length and breadth).
Should he like a 1 based indexing, he should increase all the dimensions (xmin, ymin, xmax, ymax) by 1.
Hence I proposed an offset
parameter for masks_to_boxes
which will accommodate both the cases. (again Naming is NP Hard, suggest better names)
Or alternatively a new op
offset_boxes() which would separately handle this indexing issue.
P.S. I too might be possibly wrong. It's kinda confusing.
cc @NicolasHug @datumbox @npmhung
@oke-aditya if the PenFundan dataset uses 1-based indexing, does that mean that the box coordinates in https://pytorch.org/tutorials/intermediate/torchvision_tutorial.html are incorrect?
Yes I can confirm that from the colab notebook We are doing 0 based indexing and PenFudan took 1 based.
Image_id 0. We calculated
'boxes': tensor([[159., 181., 301., 430.],
[419., 170., 534., 485.]]), 'image_id': tensor([0]),
[Annotation FudanPed00001.txt…]()
Bounding box for object 1 "PASpersonWalking" (Xmin, Ymin) - (Xmax, Ymax) : (160, 182) - (302, 431)
Bounding box for object 2 "PASpersonWalking" (Xmin, Ymin) - (Xmax, Ymax) : (420, 171) - (535, 486)
(Even the image_id they used 1 based indexing and we used 0 :sweat_smile: )
Sorry, I still don't see where we convert 160 to 159?
I don't get you, we don't convert 160 to 159. (I agree)
The box co-ordinates are mismatched
but not incorrect, as 0 based indexing still works fine to calculate stuff.
Edit: Am I doing something incorrect?
I'm really confused - where did you get the content of the Annotation FudanPed00001.txt
file from?
From the dataset itself. https://www.cis.upenn.edu/~jshi/ped_html/PennFudanPed.zip
Are we using these annotations at all as ground truth targets?
Nope, we are not using them as ground truth. Edit sorry if penFudan is confusing here. I just wanted to point out another use case
Thanks.
Then I'm not sure this 0 vs 1 indexing discussion is relevant to what we're concerned with here. Python is 0-based and we're following that too.
The problem we have here is the relation between x1 - x0 and y1 - y0, and how it plays out in the area calculation, and other conversions. Those concerns are valid whether one uses 0 or 1-based indexing, but I don't think we should worry about supporting 1-based indexing.
Yes, indexing shouldn't matter and the issue of area will exist in both. (Hopefully users don't want 1 indexing based boxes :crossed_fingers: )
What do you think we are doing wrong? (If it is) Should we be offsetting the values of xmax, ymax by 1?
Yep, box_convert should be extremely clear how it handles width, float coordinates, integer coordinates and maybe provide options / recommendations in the docs
This blog post talks about this at length: https://ppwwyyxx.com/blog/2021/Where-are-Pixels/#Choices-of-Sampling-Grid. I assume the author of the blog post is included and illuminates this discussion :)
We should return area / sizes as x2 - x1
, and not x2 - x1 + 1
as it was previously done in early versions of other libraries (like Detectron).
This was a deliberate decision, and as @rbgirshick pointed out in the past
Anyone interested in this issue should read http://alvyray.com/Memos/CG/Microsoft/6_pixel.pdf. I believe that this provides the correct framework for thinking about continuous geometric entities (e.g., boxes, polygons, etc) and discrete rasterized entities (e.g., bit masks, feature maps, etc.).
To remove confusion, it would be nice to have a push to completely merge all these bbox-format related transforms and conversions with detectron2 code and move it all to torchvision
Feature suggestion
When checking the how the new operations in the torchvision 0.11 work, I found an inconsistency between different function.
Suggest a potential alternative/fix
This is not actually an error but could cause confusion.
I suggest to make the box representations between different function more consistent.