fdwr opened this issue 3 years ago
I will investigate this
Thanks Hari. Note that the coordinates of RoiAlign are on an infinite floating point grid, unlike RoiPool (with integers and that weirdo -1 size adjustment). So a ROI (x1,y1,x2,y2) of [1.6, 1.3, 3.3, 2.75] means a region size of width=1.7 (3.3 - 1.6) and height=1.45 (2.75 - 1.3).
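To make that coordinate model concrete, here's a tiny sketch (plain Python, using the example numbers above) of how region and bin sizes fall out of the continuous grid — widths and heights are plain differences, with no RoiPool-style "-1" adjustment:

```python
# ROI corners on the continuous floating-point grid.
x1, y1, x2, y2 = 1.6, 1.3, 3.3, 2.75

roi_w = x2 - x1   # 1.7  -- plain difference, no -1 size adjustment
roi_h = y2 - y1   # 1.45

# For a 2x2 output, each bin covers roi_w/2 x roi_h/2 of the input,
# regardless of whether the bin boundaries land on pixel centers.
out_h, out_w = 2, 2
bin_w = roi_w / out_w   # 0.85
bin_h = roi_h / out_h   # 0.725
```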
Added some diagrams in the above table, with blue as the input image and orange as the regions to write to output (the dots are the blue input sample points and orange output points).
Hi Dwayne,
I investigated this and here are my findings:
1) In your table above, this doesn't seem right:
| | torchvision.ops.roi_align(aligned=False…) *deprecated, legacy flag still exists | [13.75, 14.25, 14.75, 15.25], [18.75, 19.25, 19.75, 20.25], [23.75, 24.25, 24.75, 25.25], [28.75, 29.25, 29.75, 30.25] |
|---|---|---|
| ??? 🙃 | ONNX Runtime 1.7 CPU EP RoiAlign | [6.1875, 6.75, 6.75, 7.3125], [11.8125, 12.375, 12.375, 12.9375], [11.8125, 12.375, 12.375, 12.9375], [17.4375, 18, 18, 18.5625] |
The reason is that these results are produced by ORT when the operator's pooling mode is 'max'. From the looks of it, TorchVision only seems to support 'avg' pooling. As soon as you change the mode to 'avg', you will see that ORT's results match TorchVision's (aligned = False).
So, my conclusion is ORT's CPU implementation (in avg mode) == Legacy Torch ROIAlign (aligned = False)
2) To ensure backwards compatibility, TorchVision seems to have introduced the aligned flag, giving the user the option of "fixing the misalignment" (the legacy aligned = False is still the default mode, though). In the Detectron project, their RoiAlign wrapper defaults the 'aligned' flag to True, and this nuance is seemingly the root cause of the diffs with respect to ORT. So, to summarize: ORT's CPU backend implements the legacy mode, while the DirectML backend (like Detectron) implements the "misalignment fixed" logic. Does this make sense?
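A minimal sketch of that nuance as I understand it (hypothetical helper, mirroring TorchVision's aligned-flag semantics): both modes place sample points at the centers of output sub-bins, but only aligned=True shifts the input lookup back by half a pixel to compensate:

```python
def sample_coord(roi_start, bin_index, bin_size, aligned):
    """Continuous input coordinate sampled for one output bin (one axis)."""
    # Center of this output bin in continuous ROI coordinates.
    center = roi_start + (bin_index + 0.5) * bin_size
    # aligned=True ("misalignment fixed"): subtract 0.5 so the continuous
    # coordinate lines up with pixel centers on the integer sampling grid.
    # aligned=False (legacy): keep the +0.5 output offset but skip the
    # compensating input shift, landing half a pixel off.
    return center - 0.5 if aligned else center

# Identity-style ROI starting at x=1.0 with 1-pixel bins:
print(sample_coord(1.0, 0, 1.0, aligned=True))   # 1.0 -> exactly pixel 1
print(sample_coord(1.0, 0, 1.0, aligned=False))  # 1.5 -> between pixels 1 and 2
```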
Unfortunately, I don't know how to fix this without breaking backwards compatibility. The way I see it, we may have to introduce a new attribute in ONNX for this op (just like TorchVision introduced the aligned flag) that would allow the user to pick which implementation they would like. What do you think?
I fixed the table above and the wording to show it matches the deprecated PyTorch behavior (oops, sorry for the "max" vs "avg" mixup when I recorded that result 😅). Yeah, back compat is a concern. Really, ONNX should add an attribute to RoiAlign like it did with Resize, defaulting to half_pixel (ONNX converters from old opsets can set it to the legacy behavior). Thanks for investigating.
I'll plot the output rectangles from both approaches over the original image to contrast them. I've opened an ONNX issue.
**Describe the bug**
The RoiAlign operator, per the Mask R-CNN paper and Facebook Research's Detectron 2 implementation, aligns sampling points over the centers of pixels, but ORT's CPU implementation is misaligned by half a pixel. After comparing ORT to various references (table below), I see the current ORT code duplicated PyTorch's earlier bug in roi_align, which offset the output subsample points by 0.5 but forgot to adjust the input sample coordinates to compensate (see the comment in their code: "*the original roialign (aligned=False) does not subtract the 0.5 when computing neighboring pixel indices and therefore it uses pixels with a slightly incorrect alignment (relative to our pixel model) when performing bilinear interpolation*").
From the paper, note pixel centers used for interpolation:
This isn't as evident for larger input image regions, where the misalignment becomes less important relative to the overall region size, but it makes quite a difference for smaller regions. Even identity cases are misaligned (where the region of interest exactly matches the output tensor size): e.g. taking the middle 2x2 slice of a 4x4 input to a 2x2 output (integer coordinates, no scale factor) should yield exactly that input slice, but ORT's results are shifted half a pixel off.
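To illustrate the identity case, here is a minimal NumPy sketch (not ORT's actual kernel — it assumes one sample per output bin, i.e. sampling_ratio=1, and no spatial_scale): with the half-pixel shift applied, the identity ROI returns the input slice exactly, while the legacy mode blurs it:

```python
import numpy as np

def bilinear(img, y, x):
    """Bilinear interpolation on an integer grid; pixel (r, c) sits at (y=r, x=c)."""
    h, w = img.shape
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    ly, lx = y - y0, x - x0
    return (img[y0, x0] * (1 - ly) * (1 - lx) + img[y0, x1] * (1 - ly) * lx +
            img[y1, x0] * ly * (1 - lx) + img[y1, x1] * ly * lx)

def roi_align_1sample(img, roi, out_h, out_w, aligned):
    """Toy RoiAlign: one sample at the center of each output bin."""
    x1, y1, x2, y2 = roi
    bin_h = (y2 - y1) / out_h
    bin_w = (x2 - x1) / out_w
    # aligned=True: shift by -0.5 so sample points line up with pixel centers.
    off = 0.5 if aligned else 0.0
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = bilinear(img, y1 + (i + 0.5) * bin_h - off,
                                      x1 + (j + 0.5) * bin_w - off)
    return out

img = np.arange(16, dtype=float).reshape(4, 4)
print(roi_align_1sample(img, (1, 1, 3, 3), 2, 2, aligned=True))   # [[5, 6], [9, 10]] == img[1:3, 1:3]
print(roi_align_1sample(img, (1, 1, 3, 3), 2, 2, aligned=False))  # [[7.5, 8.5], [11.5, 12.5]], half a pixel off
```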
**Urgency**
No deadline.
**System information**
**To Reproduce**
**Expected behavior**
[[[[11, 12], [21, 22]]]]
[[[[5.50, 5.75], [8.00, 8.25]]]]
[[[[ 8.25, 8.75, 9.25, 9.75], [13.25, 13.75, 14.25, 14.75], [18.25, 18.75, 19.25, 19.75], [23.25, 23.75, 24.25, 24.75]]]]
[[[[6.1875, 6.75, 6.75, 7.3125], [11.8125, 12.375, 12.375, 12.9375], [11.8125, 12.375, 12.375, 12.9375], [17.4375, 18, 18, 18.5625]]]]
**Screenshots**
e.g.
**Additional context**
This affects the faster_rcnn and mask_rcnn models in WinML, whose expected output results appear to have been recorded using the incorrect CPU alignment in the first place; DML follows half-pixel alignment (matching Detectron 2) and so gets different results than the output .PB files.
Here is an example case (modified from the Detectron test case) with a comparison to other framework results:
| Implementation | Result |
|---|---|
| | [13.25, 13.75, 14.25, 14.75], [18.25, 18.75, 19.25, 19.75], [23.25, 23.75, 24.25, 24.75] |
| | [13.25, 13.75, 14.25, 14.75], [18.25, 18.75, 19.25, 19.75], [23.25, 23.75, 24.25, 24.75] |
| coordinate_transformation_mode=half | [13.25, 13.75, 14.25, 14.75], [18.25, 18.75, 19.25, 19.75], [23.25, 23.75, 24.25, 24.75] |
| torchvision.ops.roi_align(aligned=True…) | [13.25, 13.75, 14.25, 14.75], [18.25, 18.75, 19.25, 19.75], [23.25, 23.75, 24.25, 24.75] |
| torchvision.ops.roi_align(aligned=False…) *deprecated, legacy flag still exists | [18.75, 19.25, 19.75, 20.25], [23.75, 24.25, 24.75, 25.25], [28.75, 29.25, 29.75, 30.25] |
| | [18.75, 19.25, 19.75, 20.25], [23.75, 24.25, 24.75, 25.25], [28.75, 29.25, 29.75, 30.25] |
| tf.image.crop_and_resize(…) *Note boxes are normalized 0 to 1 (so /5 each ROI element) | [17.66, 18.33, 19.00, 19.66], [24.33, 25.00, 25.66, 26.33], [31.00, 31.66, 32.33, 33.00] |
| tf.image.resize_bilinear(align_corners=True…) + tf.slice | [17.66, 18.33, 19.00, 19.66], [24.33, 25.00, 25.66, 26.33], [31.00, 31.66, 32.33, 33.00] |
| tf.image.resize_bilinear(align_corners=False…) + tf.slice | [16.00, 16.50, 17.00, 17.50], [21.00, 21.50, 22.00, 22.50], [26.00, 26.50, 27.00, 27.50] |
| tf.image.resize_bilinear(half_pixel_centers=True…) + tf.slice | [13.25, 13.75, 14.25, 14.75], [18.25, 18.75, 19.25, 19.75], [23.25, 23.75, 24.25, 24.75] |
| torch.nn.functional.interpolate | |
| tf.keras.layers.UpSampling2D | |
Even the ONNX backend conformance test case has these misaligned numbers: https://github.com/onnx/onnx/blob/master/onnx/backend/test/case/node/roialign.py
PyTorch sample code:
TensorFlow sample code:
Facebook research's Detectron 2 test code: