On the correctness of inference of positions

ultralytics / yolov5

YOLOv5 🚀 in PyTorch > ONNX > CoreML > TFLite

https://docs.ultralytics.com

GNU Affero General Public License v3.0

On the correctness of inference of positions #368

Closed jerry73204 closed 4 years ago

jerry73204 commented 4 years ago

🐛 Bug

I noticed a potential error in the following line.

y[..., 0:2] = (y[..., 0:2] * 2. - 0.5 + self.grid[i].to(x[i].device)) * self.stride[i]  # xy

If I understand it correctly, self.stride[i] can be regarded as grid width in pixels, and self.grid[i] can be regarded as enumerated (x, y) in units of grids.

The term (y[..., 0:2] * 2. - 0.5 + self.grid[i].to(x[i].device)) can be seen as a compensation on positions in grid units. That is, it computes (x + Δx, y + Δy) in grid units. Then, by multiplying self.stride[i], it turns to pixel units and is saved to y[..., 0:2].

It made me wonder why the offset term y[..., 0:2] * 2. - 0.5 is chosen to be asymmetric. The term y[..., 0:2] is a sigmoid output, so y[..., 0:2] * 2. - 0.5 has range (-0.5, 1.5). This means the offset is not centered at zero.
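
To make that range concrete, here is a minimal pure-Python sketch. It is not the repository's code: `sigmoid` and `xy_offset` are hypothetical scalar helpers standing in for the tensor expression above, and the logits -20, 0, and 20 are arbitrary probe values.

```python
import math


def sigmoid(t: float) -> float:
    """Logistic function, mapping any real logit into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))


def xy_offset(t: float) -> float:
    """Offset in grid units for a raw logit t: sigmoid(t) * 2 - 0.5."""
    return sigmoid(t) * 2.0 - 0.5


# Probe the offset at a very negative, a zero, and a very positive logit.
lo = xy_offset(-20.0)   # approaches -0.5
mid = xy_offset(0.0)    # exactly 0.5
hi = xy_offset(20.0)    # approaches 1.5

print(lo, mid, hi)
```

The interval runs from about -0.5 to about 1.5, so its midpoint is 0.5 rather than 0.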

Expected behavior

I expect the formula to be

y[..., 0:2] = (y[..., 0:2] * 2. - 1.0 + self.grid[i].to(x[i].device)) * self.stride[i]  # xy

Environment

OS: Arch Linux
GPU: 2080 Ti

glenn-jocher commented 4 years ago

@jerry73204 these regression ranges span a grid cell (0.0 - 1.0) symmetrically; they have mean 0.5, not mean 0.0.

jerry73204 commented 4 years ago

Got it. Thanks.

glenn-jocher commented 4 years ago

The way to think about it is that while a position inside the grid cell is defined with respect to the (0, 0) origin, the actual receptive field of the grid cell is centered at (0.5, 0.5). A neuron outputting zero will produce a regression output at the exact center of the grid cell, since sigmoid(0) = 0.5.
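
As a rough illustration of this point, the sketch below decodes a zero logit for one axis under both shifts discussed in this thread. It is a scalar stand-in, not the model code: `decode_x`, its `shift` parameter, and the example cell index 3 and stride 8 are assumptions chosen for illustration.

```python
import math


def sigmoid(t: float) -> float:
    """Logistic function, mapping any real logit into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))


def decode_x(t: float, grid_x: int, stride: int, shift: float = 0.5) -> float:
    """Decode a raw logit to a pixel coordinate along one axis.

    Mirrors (sigmoid(t) * 2 - shift + grid_x) * stride, where shift=0.5
    is the formula in the code and shift=1.0 is the variant proposed above.
    """
    return (sigmoid(t) * 2.0 - shift + grid_x) * stride


# Cell 3 at stride 8 covers pixels [24, 32); its center is at 28.
current = decode_x(0.0, grid_x=3, stride=8)               # shift = 0.5
proposed = decode_x(0.0, grid_x=3, stride=8, shift=1.0)   # shift = 1.0

print(current, proposed)
```

With the 0.5 shift, a zero logit lands on the cell center (28.0); with the 1.0 shift, it would land on the cell's origin corner (24.0).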

jerry73204 commented 4 years ago

Yes. As you pointed out, I realized * 2. - 0.5 expands the range but is still centered at 0.5.

I'd like to ask another question. I see that build_target in the loss function (code) scales the targets by gain. It looks like it converts the sizes and positions from ratios to grid units. Does that mean the GIoU is computed in grid units?

If so, I suppose the line computes the width/height ratio by dividing target sizes by anchor sizes, both in grid units? I think this relates to the place where the anchor sizes are divided in advance. If I understand it correctly, we could use distinct variable names for the anchors in pixels and the anchors in grid units to avoid confusion.

glenn-jocher commented 4 years ago

@jerry73204 yes, I think this all makes sense. The anchors at the moment are named for their use, so anchor_grid is applied to the grid during inference, etc.

The units of the operations are not mathematically important: the wh ratio and the GIoU can be calculated in any non-normalized units. They are implemented as is for computational efficiency.
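
The unit independence can be checked numerically. The following is a minimal sketch under stated assumptions, not the repository's loss code: `giou` and `to_grid_units` are hypothetical helpers, boxes are (x1, y1, x2, y2) tuples, and the box coordinates, anchor width, and stride of 8 are made-up example values.

```python
def giou(a, b):
    """Generalized IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Intersection area (zero if the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    # Area of the smallest box enclosing both inputs.
    hull = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return inter / union - (hull - union) / hull


def to_grid_units(box, stride):
    """Rescale a pixel-unit box to grid units by dividing by the stride."""
    return tuple(v / stride for v in box)


stride = 8.0
pred_px = (16.0, 24.0, 80.0, 72.0)
target_px = (32.0, 16.0, 96.0, 80.0)

# GIoU in pixel units versus grid units.
giou_px = giou(pred_px, target_px)
giou_grid = giou(to_grid_units(pred_px, stride), to_grid_units(target_px, stride))

# Width ratio of target to anchor, in pixel units versus grid units.
anchor_w_px, target_w_px = 32.0, 64.0
ratio_px = target_w_px / anchor_w_px
ratio_grid = (target_w_px / stride) / (anchor_w_px / stride)

print(giou_px, giou_grid, ratio_px, ratio_grid)
```

Both quantities are ratios of lengths or areas, so a common scale factor cancels as long as both operands use the same units.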

glenn-jocher commented 4 years ago

BTW, the anchor values themselves are mostly the same output normalization (i.e. to unit variance and zero mean) that ML models have been using for decades; the main innovation here is the use of multiple anchors per grid cell. AutoAnchor will analyze any supplied anchors for suitability in combination with your supplied dataset, and recompute and integrate new anchors automatically before training starts.

daikankan commented 3 years ago

It seems that the regression target for the center is normalized by grid units (double the stride of the current feature map), and the regression target for the size is normalized by the predefined anchor sizes, right? @jerry73204 @glenn-jocher

glenn-jocher commented 3 years ago

Yes.