ultralytics / yolov5

YOLOv5 🚀 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0
50.33k stars 16.25k forks

How/why anchor boxes are chosen or not chosen #12196

Closed Ludaman closed 10 months ago

Ludaman commented 1 year ago

Search before asking

Question

Short form question: How does YOLOv5 choose which neurons in the last feature map get the loss applied to them and which do not? Or, phrased a little differently, which anchor boxes get the loss applied and which do not?

It gets conceptually foggy when you consider using CIoU, trying to pick and choose which object of many should be associated with an anchor box, and the fact that the receptive field means an anchor box may not know an object even exists on the other side of the image. Any tips on reading to better understand this are appreciated.

Longer form question: This is really a two-part question, and I have partial answers to both parts. The first part is: how is an anchor box or neuron chosen to get loss applied to it? I think I have a pretty good grasp of this, summarized below. The second part is: how is a neuron or anchor box chosen to NOT have an object associated with it? This one is more complicated; I attempt to answer it below. The complication comes from how CIoU functions: CIoU is great for helping non-overlapping Bboxes learn to overlap, but the defaults I see for the iou_t training progression never allow non-overlapping Bboxes to activate and take responsibility for an object, so they never learn to predict something the original anchor box didn't overlap.

Background: I am trying to do some research into predicting off-image bounding boxes of a larger rigid object. For example, a plane is in the image but its wing is off the edge of the image; how accurately can I predict the pixel the wingtip would be at if the image were larger? So, how the bounding boxes are chosen to have loss applied or not applied is indicative of what is even possible for detecting items off the edge of the image. I do have plans to test this regardless of the way loss is applied in YOLOv5, but I still want to thoroughly understand it.

How I think loss is applied and objects are chosen/matched to anchor boxes: The existing model predicts on an image, and the anchor boxes which predict an object are added to a list. Each item in that list is then compared against the truth boxes by IoU. If the IoU exceeds the threshold determined by the training plan, it is a good guess. A good guess means that the anchor box should be assigned to that object because it overlaps enough. If more than one truth box has IoU with an anchor box, the best one is chosen for the loss calculation. If the IoU is lower than the threshold, it is a bad guess, i.e. the anchor box is assumed not to be appropriate for the feature being detected. An anchor box may have a low IoU and be inappropriate if the truth box is far away from it in the image and doesn't overlap it at all.
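To make that matching rule concrete, here is a minimal sketch of the threshold-style assignment described above. It is illustrative only, not YOLOv5's actual build_targets code: the (x1, y1, x2, y2) box format, the helper names, and the threshold value are all assumptions for the example.

```python
def iou_xyxy(a, b):
    """Plain IoU for two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def match_anchors(anchors, truths, iou_t):
    """Assign each anchor its best-overlapping truth box, but only if
    the best IoU clears the threshold; otherwise the anchor is left
    unmatched (a "bad guess" that gets no positive assignment)."""
    assignments = {}
    for i, anchor in enumerate(anchors):
        best_j, best_iou = None, iou_t
        for j, truth in enumerate(truths):
            iou = iou_xyxy(anchor, truth)
            if iou > best_iou:
                best_j, best_iou = j, iou
        if best_j is not None:
            assignments[i] = best_j
    return assignments


anchors = [(0, 0, 10, 10), (50, 50, 60, 60)]
truths = [(2, 2, 12, 12)]
print(match_anchors(anchors, truths, iou_t=0.2))  # {0: 0} -- the distant anchor stays unmatched
```

Note the consequence the question is driving at: the second anchor never appears in the assignment dict, so under this scheme it would receive no positive loss for that truth box.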

Essentially, I think an object's truth Bbox in an image has to have an anchor box which just so happens to overlap it with sufficient IoU early in training to activate that anchor. If no anchor ever just so happens to activate, then no part of the network assumes responsibility for predicting that object, and hence the network never learns to predict it. This isn't necessarily a bad thing; it is simply how anchor boxes work.

Details on how a neuron determines no truth Bbox should be associated with it: The hard part to understand here is that an anchor box for a neuron in the last layer of a feature map may not have a positive activation at all. This makes sense because a neuron in the top left of an image will not have the receptive field to know about an object in the bottom right; it shouldn't be trained to try to guess that an object exists so far away that the receptive field doesn't even allow knowledge from that area to reach the neuron making the prediction. BUT, the YOLOv5 code base seems to show that CIoU is used, and a training plan for IoU looks like this: iou_t (0, 0.1, 0.7). The issue is that with CIoU, a non-overlapping Bbox would usually still have gradient information applied to it. However, that training plan uses 0 as the starting IoU, meaning that if the CIoU is less than 0 (and it can be with CIoU, and likely will be if there is low or no overlap), then the guess is deemed bad. Hence, if the anchor did activate, loss is applied to urge it not to activate; but if the anchor box has no IoU with nearby objects, it doesn't get any loss applied to it.
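The claim that CIoU goes negative without overlap can be checked directly. The sketch below implements the standard CIoU formula from Zheng et al. (IoU minus a normalized center-distance penalty minus an aspect-ratio penalty); it is a self-contained illustration, not YOLOv5's bbox_iou utility, and the box coordinates are made up for the demo.

```python
import math


def ciou(a, b, eps=1e-9):
    """Complete-IoU for two (x1, y1, x2, y2) boxes:
    CIoU = IoU - rho^2 / c^2 - alpha * v."""
    # plain IoU
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    iou = inter / (area_a + area_b - inter + eps)
    # rho^2: squared distance between box centers
    rho2 = (((a[0] + a[2]) - (b[0] + b[2])) ** 2
            + ((a[1] + a[3]) - (b[1] + b[3])) ** 2) / 4
    # c^2: squared diagonal of the smallest enclosing box
    c2 = ((max(a[2], b[2]) - min(a[0], b[0])) ** 2
          + (max(a[3], b[3]) - min(a[1], b[1])) ** 2) + eps
    # v: aspect-ratio consistency term
    v = (4 / math.pi ** 2) * (math.atan((a[2] - a[0]) / (a[3] - a[1]))
                              - math.atan((b[2] - b[0]) / (b[3] - b[1]))) ** 2
    alpha = v / (1 - iou + v + eps)
    return iou - rho2 / c2 - alpha * v


# Non-overlapping boxes: IoU is 0, but CIoU is negative because of
# the normalized center-distance penalty.
print(ciou((0, 0, 10, 10), (30, 0, 40, 10)))  # ~ -0.53
```

So under a threshold of 0, every non-overlapping pair lands below the cutoff, which is exactly the concern raised above.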

It is hard to wrap my head around why YOLOv5 chose to have iou_t start at 0. This is confusing because an iou_t of 0 will classify all non-overlapping boxes as bad guesses and punish them if the anchor box tried to activate with no overlap, whereas the big point of CIoU/GIoU is to migrate a Bbox toward overlapping another. As I said earlier, a neuron in one part of an image may have ZERO knowledge of a distant object due to its receptive field, and hence shouldn't be punished. So, you do not necessarily want to allow negative values of iou_t because that could confuse the network… but then why even use CIoU and not just use IoU?

Ending thoughts: With all of this, I do not necessarily see anything inherently bad; it's simply how one must program an actual implementation of anchor boxes with IoU once you consider that receptive fields exist, and I wanted to type this all out to see how far off my understanding is. But I do wonder why CIoU was used, given that a feature map neuron's anchor boxes basically have to fall on top of a truth Bbox for it to take responsibility for that object. Why not just use IoU? I think CIoU also gives the benefits of emphasizing aspect ratio and being more stable.

Additional

No response

glenn-jocher commented 1 year ago

@Ludaman anchor boxes in YOLOv5 are chosen based on the IoU (Intersection over Union) metric. During training, the model predicts anchor boxes for each object in the image. These anchor boxes are then compared to the ground truth bounding boxes using IoU. If the IoU exceeds the threshold set in the training plan, the anchor box is considered a good match and loss is applied to it. Otherwise, if the IoU is lower than the threshold, the anchor box is not considered a good match and no loss is applied.

Regarding your question about how a neuron or anchor box is chosen to not have an object associated with it, it is important to consider the receptive field of the neuron. A neuron in one part of the image may not have knowledge of a distant object due to its limited receptive field. Therefore, it should not be trained to predict that object. The choice of using CIoU instead of IoU helps with emphasizing aspect ratio and providing more stable predictions.
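The aspect-ratio benefit mentioned above can be isolated. The snippet below computes only the v term of the CIoU formula (from Zheng et al.); the function name and the example widths/heights are made up for illustration. The term is zero when predicted and true aspect ratios agree, and positive otherwise, so CIoU distinguishes shape mismatches that plain IoU can miss.

```python
import math


def aspect_term(w_pred, h_pred, w_true, h_true):
    """The v term of CIoU: penalizes aspect-ratio mismatch,
    independent of overlap (0 when the ratios agree)."""
    return (4 / math.pi ** 2) * (
        math.atan(w_true / h_true) - math.atan(w_pred / h_pred)) ** 2


print(aspect_term(20, 10, 20, 10))  # 0.0 -- matching 2:1 aspect, no penalty
print(round(aspect_term(10, 10, 20, 10), 3))  # 0.042 -- square guess vs 2:1 truth is penalized
```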

It is worth noting that the YOLOv5 implementation, including the criteria for choosing anchor boxes, is a result of continuous research and community contributions. If you're interested in further understanding the topic, I recommend referring to relevant research papers and engaging with the YOLO community.

I hope this explanation helps! Let me know if you have any more questions.

Ludaman commented 1 year ago

@glenn-jocher , thank you for the response and confirming most of my thoughts.

TLDR: Any pointers to good papers showing how IoU thresholds get applied to anchor boxes in actual CNNs? I can't say I have seen any which pay respect to the realities of receptive fields, CIoU, stability, and matching objects to anchor Bboxes; most just say it was applied without giving the rubber-meets-the-road details.

I can imagine why the iou_t training plan of (0, 0.1, 0.7) was chosen, but I am still a little confused, mainly by the selection of zero as the lower bound: from what I have seen playing around with CIoU, it will be negative when there is little or no overlap. Hence, an anchor box needs to be at least a mediocre guess to be considered a good guess; no training of bad guesses is entertained.

I imagine this iou_t training plan helps make training more stable, helps limit the receptive-field issue we discussed above, and still benefits from the other aspects of CIoU. However, it does disregard the benefit of CIoU allowing training of non-overlapping boxes, unless I am reading the code incorrectly regarding how iou_t from the train.py file is supposed to be applied.

Most of the papers I have found are more academic regarding the greatness of YOLO or GIoU. I say academic because they clearly show why GIoU is great but don't give the nuance of how they apply it in a real-world context such that the receptive-field concept isn't thrown out the window, training remains stable, etc. I expect they did something similar to you all, but I don't see that in their papers.

Any pointers to better research on matching anchors to truth Bboxes and applying training plans to CIoU would be appreciated. I'd also appreciate any comments you may have on whether I am reading the iou_t plan implementation correctly, and on how its starting at zero may disallow any non-overlapping boxes from learning.

Thanks again.

github-actions[bot] commented 11 months ago

👋 Hello there! We wanted to give you a friendly reminder that this issue has not had any recent activity and may be closed soon, but don't worry - you can always reopen it if needed. If you still have any questions or concerns, please feel free to let us know how we can help.

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLO 🚀 and Vision AI ⭐

glenn-jocher commented 11 months ago

@Ludaman you're welcome! It's great to see your deep interest and thoughtful analysis of the training process. While I don't have specific papers to point you to, I recommend looking at recent publications and preprints in the field of object detection and anchor-based algorithms.

When it comes to applying training plans to CIoU and addressing the issue of non-overlapping boxes, it's indeed a complex task that requires a careful balance between stability, handling receptive fields, and effectively training the model. It's likely that the implementation involves various considerations that aim to strike this balance while leveraging the benefits of CIoU.

As for the iou_t training plan in YOLOv5 and how it may disallow non-overlapping boxes from learning, this might be a design decision with specific trade-offs in mind. However, for more insights into the YOLOv5 implementation specifics, I suggest reaching out to the Ultralytics team directly or initiating a discussion in the YOLO community forums.

I hope this guidance helps you in your research endeavors, and I commend your curiosity and thorough exploration of these topics. If you have further questions or findings to share, feel free to discuss them here or in the broader community.

Keep up the great work!

github-actions[bot] commented 10 months ago
