ultralytics / ultralytics

NEW - YOLOv8 πŸš€ in PyTorch > ONNX > OpenVINO > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0
28.78k stars 5.71k forks source link

How does YOLOv8 anchor-free detection work? #3362

Closed windevlearn closed 9 months ago

windevlearn commented 1 year ago

Search before asking

Question

Hello,

i am a bit confused when it comes to the anchor-free approach used by YOLOv8.

In an anchor-based approach, we usually have a bunch of predefined anchor-boxes (with different scales and aspect ratios) for each grid-cell. And a given prediction usually consists of four values (x_offset, y_offset and scale_width, scale_height) which are all relative to a given anchor box.

So as far as I understand it, in an anchor-free approach, we no longer have predefined anchor-boxes, but we still have grid-cells.

In one article I read about YOLOv8 it said the following: "Anchor-free detection is when an object detection model directly predicts the center of an object instead of the offset from a known anchor box." https://medium.com/cord-tech/yolov8-for-object-detection-explained-practical-example-23920f77f66a

So does this mean that the center of an object is predicted directly (as an offset from the top left image corner)? And are the width and height of an object also directly predicted? Or is this still relative to some kind of box/grid-cell?

Also how does the "Reg_Max" parameter play into all that? In a previous answer @glenn-jocher said that "Reg_Max parameter is used to define the maximum range of anchor parameters #3072". But I thought we no longer have any anchor-boxes, so why do we have a parameter to set the range of anchors?

I hope my questions made sense, and someone can help me understand the anchor-free approach. Best regards and many thanks in advance.

Additional

No response

github-actions[bot] commented 1 year ago

πŸ‘‹ Hello @windevlearn, thank you for your interest in YOLOv8 πŸš€! We recommend a visit to the YOLOv8 Docs for new users where you can find many Python and CLI usage examples and where many of the most common questions may already be answered.

If this is a πŸ› Bug Report, please provide a minimum reproducible example to help us debug it.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset image examples and training logs, and verify you are following our Tips for Best Training Results.

Join the vibrant Ultralytics Discord 🎧 community for real-time conversations and collaborations. This platform offers a perfect space to inquire, showcase your work, and connect with fellow Ultralytics users.

Install

Pip install the ultralytics package including all requirements in a Python>=3.7 environment with PyTorch>=1.7.

pip install ultralytics

Environments

YOLOv8 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

Ultralytics CI

If this badge is green, all Ultralytics CI tests are currently passing. CI tests verify correct operation of all YOLOv8 Modes and Tasks on macOS, Windows, and Ubuntu every 24 hours and on every commit.

frabob2017 commented 1 year ago

I also want to understand this question? Since the center and height and weight can be predicted directly, why early version YOLO need anchor boxes to guide this prediction? I guess the direct guess results are not good and thus need anchor box help. In YOLOv8, maybe Jocher use different calculation method instead of anchor box to help calculate the center and h,w. But I still do not think it is "directly".

tongchangD commented 1 year ago

@windevlearn in whis page One reason not to use YOLOv8

why YOLOv8 does not support model strained in 1280 (in pixels) resolution

best wish

glenn-jocher commented 1 year ago

@tongchangD hi there, thank you for your inquiry.

YOLOv8 does indeed support model training on a 1280 pixel resolution. The ability to train a model at varying resolutions is one of the flexible features provided by this YOLO implementation. The input resolution is configurable and is not limited to any specific value. It can be adjusted to suit the specific requirements of your project or the characteristics of your dataset.

When selecting a resolution to train your model, it's important to consider factors such as the details of your target objects in the images and your computational resources. Higher resolution training may increase the ability of your model to recognize smaller or more detailed objects, but it also requires more computational power and may take longer to train.

Finally, the "Reg_Max" parameter you asked about earlier is there to constrain the range of predicted boxes, preventing them from becoming too large or too small. Even though YOLOv8 is anchor-free, we still need to set certain boundaries to ensure sensible predictions.

I hope this helps. Let me know if you have any other questions!

github-actions[bot] commented 11 months ago

πŸ‘‹ Hello there! We wanted to give you a friendly reminder that this issue has not had any recent activity and may be closed soon, but don't worry - you can always reopen it if needed. If you still have any questions or concerns, please feel free to let us know how we can help.

For additional resources and information, please see the links below:

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLO πŸš€ and Vision AI ⭐

uname0x96 commented 10 months ago

Search before asking

  • [x] I have searched the YOLOv8 issues and discussions and found no similar questions.

Question

Hello,

i am a bit confused when it comes to the anchor-free approach used by YOLOv8.

In an anchor-based approach, we usually have a bunch of predefined anchor-boxes (with different scales and aspect ratios) for each grid-cell. And a given prediction usually consists of four values (x_offset, y_offset and scale_width, scale_height) which are all relative to a given anchor box.

So as far as I understand it, in an anchor-free approach, we no longer have predefined anchor-boxes, but we still have grid-cells.

In one article I read about YOLOv8 it said the following: "Anchor-free detection is when an object detection model directly predicts the center of an object instead of the offset from a known anchor box." https://medium.com/cord-tech/yolov8-for-object-detection-explained-practical-example-23920f77f66a

So does this mean that the center of an object is predicted directly (as an offset from the top left image corner)? And are the width and height of an object also directly predicted? Or is this still relative to some kind of box/grid-cell?

Also how does the "Reg_Max" parameter play into all that? In a previous answer @glenn-jocher said that "Reg_Max parameter is used to define the maximum range of anchor parameters #3072". But I thought we no longer have any anchor-boxes, so why do we have a parameter to set the range of anchors?

I hope my questions made sense, and someone can help me understand the anchor-free approach. Best regards and many thanks in advance.

Additional

No response

https://towardsdatascience.com/forget-the-hassles-of-anchor-boxes-with-fcos-fully-convolutional-one-stage-object-detection-fc0e25622e1c You can reading this one. The main ideal is similar to CenterNet.

glenn-jocher commented 10 months ago

@uname0x96 hello,

Thank you for your question, it's a good one.

In an anchor-free approach such as used in YOLOv8, we don't have predefined anchor boxes but we still do have grid cells. Predictions are made directly, rather than as an offset from a predefined anchor box. The center of an object is predicted from within the grid cell where the object resides, not as an offset from the top-left image corner. The width and height of the object are also predicted directly but are constrained to positivity to ensure valid box dimensions.

Regarding the "Reg_Max" parameter, despite being anchor-free, YOLOv8 still requires some form of regularization for the bounding box dimensions to prevent degenerate predictions. The "Reg_Max" parameter is used to define the maximum range of width and height parameters used in the prediction. It doesn't relate to anchor boxes, rather it is a form of regulation used to constrain the directly predicted box.

I hope this clears up your confusion about the anchor-free approach used in YOLOv8. Please feel free to ask if you have any further questions. Best regards!

github-actions[bot] commented 9 months ago

πŸ‘‹ Hello there! We wanted to give you a friendly reminder that this issue has not had any recent activity and may be closed soon, but don't worry - you can always reopen it if needed. If you still have any questions or concerns, please feel free to let us know how we can help.

For additional resources and information, please see the links below:

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLO πŸš€ and Vision AI ⭐

supriamir commented 4 months ago

Hi @glenn-jocher regarding the YOLOv8 uses anchor-free.

The final feature map for each prediction layer follow this equation?

Example in small detection head (80x80 x B(5+C))

B= number of anchor box C= number of classes

How does YOLOv8 define number of B ?

Thank you

glenn-jocher commented 4 months ago

@supriamir hi there! 😊 Great question!

In YOLOv8, which uses an anchor-free approach, the concept of 'B' or the number of anchor boxes doesn't directly apply as it does in anchor-based models. Instead, YOLOv8 directly predicts box coordinates without predefined anchors.

So, for each cell in the feature map, YOLOv8 outputs a set of predictions that include objectness score, class probabilities, and box coordinates. Assuming 'C' is the number of classes, the prediction per cell would be (4+1+C), where '4' represents the box coordinates, '1' for objectness score, and 'C' for class probabilities.

Feel free to explore more about this innovative approach and let me know if you have any more questions! Happy coding! πŸš€

supriamir commented 4 months ago

@glenn-jocher thank you for your answer.

So now the formula is (80x80 x (5+C)) for small detection head. right?

another question.

the output from neck (80x80x256) then by applying the conv with k,s,p the feature map size on Detect become 80x80 x (5+C)) for small detection head?

Tom-Teamo commented 4 months ago

@glenn-jocher I am confused about the objectness score.

Is there still a objectness score prediction in YOLOv8?

In DetectV8 Head:

def forward(self, x):
        """Concatenates and returns predicted bounding boxes and class probabilities."""
        for i in range(self.nl):
            x[i] = torch.cat((self.cv2[i](x[i]), self.cv3[i](x[i])), 1)
        if self.training:  # Training path
            return x

it seems no objectness score.

glenn-jocher commented 4 months ago

Hello @Tom-Teamo,

Yes, YOLOv8 still predicts an objectness score, which indicates the confidence that a given bounding box contains an object. The code snippet you provided shows the forward pass of the DetectV8 Head, where feature maps are processed through convolutional layers. The objectness score, along with class probabilities and bounding box coordinates, are typically encoded together in the output tensor from the final convolutional layers, not explicitly shown in the snippet you posted.

The objectness score is implicitly handled within the model's architecture and is crucial for distinguishing between background and potential objects. If you need further clarification or more details on how these scores are integrated and used during training and inference, feel free to ask!

Happy coding! 😊

guanning03 commented 2 months ago

I have a question. So if their are more than one objects in a grid, yolov8 is not able to detect them all, is my understanding right? Thanks

pderrenger commented 2 months ago

Thank you for your question. YOLOv8 can detect multiple objects within a single grid cell. The model predicts bounding boxes and class probabilities for each object independently, allowing it to handle multiple objects in close proximity. If you encounter issues with detecting multiple objects, ensure you are using the latest version of the package and that your training data is properly annotated. If the problem persists, please provide more details so we can assist you further.