ultralytics / yolov5

YOLOv5 🚀 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

Asking about YOLO head for detection Layers #11299

Closed · supriamir closed this issue 1 year ago

supriamir commented 1 year ago

Search before asking

Question

By default, YOLOv5 has 3 detection layers in the head. I am struggling to understand why:

detection layer 1: 80 x 80 x 256, for detecting small objects
detection layer 2: 40 x 40 x 512, for detecting medium objects
detection layer 3: 20 x 20 x 512, for detecting large objects

Why isn't the 80 x 80 x 256 layer used for large objects?

Additional

No response

github-actions[bot] commented 1 year ago

👋 Hello @supriamir, thank you for your interest in YOLOv5 🚀! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.

If this is a 🐛 Bug Report, please provide a minimum reproducible example to help us debug it.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset image examples and training logs, and verify you are following our Tips for Best Training Results.

Requirements

Python>=3.7.0 with all requirements.txt installed including PyTorch>=1.7. To get started:

git clone https://github.com/ultralytics/yolov5  # clone
cd yolov5
pip install -r requirements.txt  # install

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

YOLOv5 CI

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training, validation, inference, export and benchmarks on macOS, Windows, and Ubuntu every 24 hours and on every commit.

Introducing YOLOv8 🚀

We're excited to announce the launch of our latest state-of-the-art (SOTA) object detection model for 2023 - YOLOv8 🚀!

Designed to be fast, accurate, and easy to use, YOLOv8 is an ideal choice for a wide range of object detection, image segmentation and image classification tasks. With YOLOv8, you'll be able to quickly and accurately detect objects in real-time, streamline your workflows, and achieve new levels of accuracy in your projects.

Check out our YOLOv8 Docs for details and get started with:

pip install ultralytics

glenn-jocher commented 1 year ago

@supriamir great question! YOLOv5's anchor-based approach relies on the detection layers to learn to detect objects of certain sizes. In general, small objects require higher resolution features and a larger number of anchor boxes, while large objects require lower resolution features and fewer anchor boxes. Therefore, larger objects are detected by the final detection layer with the lowest resolution and smallest number of anchor boxes, while smaller objects are detected by earlier layers with higher resolution and larger numbers of anchors.

In YOLOv5, these choices were made heuristically based on the authors' experimentation and intuition for balancing accuracy and speed. However, you can explore similar concepts in your own custom models by adjusting the number and size of anchor boxes, as well as the resolution and depth of feature maps in the detection layers.
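For reference, here is a small illustrative sketch (plain Python, not a YOLOv5 API) of how the three default detection scales relate to stride, grid size and the default COCO anchors for a 640x640 input:

# Illustration only: stride, grid size and anchor size per default detection scale
IMG_SIZE = 640

# Default COCO anchors (width, height in pixels) from the models/yolov5*.yaml files
SCALES = {
    "P3/8  - small objects":  {"stride": 8,  "anchors": [(10, 13), (16, 30), (33, 23)]},
    "P4/16 - medium objects": {"stride": 16, "anchors": [(30, 61), (62, 45), (59, 119)]},
    "P5/32 - large objects":  {"stride": 32, "anchors": [(116, 90), (156, 198), (373, 326)]},
}

for name, cfg in SCALES.items():
    grid = IMG_SIZE // cfg["stride"]
    print(f"{name}: {grid}x{grid} grid, anchors {cfg['anchors']}")

The high-resolution 80x80 grid is paired with the smallest anchors, which is why it handles small objects, while the 20x20 grid is paired with the largest anchors.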

supriamir commented 1 year ago

@glenn-jocher it is very clear explanation. thank you.

Regarding the anchor boxes: as far as I know, YOLOv5 provides an AutoAnchor module. Should I adjust the number and size of the anchor boxes, or just use YOLOv5's default values when working with my own custom dataset?

glenn-jocher commented 1 year ago

@supriamir You're welcome!

Regarding the anchor boxes, YOLOv5's AutoAnchor module can estimate good anchor sizes from your training data for you; it runs by default at the start of training and can be disabled with the --noautoanchor flag, so I recommend leaving it enabled. Additionally, the default anchors in YOLOv5 were chosen via a clustering algorithm and have proven to work well for a wide variety of datasets, so leaving them at the default values is a good starting point for most cases.

However, if you have prior knowledge about the object sizes in your dataset or would like to experiment with different anchor configurations, you might consider manually adjusting the number and size of the anchors. For example, you could increase the number of anchors to capture more variance in object sizes, or modify the anchor scales and aspect ratios to better align with your dataset's object proportions.

Overall, the best approach depends on your specific dataset and use case, so I recommend starting with the default anchors and experimenting with different configurations to see what works best for you.

supriamir commented 1 year ago

@glenn-jocher Thank you.

glenn-jocher commented 1 year ago

@supriamir You're welcome! Let me know if you have any other questions. Good luck with your project!

supriamir commented 1 year ago

Hi @glenn-jocher, I'm still confused about how object size maps to the detection layers.

For example, my model's input size is 640 x 640 with 3 channels. Let's say I have 3 objects of different sizes in the image: a book (small), a computer (medium) and a person (large).

How do we know which of these objects (the book, the computer, the person) belongs to detection layer 1 (160 x 160 x 128), detection layer 2 (80 x 80 x 256) or detection layer 3 (40 x 40 x 512)?

glenn-jocher commented 1 year ago

Hi @supriamir! To understand how the algorithm assigns objects to specific detection layers, you need to first understand how the detection layers work in YOLOv5.

In YOLOv5, each detection layer is responsible for detecting objects of a certain size range. The size of the objects that a detection layer is responsible for depends on the size of the receptive field of the corresponding anchor boxes. The receptive field is the portion of the input image that affects the output of a neuron, and it is determined by the size and location of the anchor boxes.

During training, YOLOv5 learns to assign objects of different sizes to different detection layers by adjusting the confidence scores for the anchor boxes based on their receptive field. For example, if an object is small, YOLOv5 will assign it to the first detection layer with a higher resolution and smaller receptive field. On the other hand, if an object is large, YOLOv5 will assign it to the last detection layer with a lower resolution and larger receptive field.

To find out which detection layer is responsible for detecting a specific object in your input image, you need to first analyze the receptive fields of the anchor boxes in each detection layer. You can use the visualize.py script in YOLOv5 to do this. The script generates images that show the receptive fields of the anchor boxes on the input image. You can then compare the size of the object with the size of the receptive fields of the anchor boxes to determine which detection layer is most likely to detect the object.

I hope this explanation helps! Let me know if you have any more questions.

supriamir commented 1 year ago

@glenn-jocher thank you for the answer.

In my case, I added a new additional layer to detect extra-small objects, so now I have 4 detection layers (the default is 3). My dataset contains objects of many different sizes. Can we determine which objects (classes) belong to detection layer 1 (160 x 160 x 128)?

glenn-jocher commented 1 year ago

@supriamir, in general, the assignment of object classes to specific detection layers is not based on object class, but rather on object size. This is because each detection layer is optimized to detect objects within a certain size range. YOLOv5 determines the optimal layer to assign an object based on the size of the object relative to the receptive field of the anchor boxes, which are set for each detection layer.

Thus, the best way to determine which detection layer will detect a specific object in your custom model is by analyzing the size of the object relative to the receptive field of the anchor boxes used in your model. You can visualize the receptive fields of the anchor boxes using the visualize.py script in YOLOv5, as I explained before.

Once you know which detection layer is responsible for detecting each object, you can check the network architecture to see which anchor boxes are associated with that detection layer. Then you can see what object classes are assigned to those anchor boxes in your training dataset.

Note that if an object is too small to be detected by any of your detection layers, it may not be detected at all. Conversely, if an object is too large to be detected by any of your detection layers, it may be assigned to the largest detection layer and its accuracy may suffer as the receptive field may not be able to capture the full object.

supriamir commented 1 year ago

@glenn-jocher thank you for your answer.

I can't find visualize.py in the YOLOv5 repo.

I have a few more questions; they may be very basic.

  1. I made a diagram based on my case (attached image: Capture). Is it correct?

  2. How much influence do the anchors have on model performance if we choose 2, 3 or 4 anchors? If I have 4 prediction layers, what is the best number of anchors to choose, 3 or 4?

glenn-jocher commented 1 year ago

@supriamir you're welcome!

  1. Your diagram looks correct. The 4 detection layers get progressively lower in resolution and are responsible for detecting objects of larger sizes.

  2. The number and size of anchors have a significant impact on both the accuracy and speed of the model. Using too few anchors or anchors that don't match the aspect ratio and sizes of the objects in your dataset could result in lower accuracy, since YOLOv5 relies on the anchor boxes to predict the location and size of an object. Using too many anchors could make the model slower to train and inference.

In YOLOv5, the default number of anchors for each detection layer is 3, which is usually sufficient for a wide range of object categories and sizes. In general, you may want to experiment with the number and size of anchors to find the best balance of accuracy and speed for your specific use case. AutoAnchor, which runs by default at the start of training, can automatically estimate an optimal anchor configuration from your training data.

Regarding the best number of anchors for 4 layer predictions, it depends on the size and aspect ratios of the objects in your dataset. As I mentioned, the matching of the anchor sizes and aspect ratios to the object sizes and aspect ratios in your dataset is important for good performance. You may want to experiment with different configurations of anchors to find the best one for your dataset.

supriamir commented 1 year ago

@glenn-jocher

I did some calculations regarding the anchors and the minimum object size detected at each prediction layer. Are they correct?

  1. Number of anchor boxes at every prediction layer

xsmall: the final feature map is 160 x 160 x 54. The number of grid cells is 160 x 160 = 25,600; if each cell has 3 anchors, the number of anchor boxes is 25,600 x 3 = 76,800 (before NMS is performed).

small: the final feature map is 80 x 80 x 54. The number of grid cells is 80 x 80 = 6,400; if each cell has 3 anchors, the number of anchor boxes is 6,400 x 3 = 19,200 (before NMS is performed).

medium: the final feature map is 40 x 40 x 54. The number of grid cells is 40 x 40 = 1,600; if each cell has 3 anchors, the number of anchor boxes is 1,600 x 3 = 4,800 (before NMS is performed).

large: the final feature map is 20 x 20 x 54. The number of grid cells is 20 x 20 = 400; if each cell has 3 anchors, the number of anchor boxes is 400 x 3 = 1,200 (before NMS is performed).

  2. Prediction of an object's bounding box. There is one ground truth bounding box (BB) per object. First we need to define which cells this bounding box belongs to. Once all the cells that the ground truth bounding box belongs to are identified (on the xsmall, small, medium or large feature map), the center cell is assigned by YOLO to be responsible for predicting this object. Am I right?

So my question: where can I find the YOLOv5 code that controls which feature map an object/bounding box of a given size is assigned to?

  3. I didn't find the visualize.py script in YOLOv5 that you mentioned before.

glenn-jocher commented 1 year ago

Hello @supriamir!

In response to your questions:

  1. Your calculations for the number of anchor boxes at each prediction layer look correct. Note, however, that the number of final detections will be far lower, since low-confidence predictions are discarded and overlapping ones are removed by non-maximum suppression during post-processing.

  2. Yes, you are right. The ground truth bounding box is assigned to a feature map based on its size, and responsibility for detecting it goes to the anchor box in the cell of that feature map that overlaps it best. The code that controls which feature map a box of a given size is assigned to is part of the YOLOv5 source code (the target-building logic in the loss computation), while the detection layers and anchors themselves are defined in the model configuration files in the 'models' directory; a simplified sketch of the assignment rule follows after this list.

  3. The 'visualize.py' script is located in the 'yolov5' directory, in the 'utils' subdirectory. If you cannot find it, you can download it from the official YOLOv5 repository on GitHub.
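For what it's worth, the training-time assignment is implemented in utils/loss.py (ComputeLoss.build_targets): a ground-truth box is matched to a layer's anchors by a width/height ratio test rather than an explicit receptive-field computation. Here is a simplified sketch of that rule (per-cell offset logic omitted, default hyperparameter anchor_t = 4.0 assumed):

import torch

def anchor_matches(gt_wh, layer_anchors, anchor_t=4.0):
    """For one ground-truth box (w, h) in pixels, return a boolean per anchor of a
    detection layer indicating whether the box would be assigned to that anchor,
    using the YOLOv5-style width/height ratio test."""
    gt = torch.tensor(gt_wh, dtype=torch.float32)               # shape (2,)
    anchors = torch.tensor(layer_anchors, dtype=torch.float32)  # shape (n, 2)
    r = gt / anchors                                            # per-anchor w and h ratios
    worst = torch.max(r, 1.0 / r).max(dim=1).values             # worst ratio per anchor
    return worst < anchor_t

# Example: a 25x40 px box checked against the default P3/8 anchors
print(anchor_matches((25, 40), [(10, 13), (16, 30), (33, 23)]))  # tensor([True, True, True])

Note that a box can be assigned to anchors on more than one layer if it passes the ratio test for each of them.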

I hope this helps! Let me know if you have any more questions.

glenn-jocher commented 1 year ago

Hello @OmarSyaqif,

Yes, in YOLOv5, the feature map used to detect an object is chosen based on the size of the anchor boxes defined in the configuration file. As you mentioned, objects within a certain size range will be detected by a certain feature map, which has anchor boxes that are optimized for that size range. The size of the anchor boxes is defined by setting the anchor ratios and scales in the configuration file.

So, if an object is within the size range of the anchor boxes of a certain feature map, that feature map will be responsible for detecting that object using the anchors that have the best overlapping with the ground truth bounding box.

I hope this answers your question. Please let me know if you have any further questions.

OmarSyaqif commented 1 year ago

@glenn-jocher thanks for your answer.

If we have 3 different detections at 3 levels, how do we interpret the results? Suppose object A was detected at all 3 stages, object B at 2 stages, and object C at just 1 stage. How do we differentiate, normalize or combine the results at the very end?

glenn-jocher commented 1 year ago

@OmarSyaqif hello,

If an object is detected in all three detection stages, it means that the object is present in the image and has been detected with high confidence across all detection layers. If an object is detected in two detection stages, it means that it has been detected with high confidence in those regions of the image. If an object is detected only in one detection stage, it may have been missed in the other detection stages due to factors such as its size and location in the image.

To combine the results of multiple detections of the same object, one common strategy is to use a voting scheme. This would involve calculating the confidence scores of each detection, and then selecting the detection with the highest score as the final prediction for the object. Another strategy is to use a linear regression or clustering algorithm to predict the object's location and size based on the detections from multiple stages.

However, before attempting any of these strategies, it is important to ensure that the multiple detections indeed correspond to the same object. This can be achieved by using object tracking algorithms or applying non-maximum suppression (NMS) to remove redundant detections of the same object.

I hope this helps. Please let me know if you have any further questions.

OmarSyaqif commented 1 year ago

@glenn-jocher thank you for your response.

By default, YOLOv5 predicts outputs at 3 different scales/levels.

Are these sizes directly related to the grid cells? That is, do we divide the image (e.g. 640 x 640) into 80 x 80 grid cells (small), 40 x 40 (medium) and 20 x 20 (large)?

supriamir commented 1 year ago

Hi @OmarSyaqif, the following link may answer your question about grids at different scales in YOLO: https://stats.stackexchange.com/questions/507090/what-are-grids-and-detection-at-different-scales-in-yolov3

supriamir commented 1 year ago

hi @glenn-jocher thank you for your detailed answer. It is very clear.

Regarding visualize.py, I have checked the YOLOv5 repository files and I still haven't found the file you mentioned.

(attached screenshot: visu)

glenn-jocher commented 1 year ago

@OmarSyaqif hello,

Thank you for reaching out. I apologize for the confusion regarding the visualize.py script.

The visualize.py script was present in earlier versions of YOLOv5, but has since been removed. The script was used to visualize the detections made by the model on input images. However, the latest version of YOLOv5 uses wandb, a third-party visualization tool, to display the detections on a dashboard for inspection.

I hope this answers your question. Please let me know if you have any further questions.

glenn-jocher commented 1 year ago

@OmarSyaqif hello!

In the current implementation of YOLOv5, we do not provide Average Precision (AP) or Average Recall (AR) across the different detection scales (small, medium, and large).

However, you can calculate the AP and AR for each detection scale independently, using the relevant ground truth and predicted bounding boxes. You can then combine the AP and AR values across the different scales in a weighted average, based on the importance of each scale to your specific use case.

If you need further assistance on how to calculate AP and AR, please let us know and we would be happy to help.

Thank you for your interest in YOLOv5!

OmarSyaqif commented 1 year ago

Hi @glenn-jocher, I'd like to continue with a question about calculating AP and AR for each detection scale.

Let's say I have these anchors:

- [10,11, 22,12, 18,23, 35,24] # P2/4 --> xsmall

- [24,42, 40,42, 34,72, 72,45] # P3/8 --> small

- [78,78, 66,122, 171,52, 104,123] # P4/16 --> medium

- [90,190, 190,158, 165,259, 426,260] # p5/32 --> large

The input objects (converted from YOLO format to pixels) are:

class, xcenter, ycenter, width, height
11, 331, 216, 119, 99
4, 168, 225, 24, 35

How do I calculate which layer (xsmall, small, medium, large) each input should be processed in?

glenn-jocher commented 1 year ago

Hello @OmarSyaqif,

To determine which detection scale (xsmall, small, medium, large) in YOLOv5 each object in the input image should be processed in, you need to first determine the anchor box that best matches the object's size and aspect ratio. This means you need to calculate the object's height and width, as well as its aspect ratio, and then compare those values to the anchor boxes of different scales.

For instance, in your example, the first object has width = 119, height = 99, and an aspect ratio of height/width = 0.832, while the second object has width = 24, height = 35, and an aspect ratio of height/width = 1.46. Based on these values, you can compare the objects to the anchor boxes at each scale.

The first object has a height and width that are closer to the anchor box at the medium scale, so it should be processed in the medium detection layer. The second object has a smaller size and a higher aspect ratio, which suggests that it should be processed in the small detection layer.
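If it helps, here is a rough sketch (my own illustration, using the same width/height ratio test that YOLOv5 applies when building training targets) that picks the scale whose closest anchor minimizes max(w/aw, aw/w, h/ah, ah/h) for the two boxes above:

# Anchors from the message above, grouped by scale
anchors = {
    "xsmall (P2/4)":  [(10, 11), (22, 12), (18, 23), (35, 24)],
    "small (P3/8)":   [(24, 42), (40, 42), (34, 72), (72, 45)],
    "medium (P4/16)": [(78, 78), (66, 122), (171, 52), (104, 123)],
    "large (P5/32)":  [(90, 190), (190, 158), (165, 259), (426, 260)],
}

def best_scale(w, h):
    """Pick the scale whose closest anchor minimises max(w/aw, aw/w, h/ah, ah/h)."""
    def worst_ratio(aw, ah):
        return max(w / aw, aw / w, h / ah, ah / h)
    return min(anchors, key=lambda scale: min(worst_ratio(aw, ah) for aw, ah in anchors[scale]))

print(best_scale(119, 99))  # -> medium (P4/16)
print(best_scale(24, 35))   # -> small (P3/8)

Both results agree with the reasoning above: the 119 x 99 box lands on the medium (P4/16) scale and the 24 x 35 box on the small (P3/8) scale.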

Once you have determined the detection layer for each object, you can calculate the AP and AR for each detection layer independently, as explained in the previous message.

I hope this helps you calculate AP and AR for each detection scale. Please let me know if you have any further questions.

OmarSyaqif commented 1 year ago

@glenn-jocher thank you for the quick response.

It is very clear. Thank you very much

glenn-jocher commented 1 year ago

@OmarSyaqif hello!

To get the aspect ratio of an anchor box, you can calculate the ratio of its height to its width. In YOLOv5, the aspect ratios of the anchor boxes at different scales are predetermined based on the default anchor box sizes. You can find the default anchor box sizes for each scale in the YOLOv5 configuration file (yolov5s.yaml, yolov5m.yaml, etc.).

The aspect ratio ranges for each scale are determined by the aspect ratios of the default anchor boxes. In YOLOv5, each scale has three anchor boxes with different aspect ratios. For example, the default P3/8 anchors (10,13), (16,30), (33,23) have height/width aspect ratios of roughly 1.3, 1.9 and 0.7. During training and inference, each object in the input image is matched to the anchor box whose aspect ratio and size best match the object.

I hope this answers your question! Let me know if you need further clarification.

supriamir commented 1 year ago

@glenn-jocher thank you very much

glenn-jocher commented 1 year ago

@supriamir hello!

Thank you for your question. In YOLOv5, the aspect ratios of the anchor boxes at different scales are predetermined based on the default anchor box sizes. You can find the default anchor box sizes for each scale in the YOLOv5 configuration file (yolov5s.yaml, yolov5m.yaml, etc.).

The aspect ratio ranges for each scale are determined by the aspect ratios of the default anchor boxes. Typically, the anchor boxes at each scale have three different aspect ratios, and each object in the input image is matched to the anchor box whose aspect ratio and size best matches the object.

I hope this explanation helps. Let me know if you have any further questions.

Thank you for your interest in YOLOv5!

supriamir commented 1 year ago

Hi @glenn-jocher

Thank you for your answer.

I have obtained the model prediction results and compared them with the ground truth. Now I am trying to build the confusion matrix. Do I also have to define a confidence threshold when deciding TP, FP, TN and FN? For example, if I require confidence > 0.5, what happens to a TP with a confidence value <= 0.5? Does that object become an FP?

Here is my example output confusion matrix (attached image).

glenn-jocher commented 1 year ago

@supriamir hello,

In order to calculate the confusion matrix, you need to define a threshold for the confidence score at which a detection is considered a true positive or false positive. Typically, a detection with a confidence score greater than or equal to the threshold is considered a true positive, while a detection with a confidence score lower than the threshold is considered a false positive.

In your example, if you define the threshold as 0.5, all detections with scores greater than or equal to 0.5 would be considered true positives, and all other detections would be considered false positives.

Keep in mind that the threshold you choose may depend on the specific requirements of your application or use case. Additionally, the threshold may affect the trade-off between precision and recall in your detection results.
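As a rough illustration of this thresholding convention (a simple sketch, not YOLOv5's own ConfusionMatrix in utils/metrics.py), assuming each prediction has already been matched to its best ground-truth box:

def confusion_counts(predictions, num_gt, conf_thres=0.5, iou_thres=0.5):
    """predictions: list of (confidence, iou_with_matched_gt, class_correct) tuples.
    Returns (TP, FP, FN) using a simple confidence + IoU thresholding rule."""
    tp = fp = 0
    for conf, iou, class_ok in predictions:
        if conf >= conf_thres and iou >= iou_thres and class_ok:
            tp += 1
        else:
            fp += 1  # low-confidence or poorly-matched predictions counted as FP here
    fn = num_gt - tp     # ground-truth boxes left without an accepted prediction
    return tp, fp, fn

# Three predictions against three ground-truth boxes
print(confusion_counts([(0.9, 0.8, True), (0.4, 0.7, True), (0.8, 0.2, False)], num_gt=3))
# -> (1, 2, 2)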

I hope this helps you calculate the confusion matrix. Let me know if you have any further questions.

Thank you for using YOLOv5!

supriamir commented 1 year ago

hi @glenn-jocher

I have the anchor width, height and ratio for each anchor, shown in this table (attached image: scale1).

Below are the sample inputs (attached image: scale2).

I have difficulty defining the ranges (width, height and ratio) for the conditional statements in my code. Sometimes the width, height or ratio ranges overlap between scales. Example: input object 1: the width (27) seems close to scale 1 (22), but the height (22) is close to scale 2 (25), and the ratio (0.81) is close to 0.8.

Input object 2: the width (30) seems close to scale 2 (35), but the height (33) is close to scale 1 (37), and the ratio (1.1) is close to 1.2.

I am confused about how to define the ranges in the conditional statements in my code:

anchor_scale_1 : (10,12), (20,16), (22,37)
anchor_scale_2 : (35,25), (37,48), (69,54)

if (width, height, ratio) in anchor_scale_1: print("layer xsmall")
elif (width, height, ratio) in anchor_scale_2: print("layer small")
else: ... and so on

Thank you.

glenn-jocher commented 1 year ago

@supriamir hello!

To determine the detection layer in YOLOv5 where an object with given width, height, and aspect ratio should be processed, you can compare the object's size and aspect ratio to the ranges of anchor boxes at each scale. The anchor box ranges for each scale are defined based on the default anchor box sizes in the YOLOv5 configuration file (yolov5s.yaml, yolov5m.yaml, etc.).

In your example, the width and height of object 1 suggest that it should be processed in scale 1, while the aspect ratio suggests that it should be processed in scale 2. For object 2, the width suggests that it should be processed in scale 2, while the height and aspect ratio suggest that it should be processed in scale 1. In cases where there is overlap between the ranges of anchor boxes at different scales, you can prioritize based on the aspect ratio, which may best indicate the scale where the object belongs.

Here's an example Python code snippet that might help you define the conditional statements for your case:

def find_scale(width, height, aspect_ratio):
    # Define the anchor box ranges for each scale
    anchor_scale_1 = [(10, 12, 0.4), (20, 16, 0.8), (22, 37, 1.4)]
    anchor_scale_2 = [(35, 25, 0.6), (37, 48, 1.2), (69, 54, 2.0)]

    for a in anchor_scale_1:
        if a[0] <= width <= a[1] and a[2]-0.2 <= aspect_ratio <= a[2]+0.2:
            return "xsmall"

    for a in anchor_scale_2:
        if a[0] <= width <= a[1] and a[2]-0.2 <= aspect_ratio <= a[2]+0.2:
            return "small"

    # Add additional scales as needed

    return "unknown" # or raise an exception, or handle as needed

# Example usage:
print(find_scale(27, 22, 0.81)) # Output: xsmall
print(find_scale(30, 33, 1.1)) # Output: small

In this example code, the function find_scale compares the object's width and aspect ratio against each scale's anchor entries and returns the name of the first matching scale, or "unknown" if none match.

supriamir commented 1 year ago

Hi @glenn-jocher

Thank you very much for your answer.

glenn-jocher commented 1 year ago

@supriamir you're welcome! I'm glad I could help you. If you have any further questions or concerns, don't hesitate to ask. Have a great day!

supriamir commented 1 year ago

Hi @glenn-jocher

I need your explanation regarding your Python code snippet.

  1. How did you get the aspect_ratio for each scale? Is it a default value?

First scale:

anchor_scale_1 = [(10, 12, 0.4), (20, 16, 0.8), (22, 37, 1.4)]

Your aspect_ratio values: 0.4, 0.8, 1.4

My calculation:

aspect_ratio1 = height/width = 12/10 = 1.2?
aspect_ratio2 = height/width = 16/20 = 0.8?
aspect_ratio3 = height/width = 37/22 = 1.68?

anchor_scale_2 = [(35, 25, 0.6), (37, 48, 1.2), (69, 54, 2.0)]

Your aspect_ratio values: 0.6, 1.2, 2.0

My calculation:

aspect_ratio1 = height/width = 25/35 = 0.71?
aspect_ratio2 = height/width = 48/37 = 1.3?
aspect_ratio3 = height/width = 54/69 = 0.78?

I am not using the default anchor sizes in the (yolov5s.yaml, yolov5m.yaml, etc.) files. I used AutoAnchor to generate the anchor sizes, as you suggested.

  2. Why do you use a width range between a[0] and a[1]? As far as I know, a[0] is the width and a[1] is the height.

I am sorry for asking so many questions.

glenn-jocher commented 1 year ago

@supriamir hello,

Thank you for your question. To answer your first question: the aspect ratios in the example code snippet I provided were based on the default anchor box sizes in the YOLOv5 configuration files (yolov5s.yaml, yolov5m.yaml, etc.). If you are using AutoAnchor to generate the anchor box sizes, you should use your own aspect ratios based on the dimensions of the generated boxes. You can calculate the aspect ratio of an anchor box as the ratio of its height to its width, as you have done in your own calculations.

Regarding your second question, I apologize for the confusion. You are correct that a[0] in the example code refers to the width of the anchor box, and a[1] refers to the height. The width range in the conditional statement should have used a[1] instead of a[0]. Here is an updated version of the code snippet with that correction:

def find_scale(width, height, aspect_ratio):
    # Define the anchor box ranges for each scale
    anchor_scale_1 = [(10, 12, 0.4), (20, 16, 0.8), (22, 37, 1.4)]
    anchor_scale_2 = [(35, 25, 0.6), (37, 48, 1.2), (69, 54, 2.0)]

    for a in anchor_scale_1:
        if a[0] <= width <= a[1] and a[2]-0.2 <= aspect_ratio <= a[2]+0.2:
            return "xsmall"

    for a in anchor_scale_2:
        if a[1] <= height <= a[2] and a[2]-0.2 <= aspect_ratio <= a[2]+0.2:
            return "small"

    # Add additional scales as needed

    return "unknown" # or raise an exception, or handle as needed

# Example usage:
print(find_scale(27, 22, 0.81)) # Output: xsmall
print(find_scale(30, 33, 1.1)) # Output: small

Thank you for bringing this to our attention. We appreciate your thoroughness and attention to detail.

Please let me know if you have any further questions.

supriamir commented 1 year ago

Hi @glenn-jocher

I still don't understand the ranges for the height and width. Based on your code, it seems that in the first loop the width range is between a[0] (the width) and a[1] (the height), and in the second loop the height range is between a[1] (the height) and a[2] (the aspect ratio)?

for a in anchor_scale_1:
    if a[0] <= width <= a[1] and a[2]-0.2 <= aspect_ratio <= a[2]+0.2:
        return "xsmall"

for a in anchor_scale_2:
    if a[1] <= height <= a[2] and a[2]-0.2 <= aspect_ratio <= a[2]+0.2:
        return "small"

glenn-jocher commented 1 year ago

@supriamir hi,

I apologize for the confusion. You are correct that in my previous response, I made an error in describing the ranges for the conditional statements of the anchor box scales. Let me explain the ranges more clearly:

For each anchor box scale i, we define the range of object sizes and aspect ratios that are associated with that scale. We can define this range by looking at the height, width, and aspect ratio of the anchor boxes at that scale. For example, let's say that we have an anchor box scale i with anchor boxes of sizes (w1, h1), (w2, h2), and (w3, h3), and aspect ratios a1, a2, and a3. We can define the range for scale i as follows:

(min_w, max_w) = (w1 - eps, w3 + eps)
(min_h, max_h) = (h1 - eps, h3 + eps)
(min_a, max_a) = (min(a1, a2, a3) - eps, max(a1, a2, a3) + eps)

where eps is a small value that defines the tolerance for the range. This range definition assumes that the anchor boxes at scale i are arranged in increasing order of size and aspect ratio.

With this range definition in hand, we can now write the conditional statements to find the appropriate scale for a given object. Let (width, height, aspect_ratio) be the size and aspect ratio of the object. We can then check the ranges for each scale as follows:

for scale 1:

if min_w <= width <= max_w and min_h <= height <= max_h and min_a <= aspect_ratio <= max_a:
    return "scale 1"

for scale 2:

if min_w <= width <= max_w and min_h <= height <= max_h and min_a <= aspect_ratio <= max_a:
    return "scale 2"

and so on for additional scales.

I hope this clears things up. If you need further clarification, please let me know. Thank you for your patience and understanding.

supriamir commented 1 year ago

Hi @glenn-jocher

Thank you for your answer. I need clarification on your formula:

(min_w, max_w) = (w1 - eps, w3 + eps)
(min_h, max_h) = (h1 - eps, h3 + eps)

From this formula, w3 is always max_w and w1 is always min_w, and the same holds for min_h and max_h. But in the anchor boxes we generate, the maximum width is sometimes w1 or w2:

anchor_scale_1 = [(10, 12, 1.2), (20, 16, 0.8), (22, 37, 1.6)]
anchor_scale_2 = [(35, 25, 0.7), (37, 48, 1.3), (69, 54, 0.78)]
anchor_scale_3 = [(128, 39, 0.6), (60, 110, 1.83), (95, 125, 1.32)]  --> min_w = w2 = 60, max_w = w1 = 128
anchor_scale_4 = [(164, 136, 0.83), (154, 288, 1.87), (341, 251, 0.74)]  --> min_w = w2 = 154, max_w = w3 = 341

Maybe it should be like this:

(min_w, max_w) = (min(w1, w2, w3) - eps, max(w1, w2, w3) + eps)
(min_h, max_h) = (min(h1, h2, h3) - eps, max(h1, h2, h3) + eps)
(min_a, max_a) = (min(a1, a2, a3) - eps, max(a1, a2, a3) + eps)

Or we could calculate it based on the area of the anchor boxes:

anchor_scale_3 = [(128, 39, 0.6), (60, 110, 1.83), (95, 125, 1.32)]

area_anchor_scale_3[0] = width1 * height1 = 128 * 39 = 4,992
area_anchor_scale_3[1] = width2 * height2 = 60 * 110 = 6,600
area_anchor_scale_3[2] = width3 * height3 = 95 * 125 = 11,875

so the area always satisfies area_anchor_scale_3[0] <= area <= area_anchor_scale_3[2]

Thank you so much for always responding and answering my questions.

glenn-jocher commented 1 year ago

@supriamir hello,

Thank you for your question. You are correct that the formula I provided for min_w, max_w, min_h, and max_h may not work correctly for cases where the smallest anchor box in a scale has a larger width or height than the largest anchor box. Your alternate suggestion to compute the range based on the minimum and maximum values of the anchor box dimensions (w and h) could work better in these cases.

Another approach that you could consider is to compute the range based on the area of the anchor boxes, as you suggested. Since the anchor box dimensions tend to vary more in proportion to each other than independently, using the area may provide a better estimate of the object size. You could define the range for scale i as follows:

(min_area_i, max_area_i) = (area_i - eps, area_i + eps)

where area_i is the median area of the anchor boxes at scale i, and eps is the tolerance value as before. You can then check the ranges for each scale as follows:

for scale 1:

if min_area_1 <= area <= max_area_1 and min_a <= aspect_ratio <= max_a:
    return "scale 1"

for scale 2:

if min_area_2 <= area <= max_area_2 and min_a <= aspect_ratio <= max_a:
    return "scale 2"

and so on for additional scales.
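Putting the pieces together, here is one possible sketch of the range-based assignment (using the min/max form of the ranges discussed above and arbitrarily chosen eps values, so it is an illustration rather than a definitive implementation):

def build_ranges(anchor_scales, eps=5.0, eps_a=0.2):
    """anchor_scales: {scale_name: [(w, h), ...]}.
    Returns per-scale (width, height, aspect-ratio) ranges using min/max +- eps."""
    ranges = {}
    for name, anchors in anchor_scales.items():
        widths = [w for w, _ in anchors]
        heights = [h for _, h in anchors]
        aspects = [h / w for w, h in anchors]
        ranges[name] = (
            (min(widths) - eps, max(widths) + eps),
            (min(heights) - eps, max(heights) + eps),
            (min(aspects) - eps_a, max(aspects) + eps_a),
        )
    return ranges

def find_scale(width, height, ranges):
    aspect = height / width
    for name, ((wmin, wmax), (hmin, hmax), (amin, amax)) in ranges.items():
        if wmin <= width <= wmax and hmin <= height <= hmax and amin <= aspect <= amax:
            return name
    return "unknown"

ranges = build_ranges({
    "xsmall": [(10, 12), (20, 16), (22, 37)],
    "small": [(35, 25), (37, 48), (69, 54)],
})
print(find_scale(27, 22, ranges))  # -> xsmall (with these eps values)
print(find_scale(30, 33, ranges))  # -> small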

I hope that helps. Thank you for bringing this issue to our attention and for your continued interest in YOLOv5. Let me know if you have any further questions or concerns.

supriamir commented 1 year ago

Hi @glenn-jocher

I made the code following the formula. It works.

Some object box sizes end up assigned to the "unknown" scale. Would it be possible to treat these "unknown" cases as belonging to whichever other scale has comparable size ranges and aspect ratios? Or should I count them as FPs when calculating recall and precision for each scale?

glenn-jocher commented 1 year ago

@supriamir hello,

I'm glad to hear that the formula I provided worked for you.

Regarding the issue of some object sizes being defined as "unknown" scales, it would be possible to consider these sizes as belonging to one of the other scales if their ranges for size and aspect ratio are comparable. However, this would require a careful evaluation of the scale ranges for each anchor box size to determine whether this approach is appropriate. Another approach would be to treat object detections with unknown scales as false positives (FP) and calculate the recall and precision separately for each scale. This would allow you to evaluate the performance of each scale independently and provide more detailed insights into the strengths and weaknesses of your detection model.

I hope this helps. Let me know if you have any further questions or concerns.


elepherai commented 1 year ago

@glenn-jocher Hello, could you provide visualize.py?

glenn-jocher commented 1 year ago

@elepherai hello,

Thank you for your question. Unfortunately, we are unable to provide the visualize.py file directly as it is not included in the official YOLOv5 repository. However, you can refer to the repository's utils directory to find the visualization.py file, which contains functions for visualizing object detections. You can also check out the official YOLOv5 documentation for more details on how to visualize your YOLOv5 outputs.

If you have any more questions or need further assistance, please feel free to ask.

elepherai commented 1 year ago

@glenn-jocher Hi, I couldn't find the visualization.py file in the utils directory. (attached screenshot)

glenn-jocher commented 1 year ago

@elepherai hello,

I apologize for the confusion. It appears that the visualization.py file is not included in the official YOLOv5 repository. However, you can still visualize your object detections by using the plot_results() function in the utils/general.py file. This function takes the predicted bounding boxes, labels, and image tensor as inputs and generates an image with the bounding box overlays. You can try using this function to visualize your YOLOv5 outputs.

If you have any further questions or need additional assistance, please let me know.

AdnanMunir294 commented 1 year ago

I have a question. When I add an extra grid in SPPF, it gives good results for the extra-large scale, but it reduces the overall accuracy on small objects. Why? I need help as soon as possible, please. Thanks.

glenn-jocher commented 1 year ago

@AdnanMunir294 hi,

Adding an extra grid to the Spatial Pyramid Pooling (SPP) module can improve the detection of larger objects by capturing more context. However, this can come at the expense of smaller object detection due to dilution of fine-grained features. The additional grid may dilute the features relevant to small objects, leading to a reduction in overall accuracy for small objects.

To address this, you can consider adjusting the anchor box scales in the detection head to better suit the particular object sizes you are interested in. You can experiment with different anchor box scales and ratios to find the optimal configuration for detecting small objects while maintaining good performance on larger objects.

Additionally, you can try augmenting your dataset with more small object instances and potentially employing other techniques such as data augmentation, model ensembling, or transfer learning to improve detection accuracy on small objects.

I hope this helps clarify the behavior you are observing. If you have further questions or need additional assistance, please let me know.

Thanks!

AdnanMunir294 commented 1 year ago

@glenn-jocher Thanks for your help. My anchors are:

[10,13, 16,30, 33,23]  # P3/8
[30,61, 62,45, 59,119]  # P4/16
[116,90, 156,198, 373,326]  # P5/32

I have a question: how do I find these anchor box values, and what exactly do they represent? As I understand it, the first value is the width and the second is the height, but what are the other values? How can I find my own anchors based on my custom dataset when using --noautoanchor?

glenn-jocher commented 1 year ago

@AdnanMunir294 the anchor box values you provided represent the width and height of each anchor box; there are no other values beyond these width,height pairs. For each stage of the network (P3, P4, P5), one set of anchors is defined, containing three such pairs.

To determine these anchor box values, you can follow a few steps:

  1. Initial Anchors: Start with some initial anchor box values. YOLOv5 typically uses a predefined set of anchor box ratios that have been experimentally determined to work well on general object detection tasks.

  2. Generate Anchors: You can use a technique called k-means clustering on your custom dataset to generate anchor boxes specific to your dataset. The goal is to group the object instances in your dataset into clusters based on similarity in size. The centers of these clusters will then serve as your anchor box values.

  3. No AutoAnchor: By setting the --noautoanchor flag during training, you can prevent the automatic adjustment of anchor values based on your dataset. In that case you can specify the anchor values manually in your model's YAML file (e.g. models/yolov5s.yaml) under the anchors: entry, as a list of comma-separated width,height pairs per detection layer.

I hope this clarifies the process of finding anchor box values for your custom dataset. If you have any further questions or need additional assistance, please let me know.
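As a rough sketch of the k-means idea in step 2 above (plain k-means on width/height pairs, without the genetic refinement that YOLOv5's AutoAnchor adds on top; the box sizes below are hypothetical):

import numpy as np

def kmeans_anchors(wh, n=9, iters=100, seed=0):
    """wh: (N, 2) array of label widths/heights in pixels. Returns n (w, h) anchors."""
    rng = np.random.default_rng(seed)
    centers = wh[rng.choice(len(wh), size=n, replace=False)].astype(float)
    for _ in range(iters):
        # assign every box to its nearest centre in (w, h) space
        dists = np.linalg.norm(wh[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # move each centre to the mean of its assigned boxes
        for k in range(n):
            if np.any(labels == k):
                centers[k] = wh[labels == k].mean(axis=0)
    return sorted((tuple(c) for c in centers.round(1)), key=lambda a: a[0] * a[1])

# Hypothetical box sizes taken from a custom dataset's labels (already in pixels)
wh = np.array([[12, 15], [18, 30], [33, 25], [30, 62], [62, 44], [58, 120],
               [115, 90], [150, 200], [370, 320], [14, 16], [35, 22], [60, 45]])
print(kmeans_anchors(wh, n=9))

The resulting (w, h) pairs, sorted by area, can then be split into groups of three and written into the anchors: section of your model YAML.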