hbzhang opened this issue 6 years ago
Here's a quick explanation based on what I understand (which might be wrong but hopefully gets the gist of it). After doing some clustering studies on ground truth labels, it turns out that most bounding boxes have certain height-width ratios. So instead of directly predicting a bounding box, YOLOv2 (and v3) predict offsets from a predetermined set of boxes with particular height-width ratios - those predetermined boxes are the anchor boxes.
Anchors are initial sizes (width, height), some of which (the closest to the object size) will be resized to the object size, using some outputs from the neural network (the final feature map): https://github.com/pjreddie/darknet/blob/6f6e4754ba99e42ea0870eb6ec878e48a9f7e7ae/src/yolo_layer.c#L88-L89

* `x[...]` - outputs of the neural network
* `biases[...]` - anchors
* `b.w` and `b.h` - resulting width and height of the bounding box that will be shown on the result image

Thus, the network should not predict the final size of the object, but should only adjust the size of the nearest anchor to the size of the object.

In Yolo v3, anchors (width, height) are sizes of objects on the image, resized to the network size (`width=` and `height=` in the cfg-file).
In Yolo v2, anchors (width, height) are sizes of objects relative to the final feature map (32 times smaller than in Yolo v3 for the default cfg-files).
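To make that concrete, here is a minimal Python/NumPy sketch of the same math as those two yolo_layer.c lines. This is only an illustration, not darknet's actual API; the function name and the 416 network size are assumptions.

```python
import numpy as np

def decode_wh(tw, th, anchor_w, anchor_h, net_w=416, net_h=416):
    """Sketch of the math in the linked yolo_layer.c lines 88-89: the raw outputs
    (tw, th) only rescale the chosen anchor; the result is relative to the
    network input size, so multiply by the image size to get pixels."""
    b_w = np.exp(tw) * anchor_w / net_w   # b.w = exp(x[index + 2*stride]) * biases[2n]   / w
    b_h = np.exp(th) * anchor_h / net_h   # b.h = exp(x[index + 3*stride]) * biases[2n+1] / h
    return b_w, b_h

# Example: the 116x90 px anchor from yolov3.cfg with small predicted offsets
print(decode_wh(0.2, -0.1, 116, 90))  # ~(0.34, 0.20) of the network width/height
```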
Thanks!
For YoloV2 (5 anchors) and YoloV3 (9 anchors) is it advantageous to use more anchors? For example, if I have one class (face), should I stick with the default number of anchors or could I potentially get higher IoU with more?
I was wondering the same. The more anchors used, the higher the IoU; see https://medium.com/@vivek.yadav/part-1-generating-anchor-boxes-for-yolo-like-network-for-vehicle-detection-using-kitti-dataset-b2fe033e5807. However, when you try to detect one class that usually shows the same object aspect ratio (like faces), I don't think that increasing the number of anchors is going to increase the IoU by a lot, while the computational overhead is going to increase significantly.
I used YOLOv2 to detect some industrial meter boards a few weeks ago and tried the same idea spinoza1791 and CageCode referred to. The reason was that I needed high accuracy but also wanted to stay close to real time, so I changed the number of anchors (YOLOv2 -> 5), but it all ended up crashing after about 1800 iterations, so I might be missing something there.
@AlexeyAB How do you get the initial anchor box dimensions after clustering? The width and height after clustering are all numbers less than 1, but anchor box dimensions can be greater or less than 1. How do you get the anchor box dimensions?
> Anchors are initial sizes (width, height), some of which (the closest to the object size) will be resized to the object size ...
great explanation bro. thank you.
Sorry, the phrase is still unclear: "In Yolo v2 anchors (width, height) are sizes of objects relative to the final feature map." What are the "final feature map" sizes? For yolo-voc.2.0.cfg the input image size is 416x416, anchors = 1.08,1.19, 3.42,4.41, 6.63,11.38, 9.42,5.11, 16.62,10.52. I get that each pair represents an anchor width and height, centered in each of the 13x13 cells. The last anchor is 16.62 (width?), 10.52 (height?) - what units are they? Can somebody explain literally with this example? And maybe someone has uploaded code for deducing the best anchors from a given dataset with k-means?
I think maybe your anchors have some error. In yolo2 the anchor sizes are based on the final feature map (13x13), as you said, so the anchor dimensions are expressed on the scale of that 13x13 grid. But in yolo3 the author changed anchor sizes to be based on the initial input image size. As the author said: "In YOLOv3 anchor sizes are actual pixel values. This simplifies a lot of stuff and was only a little bit harder to implement." Hope I am not missing anything :)
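As a rough illustration of that unit difference, here is a small sketch assuming the default 416x416 input and 32x downsampling (the anchor values are the yolo-voc.2.0.cfg defaults quoted above):

```python
# YOLOv2 anchors are in final-feature-map (grid) units; with a 416x416 input and
# 32x downsampling, one grid unit = 32 pixels. YOLOv3 anchors are already pixels.
v2_anchors = [(1.08, 1.19), (3.42, 4.41), (6.63, 11.38), (9.42, 5.11), (16.62, 10.52)]
print([(w * 32, h * 32) for (w, h) in v2_anchors])
# the last anchor, 16.62 x 10.52 grid units, is roughly 532 x 337 pixels
# (a predicted box may end up larger than the 416x416 input)
```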
Dears,
is it necessary to get the anchor values before training to enhance the model?
I am building my own dataset to detect 6 classes using tiny YOLOv2, and I used the command below to get anchor values. Do I need to change the width and height here if I am changing them in the cfg file?
Are the anchors below acceptable, or are the values too large? And what does num_of_clusters 9 mean?
```
....\build\darknet\x64>darknet.exe detector calc_anchors data/obj.data -num_of_clusters 9 -width 416 -height 416
num_of_clusters = 9, width = 416, height = 416
read labels from 8297 images
loaded image: 2137 box: 7411
Wrong label: data/obj/IMG_0631.txt - j = 0, x = 1.332292, y = 1.399537, width = 0.177083, height = 0.412037
loaded image: 2138 box: 7412
calculating k-means++ ...
avg IoU = 59.41 %
Saving anchors to the file: anchors.txt
anchors = 19.2590,25.4234, 42.6678,64.3841, 36.4643,117.4917, 34.0644,235.9870, 47.0470,171.9500, 220.3569,59.5293, 48.2070,329.3734, 99.0149,240.3936, 165.5850,351.2881
```
Getting the anchor values first makes training faster, but it is not necessary. Tiny YOLO is not very accurate; if you can, I suggest you use YOLOv2.
@jalaldev1980 I try to guess: where did you get this calc_anchors flag in your command line? I didn't find it in YOLO-2; maybe it is in YOLO-3?
./darknet detector calc_anchors your_obj.data -num_of_clusters 9 -width 416 -height 416
check How to improve object detection section at https://github.com/AlexeyAB/darknet#how-to-train-to-detect-your-custom-objects
for tiny yolo check the comments at https://github.com/pjreddie/darknet/issues/911
let me know if you find any other resources and advices
Can someone provide some insights into YOLOv3's time complexity if we change the number of anchors?
Hi guys,
I got to know that yolo3 employs 9 anchors, but there are three layers used to generate yolo targets. Does this mean each yolo target layer should have 3 anchors at each feature point according to their scale, as in FPN, or do we need to match all 9 anchors with one gt on all 3 yolo output layers?
I use a single set of 9 anchors for all 3 layers in the cfg file and it works fine. I believe this set is for one base scale and is rescaled for the other 2 layers somewhere in the framework code. Let someone correct me if I am wrong.
> Anchors are initial sizes (width, height), some of which (the closest to the object size) will be resized to the object size ... In Yolo v3, anchors (width, height) are sizes of objects on the image, resized to the network size (`width=` and `height=` in the cfg-file).
Thanks, but why do darknet's yolov3 config files https://github.com/pjreddie/darknet/blob/master/cfg/yolov3-voc.cfg and https://github.com/pjreddie/darknet/blob/master/cfg/yolov3.cfg have different input sizes (416 and 608) but use the same anchor sizes, if yolo v3 anchors are sizes of objects on the image resized to the network size?
@weiaicunzai You are right, the 2 cfg files with different input sizes (416 and 608) have the same anchor box sizes. Seems to be a mistake. As for me, I use a utility to find anchors specific to my dataset; it increases accuracy.
Hi, I have an anchor question please: If I did not misunderstand the paper, there is also a positive-negative mechanism in yolov3, but only when we compute the confidence loss, since xywh and classification rely only on the best match. Thus the xywh loss and classification loss are computed with the gt and only one associated match. As for the confidence, the division into positive and negative is based on the IoU value. My question is: is this IoU computed between the gt and the anchors, or between the gt and the predictions that are computed from the anchors and the model outputs (the offsets generated by the model)?
Say I have a situation where all the objects I need to detect are the same size, 30x30 pixels, on an image that is 295x295 pixels. How would I go about calculating the best anchors for yolo v2 to use during training?
@Sauraus There is a special python program, see AlexeyAB's reference on github, which calculates the 5 best anchors based on your dataset variety (for YOLO-2). Very easy to use. Then replace the anchors string in your cfg file with the new anchor boxes. If you have same-size objects, it would probably give you a set of identical pairs of numbers.
@andyrey are you referring to this: https://github.com/AlexeyAB/darknet/blob/master/scripts/gen_anchors.py by any chance?
@Sauraus: Yes, I used this for YOLO-2 with cmd: python gen_anchors.py -filelist train.txt -output_dir ./ -num_clusters 5
and for 9 anchors for YOLO-3 I used C-language darknet: darknet3.exe detector calc_anchors obj.data -num_of_clusters 9 -width 416 -height 416 -showpause
Is anyone facing an issue with YoloV3 prediction where occasionally bounding box centres are negative or the overall bounding box height/width exceeds the image size?
Yes and it's driving me crazy.
Is anyone facing an issue with YoloV3 prediction where occasionally bounding box centre are either negative or overall bounding box height/width exceeds the image size?
I think the bounding box is hard to fit precisely to your target; there is always some deviation, it is just a matter of how large the error is. If the error is very large, maybe you should check your training data and test data, but there are still many possible causes. Maybe you can post your picture?
> Anchors are initial sizes (width, height), some of which (the closest to the object size) will be resized to the object size, using some outputs from the neural network (the final feature map) ...
Can someone clarify why we take the exponential of the predicted widths and heights? Why not just multiply the anchor coordinates by them instead of taking the exponential first?
Extremely useful discussion - thanks all - have been trying to understand Azure Cognitive Services / Microsoft Custom Vision object detection. I had been wondering where their exported anchor values came from. It's now fairly clear they do transfer learning off YOLO.
I am training my objects using YOLO v3: 9 anchors, 5 objects, 45000 iterations (4 days), and the system is not detecting any objects!
Any thoughts?
@AlexeyAB How do you get the initial anchor box dimensions after clustering? The width and height after clustering are all number s less than 1, but anchor box dimensions are greater of less than 1. How to get the anchor box dimensions?
YOLO's anchors are specific to the dataset they are trained on (the default set is based on PASCAL VOC). They ran k-means clustering on the normalized widths and heights of the ground-truth bounding boxes and obtained 5 values.
The final values are based not on pixel coordinates but on grid units. YOLO's default set: anchors = 1.3221, 1.73145, 3.19275, 4.00944, 5.05587, 8.09892, 9.47112, 4.84053, 11.2364, 10.0071. This means the height and width of the first anchor are slightly over one grid cell [1.3221, 1.73145], and the last anchor almost covers the whole image [11.2364, 10.0071], considering the image is a 13x13 grid.
Hope this gives you a bit clearer idea if not complete.
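If it helps, here is a minimal sketch of that clustering using plain k-means on the normalized (w, h) pairs. As far as I know, darknet's calc_anchors / gen_anchors.py use a 1 - IoU distance rather than Euclidean, so their anchors will differ somewhat; the function and variable names here are just illustrative.

```python
import numpy as np

def kmeans_anchors(wh, k=5, iters=100, seed=0):
    """Plain k-means on normalized (w, h) ground-truth box sizes; wh has shape (N, 2)."""
    rng = np.random.default_rng(seed)
    centers = wh[rng.choice(len(wh), size=k, replace=False)]
    for _ in range(iters):
        # assign each box to its nearest center (Euclidean here; darknet uses 1 - IoU)
        dists = np.linalg.norm(wh[:, None, :] - centers[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        centers = np.array([wh[labels == i].mean(axis=0) if np.any(labels == i) else centers[i]
                            for i in range(k)])
    return centers[np.argsort(centers.prod(axis=1))]  # sort by area, small to large

# wh would normally come from your label files; random values stand in here.
wh = np.random.rand(1000, 2)
print(kmeans_anchors(wh, k=5) * 13)   # YOLOv2 style: 13x13 grid units
print(kmeans_anchors(wh, k=9) * 416)  # YOLOv3 style: input-image pixels
```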
> Can someone clarify why we take the exponential of the predicted widths and heights? Why not just multiply the anchor coordinates by them instead of taking the exponential first?
box_w[i, j, b] = anchor_w[b] * exp(delta_w[i, j, b]) * 32
box_h[i, j, b] = anchor_h[b] * exp(delta_h[i, j, b]) * 32
where i and j are the row and column in the grid (0–12) and b is the detector index (0–4). It's OK for the predicted box to be wider and/or taller than the original image, but it does not make sense for the box to have a negative width or height. That's why we take the exponent of the predicted number: if the predicted delta_w is smaller than 0, exp(delta_w) is a number between 0 and 1, making the box smaller than the anchor box; if delta_w is greater than 0, then exp(delta_w) is a number > 1, which makes the box wider; and if delta_w is exactly 0, then exp(0) = 1 and the predicted box has exactly the same width as the anchor box. We multiply by 32 because the anchor coordinates are in the 13x13 grid, and each grid cell covers 32 pixels in the 416x416 input image.
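A tiny numeric illustration of that point, using made-up offsets applied to one of the default YOLOv2 anchors:

```python
import math

anchor_w = 3.19275                  # a default YOLOv2 anchor width, in 13x13 grid units
for delta_w in (-2.0, 0.0, 1.5):    # made-up raw network outputs
    print(delta_w, round(anchor_w * math.exp(delta_w) * 32, 1))
# -2.0 ->  13.8 px (shrunk), 0.0 -> 102.2 px (the anchor itself), 1.5 -> 457.9 px (grown),
# and the width can never go negative, whatever delta_w the network emits
```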
> Thanks, but why do darknet's yolov3-voc.cfg and yolov3.cfg have different input sizes (416 and 608) but use the same anchor sizes, if yolo v3 anchors are sizes of objects on the image resized to the network size?
>
> @weiaicunzai You are right, the 2 cfg files with different input sizes (416 and 608) have the same anchor box sizes. Seems to be a mistake. As for me, I use a utility to find anchors specific to my dataset; it increases accuracy.
According to my understanding, Yolo uses multiscale training: it randomly switches the input size while training (instead of always training on 416x416, it varies up to 608x608 in multiples of 32) to adapt the network to different input sizes, and maybe to satisfy this multi-scale training they used the same anchor sizes.
In simple words, they used anchors that fit the image of the smallest dimensions, since those can also be used for bigger dimensions; the other way around would give errors, as the anchors would become larger than the image itself.
I also feel this is the reason (a mismatch between the dimensions of the training/testing images and the default anchors) why many people are getting anchors greater than the image itself.
I'm not quite sure what the main reason is why yolo uses multiple bounding boxes per grid cell. An answer I can find on the web is to handle multiple aspect ratios in the prediction. However, I don't think that's completely true because, if I understand correctly, yolo optimizes the offset of a prior proportional to the size of the image, so a box can expand to anywhere in the image and the number of boxes doesn't really matter. My guess is that maybe it's a heuristic to speed up computation or to avoid getting stuck in a bad local optimum. Or is there another good explanation?
First of all, correct terminology would be to not interchange bounding boxes and anchors. To answer your question, YOLO uses multiple anchors because each anchor can be generalized for a particular shape/size. In a way you're absolutely right that an anchor can use offsets to get the detection, but accuracy would be really bad, as a single anchor would have to generalize (in the same picture) objects as different as a ball in the hand of a kid and the Eiffel Tower. Hope it provides a gist.
It doesn't convince me. It should be noted that in the same picture, each grid cell has a number of anchor boxes whose sizes will be optimized, which means, intuitively, each grid cell predicts separately. Also, as I mentioned, since the offsets of the anchor boxes are optimized, they can expand to whatever values fit the objects in the image, no matter what aspect ratio the objects have. So your argument that "it has to generalize (in the same picture) objects as different as a ball in the hand of a kid and the Eiffel Tower with a single anchor" does not really make sense here.
The anchors are based on the dataset you have, therefore the COMMON aspect ratios and sizes of the MOST occurring classes become the sizes of the anchors. It's true that offsets let anchors grab objects that are not of the same dimensions as the anchors. But the statement "they can expand to whatever values fit the objects in the image, no matter what aspect ratio the objects have" also doesn't make sense; otherwise there would be no need for anchors. It's more related to the quantity of objects in that grid cell.
Another viewpoint on anchors that might help you visualize: grids came into the picture to restrict the location. If you don't take grids and instead use the whole image where there are two horses (one to the left of the image and another to the right), you will get boxes in the middle, i.e. the average in both coordinates and in bounding box size. Therefore you take a grid cell and tell it: you have 5 boxes, please use them to fit objects. If it had only one anchor, it could give only one prediction; if it has 3 objects of the same size, again it gives only one detection. Therefore 5 anchors give it the opportunity to get 5 objects, provided they are of different sizes.
And I have a related question: if I have 5 anchors (or 9 in YOLO-3), I can detect up to 5 (9) objects in each cell, yes? But what objects do they (the anchors) represent if the number of classes > the number of anchors?
I am a little confused by your question, so I will answer based on what I have understood.
If you meant generally more classes than anchors: in that case, suppose you have 7 classes and the default 5 anchors. There might be two possibilities.
If you are asking about the number of classes in the same grid cell being more than the number of anchors: keep in mind that the only grid cell responsible for detecting an object is the one inside which the center of the object falls. It is highly improbable that the centers of each (5 in your case) detection fall inside the same grid cell (as the grid is very fine, 13x13 or 19x19 by default, or finer if you want). Say, hypothetically, you have more than 7 classes (all of different sizes) in the same grid cell and only 5 anchors; then you will get only 5 detections, whose sizes or ratios will be very close to the anchors.
To have a more robust result, it is important to analyze your dataset and classes and set the number of anchors accordingly. Their sizes can also be set by k-means if you wish to go a little further than the defaults. The number 5 for the anchors was merely their pick after analyzing their dataset (if I remember correctly, it was PASCAL VOC) and finding it optimal and balanced between average IoU and computation.
@andyrey, on a side note, I have gotten more intuition about YOLO from your comments on this thread, and you asking questions is kind of above my head. Are you messing with me in any way? :stuck_out_tongue_closed_eyes:
@ameeiyn Thank you for your explanations, they are valuable for understanding what I misunderstood! I use my own anchors, and this helps improve detection, no doubt. I used a set of 9 anchors for 1 class of deformable objects (people in different poses), and a set of 9 anchors for 22 classes of different symbols, but those are all the same size. In both cases it works fine after using my own set of anchors. But I never tried the case formulated above: many classes with very different sizes and ratios. I believe in this case the k-means clustering subroutine just uses the whole dataset, mixed over all classes, to generate a single optimal set of anchors, which is used to find "any class object" with confidence > threshold first. That is my understanding; maybe I am wrong.
@andyrey Thank You and Welcome. I agree with your understanding :100:
If my input dimension is 224x224, can I use the same anchor sizes in the cfg (like 10,13, 16,30, 33,23, 30,61, 62,45, 59,119, 116,90, 156,198, 373,326), or do I need to change them? Also, if I have to change them, will linear scaling work?
Can someone explain to me how the ground truth tensors are constructed in, for example, YOLO3? Do we use anchor boxes' values in this process?
If I have a 416x416 image and 80 classes, I understand that I (or some script) have to construct 3 ground truth tensors: 13x13x255, 26x26x255, 52x52x255. A 1x1x255 vector for a cell containing an object center would have 3 1x1x85 parts. Each of these parts "corresponds" to one anchor box.
So, what do I do next? How do I specify the (x, y, w, h) values in each of these 3 1x1x85 parts? Are the anchor box values determined on the dataset used for obtaining the (x, y, w, h) prior values, or only the ground truth box values from the images?
Can anyone explain the process flow? I am getting different concepts from different sources. I am not clear whether Yolo first divides the image into n x n grid cells and then does the classification, or whether it classifies the objects in one pass. It would be very helpful if someone explained the process from the start.
Hi all,
We're struggling to get our Yolov3 working for a 2-class detection problem (the sizes of the objects of both classes are varying and similar, generally small, and the size itself does not help differentiate the object type). We think the training is not working due to some problem with the anchor boxes, since we can clearly see that, depending on the assigned anchor values, yolo_output_0, yolo_output_1 or yolo_output_2 fails to return a loss value different from 0 (for the xy, hw and class components). However, even if there are multiple threads about anchor boxes, we cannot find a clear explanation of how they are assigned specifically for YOLOv3.
So far, what we're doing to determine the size of the boxes is:
1- We run a clustering method on the normalized ground truth bounding boxes (according to the original size of the image) and get the centroids of the clusters. In our case we have 2 clusters and the centroids are about (0.087, 0.052) and (0.178, 0.099).
2- Then we rescale the values according to the rescaling we are going to apply to the images during training. We are working with rectangular images of (256, 416), so we get bounding boxes of (22, 22) and (46, 42). Note that we have rounded the values, as we have read that yoloV3 expects actual pixel values.
3- Since we compute anchors at 3 different scales (3 skip connections), the previous anchor values correspond to the large scale (52). The anchors for the other two scales (13 and 26) are calculated by dividing the first anchors by 2 and by 4.
We are not even sure if we are correct up to this point. If we look at the code in the original models.py what we see is the following:
```python
yolo_anchors = np.array([(10, 13), (16, 30), (33, 23), (30, 61), (62, 45),
                         (59, 119), (116, 90), (156, 198), (373, 326)], np.float32) / 416
yolo_anchor_masks = np.array([[6, 7, 8], [3, 4, 5], [0, 1, 2]])
```
So, there are 9 anchors, which are ordered from smaller to larger, and the anchor_masks determine the resolution at which they are used, is this correct? In fact, our first question is: are they 9 anchors, or 3 anchors at 3 different scales? If so, how are they calculated? We know about the gen_anchors script in yolo_v2 and a similar script in yolov3; however, we don't know whether they calculate 9 clusters and then order them according to size, or whether they follow a procedure similar to ours.
Additionally, we don't fully understand why these boxes are divided by 416 (the image size). This would mean having anchors that are not integer pixel values, which we read was necessary for yolov3.
We would be really grateful if someone could provide us with some insight into these questions and help us better understanding how yoloV3 performs.
Thanks and regards Karen
Why do you use 2 clusters for your dataset? In YOLO-3 you can prepare 9 anchors regardless of the class number. Each scale of the net uses 3 of them (3x3=9). Look at the lines mask = 0,1,2, then mask = 3,4,5, and mask = 6,7,8 in the cfg file.
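Roughly, those mask lines pick which of the 9 anchors each [yolo] layer uses. Here is a small sketch with the default yolov3.cfg anchors; the scale labels are just for illustration:

```python
# The 9 default yolov3.cfg anchors, smallest to largest, in pixels of the network input.
anchors = [(10, 13), (16, 30), (33, 23), (30, 61), (62, 45),
           (59, 119), (116, 90), (156, 198), (373, 326)]

# Each [yolo] layer lists all 9 anchors, but its mask selects the 3 it actually uses.
masks = {"13x13 output (large objects)":  [6, 7, 8],
         "26x26 output (medium objects)": [3, 4, 5],
         "52x52 output (small objects)":  [0, 1, 2]}

for scale, mask in masks.items():
    print(scale, [anchors[i] for i in mask])
```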
When you say "small", can you quantify that? From experience I can say that YOLO V2/3 is not great on objects below 35x35 pixels.
Why do you use 2 clusters for your dataset? In YOLO-3 you can prepare 9 anchors, regardless class number. Each of the scale of net uses 3 of them (3x3=9). Look at line mask = 0,1,2 , then mask = 3,4,5, and mask = 6,7,8 in cfg file.
Thanks for your response. We use 2 because, if we look at our data, the sizes of our bounding boxes can be clustered into 2 groups (even one would be enough), so we don't need to use 3 of them. We did not set 2 anchor boxes because of the number of classes.
> When you say "small", can you quantify that?
Hi Sauraus, thanks for your response. The original size of our images is about (2000-5000)x(4800-7000), and the average size of the object bounding boxes is 300x300. Do you think this is a problem?
I know this might be too simple for many of you, but I cannot seem to find good literature illustrating clearly and definitively the idea and concept of anchor boxes in Yolo (V1, V2, and V3). Thanks!