pjreddie / darknet

Convolutional Neural Networks
http://pjreddie.com/darknet/

Can someone clarify the anchor box concept used in Yolo? #568

Open hbzhang opened 6 years ago

hbzhang commented 6 years ago

I know this might be too simple for many of you, but I cannot seem to find any literature that clearly and definitively explains the idea and concept of anchor boxes in YOLO (v1, v2, and v3). Thanks!

vkmenon commented 6 years ago

Here's a quick explanation based on what I understand (which might be wrong, but hopefully it gets the gist across). Clustering studies on the ground-truth labels show that most bounding boxes fall into a few characteristic height-width ratios. So instead of predicting a bounding box directly, YOLOv2 (and v3) predict offsets from a predetermined set of boxes with those particular height-width ratios; that predetermined set of boxes is the anchor boxes.
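For anyone who wants to reproduce that clustering on their own labels, here is a minimal sketch of the idea (not the exact darknet gen_anchors.py / calc_anchors procedure; it assumes `boxes` is an (N, 2) NumPy array of ground-truth (width, height) pairs and uses the 1 - IoU style distance described in the YOLOv2 paper):

```python
# Minimal sketch of anchor clustering (not the exact darknet implementation).
# Assumes `boxes` is an (N, 2) array of ground-truth (width, height) pairs.
import numpy as np

def iou_wh(box, anchors):
    """IoU between one (w, h) box and k (w, h) anchors, both centered at the origin."""
    inter = np.minimum(box[0], anchors[:, 0]) * np.minimum(box[1], anchors[:, 1])
    union = box[0] * box[1] + anchors[:, 0] * anchors[:, 1] - inter
    return inter / union

def kmeans_anchors(boxes, k=9, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    anchors = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        # Assign each box to the anchor with the highest IoU (i.e. lowest 1 - IoU).
        assign = np.array([np.argmax(iou_wh(b, anchors)) for b in boxes])
        # Move each anchor to the mean (w, h) of its assigned boxes.
        for j in range(k):
            if np.any(assign == j):
                anchors[j] = boxes[assign == j].mean(axis=0)
    return anchors[np.argsort(anchors[:, 0] * anchors[:, 1])]  # sorted by area
```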

AlexeyAB commented 6 years ago

Anchors are initial sizes (width, height), some of which (those closest to the object size) will be resized to the object size using outputs from the neural network (the final feature map): https://github.com/pjreddie/darknet/blob/6f6e4754ba99e42ea0870eb6ec878e48a9f7e7ae/src/yolo_layer.c#L88-L89

Thus, the network should not predict the final size of the object, but should only adjust the size of the nearest anchor to the size of the object.

In Yolo v3, anchors (width, height) are sizes of objects on the image, resized to the network size (width= and height= in the cfg file).

In Yolo v2, anchors (width, height) are sizes of objects relative to the final feature map (32 times smaller than in Yolo v3 for the default cfg files).
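A minimal Python sketch of what those two lines compute (not darknet code; the names tw, th, net_w, net_h are assumptions for the raw network outputs and the network input size):

```python
import math

def decode_wh(tw, th, anchor_w, anchor_h, net_w, net_h):
    """Sketch of the decode above: the raw outputs (tw, th) only scale the anchor.

    For YOLOv3-style anchors given in pixels, dividing by the network input
    size (net_w, net_h) gives a box size normalized to the input image.
    """
    b_w = math.exp(tw) * anchor_w / net_w
    b_h = math.exp(th) * anchor_h / net_h
    return b_w, b_h

# Outputs near 0 keep the anchor size almost unchanged:
print(decode_wh(0.1, -0.1, anchor_w=116, anchor_h=90, net_w=416, net_h=416))
# -> (~0.31, ~0.20), i.e. roughly 128 x 81 pixels on a 416x416 input
```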

hbzhang commented 6 years ago

Thanks!

spinoza1791 commented 6 years ago

For YoloV2 (5 anchors) and YoloV3 (9 anchors) is it advantageous to use more anchors? For example, if I have one class (face), should I stick with the default number of anchors or could I potentially get higher IoU with more?

CageCode commented 6 years ago

For YoloV2 (5 anchors) and YoloV3 (9 anchors) is it advantageous to use more anchors? For example, if I have one class (face), should I stick with the default number of anchors or could I potentially get higher IoU with more?

I was wondering the same. The more anchors used, the higher the average IoU; see https://medium.com/@vivek.yadav/part-1-generating-anchor-boxes-for-yolo-like-network-for-vehicle-detection-using-kitti-dataset-b2fe033e5807. However, when you try to detect a single class whose instances tend to have the same aspect ratio (like faces), I don't think increasing the number of anchors is going to increase the IoU by much, while the computational overhead is going to increase significantly.
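A rough way to check this trade-off on your own labels is to sweep the anchor count and look at the mean best-anchor IoU (a sketch reusing the iou_wh / kmeans_anchors helpers sketched earlier in this thread; darknet's calc_anchors prints a similar avg IoU number):

```python
import numpy as np

def avg_iou(boxes, anchors):
    """Mean IoU between each ground-truth (w, h) and its best-matching anchor."""
    return float(np.mean([np.max(iou_wh(b, anchors)) for b in boxes]))

def sweep_anchor_counts(boxes, ks=(3, 5, 7, 9)):
    # Diminishing returns in avg IoU suggest the extra anchors are not worth it.
    for k in ks:
        anchors = kmeans_anchors(boxes, k=k)
        print(f"k={k}: avg IoU = {avg_iou(boxes, anchors):.3f}")
```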

fkoorc commented 6 years ago

I used YOLOv2 to detect industrial meter boards a few weeks ago and tried the same idea spinoza1791 and CageCode referred to. I needed high accuracy but also wanted to stay close to real time, so I thought I would change the number of anchors (default 5 in YOLOv2), but training kept crashing after about 1800 iterations, so I might be missing something there.

frozenscrypt commented 6 years ago

@AlexeyAB How do you get the initial anchor box dimensions after clustering? The widths and heights after clustering are all numbers less than 1, but the anchor box dimensions are greater or less than 1. How do you get the anchor box dimensions?

saiteja011 commented 6 years ago

Anchors are initial sizes (width, height) some of which (the closest to the object size) will be resized to the object size - using some outputs from the neural network (final feature map):

darknet/src/yolo_layer.c, lines 88 to 89 (commit 6f6e475):

```c
b.w = exp(x[index + 2*stride]) * biases[2*n]   / w;
b.h = exp(x[index + 3*stride]) * biases[2*n+1] / h;
```

* `x[...]` - outputs of the neural network

* `biases[...]` - anchors

* `b.w` and `b.h` - the resulting width and height of the bounding box that will be shown on the result image

Thus, the network should not predict the final size of the object, but should only adjust the size of the nearest anchor to the size of the object.

In Yolo v3 anchors (width, height) - are sizes of objects on the image that resized to the network size (width= and height= in the cfg-file).

In Yolo v2 anchors (width, height) - are sizes of objects relative to the final feature map (32 times smaller than in Yolo v3 for default cfg-files).

Great explanation, bro. Thank you.

andyrey commented 5 years ago

Sorry, the phrase "In Yolo v2 anchors (width, height) are sizes of objects relative to the final feature map" is still unclear. What are the "final feature map" sizes? For yolo-voc.2.0.cfg the input image size is 416x416 and anchors = 1.08,1.19, 3.42,4.41, 6.63,11.38, 9.42,5.11, 16.62,10.52. I get that each pair represents an anchor width and height, centered in each of the 13x13 cells. But the last anchor, 16.62 (width?) by 10.52 (height?): what units are those? Can somebody explain literally with this example? And maybe someone has uploaded code for deducing the best anchors for a given dataset with k-means?

fkoorc commented 5 years ago

I think maybe your anchors have some error. In YOLOv2 the anchor size is based on the final feature map (13x13), as you said, so the anchor dimensions should normally be smaller than 13x13. But in YOLOv3 the author changed the anchor sizes to be based on the initial input image size. As the author said: "In YOLOv3 anchor sizes are actual pixel values. this simplifies a lot of stuff and was only a little bit harder to implement". Hope I am not missing anything :)
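To put numbers on andyrey's example above (a quick sketch, assuming the yolo-voc.2.0.cfg anchors are in 13x13 grid-cell units and each cell covers 416 / 13 = 32 pixels):

```python
# Convert YOLOv2 anchors from grid-cell units to pixels (416x416 input, 13x13 grid).
anchors_grid = [(1.08, 1.19), (3.42, 4.41), (6.63, 11.38), (9.42, 5.11), (16.62, 10.52)]
cell = 416 // 13  # 32 pixels per grid cell

anchors_px = [(round(w * cell), round(h * cell)) for w, h in anchors_grid]
print(anchors_px)
# [(35, 38), (109, 141), (212, 364), (301, 164), (532, 337)]
# Note the last anchor (532 x 337) is wider than the 416x416 input, which YOLO allows.
```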

jalaldev1980 commented 5 years ago

Dears,

Is it necessary to compute the anchor values before training in order to improve the model?

I am building my own dataset to detect 6 classes using tiny YOLOv2, and I used the command below to get the anchor values. Do I need to change the width and height here if I am changing them in the cfg file?

Are the anchors below acceptable, or are the values too large? What does num_of_clusters 9 mean?

....\build\darknet\x64>darknet.exe detector calc_anchors data/obj.data -num_of_clusters 9 -width 416 -height 416

num_of_clusters = 9, width = 416, height = 416 read labels from 8297 images loaded image: 2137 box: 7411

Wrong label: data/obj/IMG_0631.txt - j = 0, x = 1.332292, y = 1.399537, width = 0.177083, height = 0.412037 loaded image: 2138 box: 7412 calculating k-means++ ...

avg IoU = 59.41 %

Saving anchors to the file: anchors.txt anchors = 19.2590,25.4234, 42.6678,64.3841, 36.4643,117.4917, 34.0644,235.9870, 47.0470,171.9500, 220.3569,59.5293, 48.2070,329.3734, 99.0149,240.3936, 165.5850,351.2881

fkoorc commented 5 years ago

Computing the anchor values first makes training converge faster, but it is not strictly necessary. Tiny YOLO is not very accurate; if you can, I advise you to use YOLOv2.

andyrey commented 5 years ago

@jalaldev1980 Let me guess: where did you get this calc_anchors flag in your command line? I didn't find it in YOLO-2; maybe it is in YOLO-3?

developer0hye commented 5 years ago

@jalaldev1980 Let me guess: where did you get this calc_anchors flag in your command line? I didn't find it in YOLO-2; maybe it is in YOLO-3?

./darknet detector calc_anchors your_obj.data -num_of_clusters 9 -width 416 -height 416

jalaldev1980 commented 5 years ago

Check the "How to improve object detection" section at https://github.com/AlexeyAB/darknet#how-to-train-to-detect-your-custom-objects

For tiny YOLO, check the comments at https://github.com/pjreddie/darknet/issues/911

Let me know if you find any other resources or advice.


NadimSKanaan commented 5 years ago

Can someone provide some insights into YOLOv3's time complexity if we change the number of anchors?

CoinCheung commented 5 years ago

Hi guys,

I learned that YOLOv3 employs 9 anchors, but there are three layers used to generate YOLO targets. Does this mean each YOLO target layer should have 3 anchors at each feature point according to its scale, as in FPN, or do we need to match all 9 anchors against one ground truth across all 3 YOLO output layers?

andyrey commented 5 years ago

I use a single set of 9 anchors for all 3 layers in the cfg file, and it works fine. I believe this set is given at one base scale and rescaled for the other 2 layers somewhere in the framework code; let someone correct me if I am wrong.
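A small sketch of how the 9 anchors map onto the 3 scales via the mask = 0,1,2 / 3,4,5 / 6,7,8 lines in the cfg (the anchor values and index grouping below are taken from the standard yolov3.cfg and are shown for illustration):

```python
# How yolov3.cfg's mask= entries split the 9 anchors across the 3 [yolo] layers
# (anchor values are the default COCO set from yolov3.cfg, in pixels).
coco_anchors = [(10, 13), (16, 30), (33, 23),       # indices 0-2
                (30, 61), (62, 45), (59, 119),      # indices 3-5
                (116, 90), (156, 198), (373, 326)]  # indices 6-8

masks = {
    "13x13 layer (large objects)":  [6, 7, 8],
    "26x26 layer (medium objects)": [3, 4, 5],
    "52x52 layer (small objects)":  [0, 1, 2],
}

for layer, idxs in masks.items():
    print(layer, "->", [coco_anchors[i] for i in idxs])
```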

weiaicunzai commented 5 years ago

Anchors are initial sizes (width, height) some of which (the closest to the object size) will be resized to the object size - using some outputs from the neural network (final feature map):

darknet/src/yolo_layer.c, lines 88 to 89 (commit 6f6e475):

```c
b.w = exp(x[index + 2*stride]) * biases[2*n]   / w;
b.h = exp(x[index + 3*stride]) * biases[2*n+1] / h;
```

  • x[...] - outputs of the neural network
  • biases[...] - anchors
  • b.w and b.h - the resulting width and height of the bounding box that will be shown on the result image

Thus, the network should not predict the final size of the object, but should only adjust the size of the nearest anchor to the size of the object.

In Yolo v3 anchors (width, height) - are sizes of objects on the image that resized to the network size (width= and height= in the cfg-file).

In Yolo v2 anchors (width, height) - are sizes of objects relative to the final feature map (32 times smaller than in Yolo v3 for default cfg-files).

Thanks, but why do darknet's YOLOv3 config files https://github.com/pjreddie/darknet/blob/master/cfg/yolov3-voc.cfg and https://github.com/pjreddie/darknet/blob/master/cfg/yolov3.cfg have different input sizes (416 and 608) but use the same anchor sizes, if YOLOv3 anchors are sizes of objects on the image resized to the network size?

andyrey commented 5 years ago

@weiaicunzai You are right, the two cfg files with different input sizes (416 and 608) have the same anchor box sizes. That seems to be a mistake. As for me, I use a utility to find anchors specific to my dataset; it increases accuracy.

CoinCheung commented 5 years ago

Hi, I have another anchor question: if I did not misunderstand the paper, there is also a positive/negative mechanism in YOLOv3, but only when we compute the confidence loss, since xywh and classification rely only on the best match. Thus the xywh and classification losses are computed between the ground truth and only one associated match. As for the confidence, the division into positives and negatives is based on the IoU value. My question is: is this IoU computed between the ground truth and the anchors, or between the ground truth and the predictions that are computed from the anchors and the model outputs (the offsets generated by the model)?

Sauraus commented 5 years ago

Say I have a situation where all my objects that I need to detect are of the same size 30x30 pixels on an image that is 295x295 pixels, how would I go about calculating the best anchors for yolo v2 to use during training?

andyrey commented 5 years ago

@Sauraus There is a special Python program (see AlexeyAB's repository on GitHub) which calculates the 5 best anchors based on your dataset's variety (for YOLO-2). It's very easy to use. Then replace the anchors string in your cfg file with the new anchor boxes. If all your objects have the same size, it would probably give you a set of identical pairs.

Sauraus commented 5 years ago

@andyrey are you referring to this: https://github.com/AlexeyAB/darknet/blob/master/scripts/gen_anchors.py by any chance?

andyrey commented 5 years ago

@Sauraus: Yes, I used this for YOLO-2 with cmd: python gen_anchors.py -filelist train.txt -output_dir ./ -num_clusters 5

and for 9 anchors for YOLO-3 I used C-language darknet: darknet3.exe detector calc_anchors obj.data -num_of_clusters 9 -width 416 -height 416 -showpause

pkhigh commented 5 years ago

Is anyone facing an issue with YOLOv3 predictions where occasionally the bounding box centre is negative, or the overall bounding box height/width exceeds the image size?

Sauraus commented 5 years ago

Yes and it's driving me crazy.

Is anyone facing an issue with YoloV3 prediction where occasionally bounding box centre are either negative or overall bounding box height/width exceeds the image size?

fkoorc commented 5 years ago

I think it is hard for the bounding box to fit your target precisely; there is always some deviation, the question is just how large the error is. If the error is very large, maybe you should check your training data and test data. Still, there are many possible reasons for that. Maybe you can post your picture?

atulshanbhag commented 5 years ago

Anchors are initial sizes (width, height) some of which (the closest to the object size) will be resized to the object size - using some outputs from the neural network (final feature map):

darknet/src/yolo_layer.c, lines 88 to 89 (commit 6f6e475):

```c
b.w = exp(x[index + 2*stride]) * biases[2*n]   / w;
b.h = exp(x[index + 3*stride]) * biases[2*n+1] / h;
```

  • x[...] - outputs of the neural network
  • biases[...] - anchors
  • b.w and b.h - the resulting width and height of the bounding box that will be shown on the result image

Thus, the network should not predict the final size of the object, but should only adjust the size of the nearest anchor to the size of the object.

In Yolo v3 anchors (width, height) - are sizes of objects on the image that resized to the network size (width= and height= in the cfg-file).

In Yolo v2 anchors (width, height) - are sizes of objects relative to the final feature map (32 times smaller than in Yolo v3 for default cfg-files).

Can someone clarify why we take the exponential of the predicted widths and heights? Why not just multiply the anchor dimensions by them directly instead of taking the exponential first?

jtlz2 commented 5 years ago

Extremely useful discussion - thanks all - have been trying to understand Azure Cognitive Services / Microsoft Custom Vision object detection. I had been wondering where their exported anchor values came from. It's now fairly clear they do transfer learning off YOLO.

jalaldev1980 commented 5 years ago

I am training on my own objects using YOLOv3 with 9 anchors and 5 classes; after 45000 iterations (4 days) the system is not detecting any objects!

Any thoughts?


ameeiyn commented 5 years ago

@AlexeyAB How do you get the initial anchor box dimensions after clustering? The widths and heights after clustering are all numbers less than 1, but the anchor box dimensions are greater or less than 1. How do you get the anchor box dimensions?

YOLO's anchors are specific to the dataset they are trained on (the default set is based on PASCAL VOC). The authors ran k-means clustering on the normalized widths and heights of the ground-truth bounding boxes and obtained 5 values.

The final values are not image coordinates but grid values. YOLO's default set, anchors = 1.3221, 1.73145, 3.19275, 4.00944, 5.05587, 8.09892, 9.47112, 4.84053, 11.2364, 10.0071, means the width and height of the first anchor are slightly over one grid cell [1.3221, 1.73145], and the last anchor almost covers the whole image [11.2364, 10.0071], considering the image is a 13x13 grid.

Hope this gives you a bit clearer idea if not complete.

ameeiyn commented 5 years ago

Anchors are initial sizes (width, height) some of which (the closest to the object size) will be resized to the object size - using some outputs from the neural network (final feature map): darknet/src/yolo_layer.c, lines 88 to 89 (commit 6f6e475):

```c
b.w = exp(x[index + 2*stride]) * biases[2*n]   / w;
b.h = exp(x[index + 3*stride]) * biases[2*n+1] / h;
```

  • x[...] - outputs of the neural network
  • biases[...] - anchors
  • b.w and b.h - the resulting width and height of the bounding box that will be shown on the result image

Thus, the network should not predict the final size of the object, but should only adjust the size of the nearest anchor to the size of the object. In Yolo v3 anchors (width, height) - are sizes of objects on the image that resized to the network size (width= and height= in the cfg-file). In Yolo v2 anchors (width, height) - are sizes of objects relative to the final feature map (32 times smaller than in Yolo v3 for default cfg-files).

Can someone clarify why we take the exponential of predicted widths and heights? Why not just multiple anchor coordinates with them instead of taking exponential first?

box_w[i, j, b] = anchor_w[b] * exp(delta_w[i, j, b]) * 32
box_h[i, j, b] = anchor_h[b] * exp(delta_h[i, j, b]) * 32

where i and j are the row and column in the grid (0 – 12) and b is the detector index (0 – 4). It's OK for the predicted box to be wider and/or taller than the original image, but it does not make sense for the box to have a negative width or height. That's why we take the exponent of the predicted number: if the predicted delta_w is smaller than 0, exp(delta_w) is a number between 0 and 1, making the box smaller than the anchor box; if delta_w is greater than 0, then exp(delta_w) is a number > 1, which makes the box wider; and if delta_w is exactly 0, then exp(0) = 1 and the predicted box has exactly the same width as the anchor box. We multiply by 32 because the anchor coordinates are defined on the 13×13 grid and each grid cell covers 32 pixels in the 416×416 input image.
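A tiny worked example of those formulas (the delta values here are hypothetical; the anchor is the second default YOLOv2 anchor):

```python
import math

anchor_w, anchor_h = 3.19275, 4.00944   # grid-cell units (a default YOLOv2 anchor)
delta_w, delta_h = 0.5, -0.5            # hypothetical network outputs

box_w = anchor_w * math.exp(delta_w) * 32   # ~168 px, wider than the anchor
box_h = anchor_h * math.exp(delta_h) * 32   # ~78 px, shorter than the anchor
print(box_w, box_h)
```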

ameeiyn commented 5 years ago

Anchors are initial sizes (width, height) some of which (the closest to the object size) will be resized to the object size - using some outputs from the neural network (final feature map): darknet/src/yolo_layer.c, lines 88 to 89 (commit 6f6e475):

```c
b.w = exp(x[index + 2*stride]) * biases[2*n]   / w;
b.h = exp(x[index + 3*stride]) * biases[2*n+1] / h;
```

  • x[...] - outputs of the neural network
  • biases[...] - anchors
  • b.w and b.h - the resulting width and height of the bounding box that will be shown on the result image

Thus, the network should not predict the final size of the object, but should only adjust the size of the nearest anchor to the size of the object. In Yolo v3 anchors (width, height) - are sizes of objects on the image that resized to the network size (width= and height= in the cfg-file). In Yolo v2 anchors (width, height) - are sizes of objects relative to the final feature map (32 times smaller than in Yolo v3 for default cfg-files).

Thanks, but why do darknet's YOLOv3 config files https://github.com/pjreddie/darknet/blob/master/cfg/yolov3-voc.cfg and https://github.com/pjreddie/darknet/blob/master/cfg/yolov3.cfg have different input sizes (416 and 608) but use the same anchor sizes, if YOLOv3 anchors are sizes of objects on the image resized to the network size?

@weiaicunzai You are right, the two cfg files with different input sizes (416 and 608) have the same anchor box sizes. That seems to be a mistake. As for me, I use a utility to find anchors specific to my dataset; it increases accuracy.

According to my understanding, YOLO uses multi-scale training, randomly switching the input size (instead of always training on 416x416, it varies up to 608x608 in multiples of 32) so that it adapts to detection at different input sizes, and maybe to satisfy this multi-scale training they used the same anchor sizes.

In simple words, they used anchors that fit the smallest input dimensions, since those can also be used for bigger dimensions; the reverse would give errors, as the anchors would become larger than the image itself.

I also feel this is the reason (a mismatch between the dimensions of the training/testing images and the default anchors) why many people are getting anchors larger than the image itself.

zrion commented 5 years ago

I'm not quite sure what the main reason is for YOLO using multiple bounding boxes per grid cell. An answer I can find on the web is that it allows multiple aspect ratios in the prediction. However, I don't think that's completely true, because if I understand correctly, YOLO optimizes the offset of a prior proportionally to the size of the image, so the box can expand to fit anything in the image and the number of boxes doesn't really matter. My guess is that maybe it's a heuristic to speed up the computation or to avoid getting stuck in a bad local optimum. Or is there another good explanation here?

ameeiyn commented 5 years ago

I don’t quite sure what is the main reason why yolo uses multiple bounding boxes for a grid cell. An answer I can find on web is for multiple aspect ratio in the prediction. However I don’t think it’s completely true because if I understand correctly, yolo optimizes the offset of a prior proportional to the size of the image, and then the box can expand to anywhere in the image, and the number of boxes doesn’t really matter. My guess is that maybe it’s a heuristic move to speed up the computation or to avoid sticking to a bad local optima. Or is there another good explanation here?

First of all, correct terminology would be to not interchange bounding boxes and anchors. To answer your question, YOLO uses multiple anchors because each anchor can be specialized for a particular shape/size. In a way you're absolutely right that a single anchor could use offsets to get the detections, but the accuracy would be really bad, since one anchor would have to generalize, in the same picture, to objects as different as a ball in the hand of a kid and the Eiffel Tower. Hope it provides the gist.

zrion commented 5 years ago

First of all, correct terminology would be to not interchange bounding boxes and anchors. To answer your question, YOLO uses multiple anchors as each anchor can be generalized for a particular shape/size. In a way you're absolutely right that anchor can use offset to get the detection but accuracy would be really bad as it has to generalize (in same picture) objects as different as a ball in the hand of a kid and an Eiffel tower has to be covered by a single anchor. Hope it provides a gist.

It doesn't convince me. It should be noted that in the same picture, each grid cell has a number of anchor boxes whose sizes will be optimized, which means, intuitively, that each grid cell predicts separately. Also, as I mentioned, since the offsets of the anchor boxes are optimized, they can expand to whatever values fit the objects in the image, no matter which aspect ratio the objects have. So your argument that "it has to generalize (in same picture) objects as different as a ball in the hand of a kid and an Eiffel tower has to be covered by a single anchor" does not really make sense here.

ameeiyn commented 5 years ago

First of all, correct terminology would be to not interchange bounding boxes and anchors. To answer your question, YOLO uses multiple anchors as each anchor can be generalized for a particular shape/size. In a way you're absolutely right that anchor can use offset to get the detection but accuracy would be really bad as it has to generalize (in same picture) objects as different as a ball in the hand of a kid and an Eiffel tower has to be covered by a single anchor. Hope it provides a gist.

It doesn't convince me well. It should be noted that in same picture, each grid cell has a number of anchor boxes whose size will be optimized, which means intuitively, each grid cell predicts separately. Also, as I mentioned, as the offsets of anchor boxes are optimized, they can expand to whatever values that fit into the image objects, no matter which aspect ratio the objects have. So, your argument about "it has to generalize (in same picture) objects as different as a ball in the hand of a kid and an Eiffel tower has to be covered by a single anchor." does not really make sense here.

The anchors are based on the dataset you have, so the COMMON aspect ratios and sizes of the MOST frequently occurring classes become the anchor sizes. It's true that offsets let anchors grab objects which are not of the same dimensions as the anchors. But the statement "they can expand to whatever values that fit into the image objects, no matter which aspect ratio the objects have" also doesn't hold, otherwise there would be no need for anchors at all. It's more related to the quantity of objects in that grid cell.

Another viewpoint on anchors that might help you visualize: grids came into the picture to restrict location. If you don't use a grid and instead use the whole image, then with two horses (one to the left of the image and another to the right) you will get boxes in the middle, i.e. the average in both coordinates and in the size of the bounding box. So you take a grid cell and tell it: you have 5 boxes, please use them to fit objects. If it had only one anchor it could give only one prediction. If there are 3 objects of the same size, again it gives only one detection. Therefore 5 anchors give it the opportunity to detect up to 5 objects, provided they are of different sizes.

andyrey commented 5 years ago

And I have a related question: if I have 5 anchors (or 9 in YOLO-3), I can detect up to 5 (9) objects in each cell, yes? But what objects do they (the anchors) represent if the number of classes > the number of anchors?

ameeiyn commented 5 years ago

And I have related question: if I have 5 anchors (or 9 in YOLO-3), I can detect up to 5 (9) objects in each cell, yes? But what objects they(anchors) represent, if number of classes > number of anchors?

I am a little confused by your question, so I will answer based on what I have understood.

If you mean more classes than anchors in general: suppose you have 7 classes and the default 5 anchors. There are two possibilities.

  1. You have only 4 unique sizes (say [2,2], [3,4], [6,11] and [9,15] cover all 7 classes). If this is the case there is no issue whatsoever, as you have ample anchors available.
  2. All 7 classes are of different sizes and you have only 5 anchors. It will adapt most closely to 5 classes (I am not sure how it decides, maybe it chooses the classes with the most labeled data available), and for the remaining two it can just increase the offsets to adapt to those detections. In this case it finally comes down to the number of object centers in a single grid cell.

If you're asking about more classes in the same grid cell than anchors: keep in mind that only the grid cell into which the center of an object falls is responsible for detecting that object. It is highly improbable that the centers of each of your (5, in your case) detections fall inside the same grid cell (as the grids are quite fine, 13x13 or 19x19 by default, or finer if you want). Say, hypothetically, you have more than 7 classes (all of different sizes) in the same grid cell and only 5 anchors; then you will get only 5 detections, whose sizes or ratios will be very close to the anchors.

To get a more robust result it is important to analyze your dataset and classes and set the number of anchors accordingly. Their sizes can also be set by k-means if you wish to go a little further than the defaults. The number 5 is merely the authors' pick: they analyzed their dataset (if I remember correctly, Pascal VOC) and found it an optimal balance between average IoU and computation.
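A small sketch of that assignment (not darknet code): the ground-truth box goes to the grid cell containing its center and to the anchor with the best (w, h) IoU. The box, grid size, and anchor values below are purely illustrative.

```python
def assign_gt(box_xywh, anchors, grid=13, net=416):
    """box_xywh: (cx, cy, w, h) in pixels; anchors: list of (w, h) in pixels."""
    cx, cy, w, h = box_xywh
    cell = net / grid
    col, row = int(cx // cell), int(cy // cell)   # the responsible grid cell
    # Best anchor = highest IoU when box and anchor are both centered at the origin.
    def iou(a_w, a_h):
        inter = min(w, a_w) * min(h, a_h)
        return inter / (w * h + a_w * a_h - inter)
    best = max(range(len(anchors)), key=lambda i: iou(*anchors[i]))
    return row, col, best

anchors = [(10, 13), (16, 30), (33, 23), (30, 61), (62, 45),
           (59, 119), (116, 90), (156, 198), (373, 326)]
print(assign_gt((200, 120, 60, 110), anchors))  # -> (3, 6, 5)
```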

@andyrey, on a side note, I have gained more intuition about YOLO from your comments on this thread, and yet you are asking questions that are kind of above my head. Are you messing with me in any way? :stuck_out_tongue_closed_eyes:

andyrey commented 5 years ago

@ameeiyn Thank you for your explanations, they are valuable for understanding what I misunderstand! I use my own anchors, and this helps improve detection, no doubt. I used a set of 9 anchors for 1 class of deformable objects (people in different poses), and a set of 9 anchors for 22 classes of different symbols, although those are all the same size. In both cases it works fine after using my own set of anchors. But I never tried the case formulated above: many classes with very different sizes and ratios. I believe in this case the k-means clustering subroutine just uses the whole dataset, mixed over all classes, to generate a single optimal set of anchors, which are used first to find "any class object" with confidence > threshold. That is my understanding; maybe I am wrong.

ameeiyn commented 5 years ago

@andyrey Thank You and Welcome. I agree with your understanding :100:

hvudeshi commented 5 years ago

If my input dimension is 224x224, then can I use the same anchor sizes in the cfg (like 10,13, 16,30, 33,23, 30,61, 62,45, 59,119, 116,90, 156,198, 373,326), or do I need to change it? Also, if I have to change that, then will linear scaling work?

tabmoo commented 4 years ago

Can someone explain to me how the ground truth tensors are constructed in, for example, YOLO3? Do we use anchor boxes' values in this process?

If I have a 416x416 image and 80 classes, I understand that I (or some script) have to construct 3 ground-truth tensors: 13x13x255, 26x26x255 and 52x52x255. A 1x1x255 vector for a cell containing an object center would have three 1x1x85 parts, each of which 'corresponds' to one anchor box.

So, what do I do next? How do I specify the (x, y, w, h) values in each of these three 1x1x85 parts? Are the anchor box values that were determined on the dataset used for obtaining the (x, y, w, h) prior values, or only the ground-truth box values from the images?

gouravsinghbais commented 4 years ago

Can anyone explain the process flow? I am getting different explanations from different sources. I am not clear whether YOLO first divides the image into an n x n grid and then does classification, or whether it classifies the objects in one pass. It would be very helpful if someone explained the process from the start.

klopezlinar commented 4 years ago

Hi all,

We're struggling to get our YOLOv3 working for a 2-class detection problem (the sizes of the objects of both classes vary, are similar, and are generally small, and the size itself does not help differentiate the object type). We think that the training is not working due to some problem with the anchor boxes, since we can clearly see that, depending on the assigned anchor values, yolo_output_0, yolo_output_1 or yolo_output_2 fail to return a loss value different from 0 (for the xy, hw and class components). However, even though there are multiple threads about anchor boxes, we cannot find a clear explanation of how they are assigned specifically for YOLOv3.

So far, what we're doing to determine the size of the boxes is:

  1. We run a clustering method on the normalized ground-truth bounding boxes (normalized to the original size of the image) and get the centroids of the clusters. In our case we have 2 clusters, and the centroids are roughly (0.087, 0.052) and (0.178, 0.099).
  2. Then we rescale the values according to the rescaling we are going to apply to the images during training. We are working with rectangular images of (256, 416), so we get bounding boxes of (22,22) and (46,42). Note that we have rounded the values, as we have read that YOLOv3 expects actual pixel values.
  3. Since we compute anchors at 3 different scales (3 skip connections), the previous anchor values correspond to the large scale (52). The anchors for the other two scales (13 and 26) are calculated by dividing the first anchors by 2 and by 4.

We are not even sure if we are correct up to this point. If we look at the code in the original models.py what we see is the following:

```python
yolo_anchors = np.array([(10, 13), (16, 30), (33, 23), (30, 61), (62, 45),
                         (59, 119), (116, 90), (156, 198), (373, 326)], np.float32) / 416
yolo_anchor_masks = np.array([[6, 7, 8], [3, 4, 5], [0, 1, 2]])
```

So, there are 9 anchors, ordered from smaller to larger, and the anchor_masks determine the resolution at which they are used; is this correct? In fact, our first question is: are these 9 anchors, or 3 anchors at 3 different scales? If so, how are they calculated? We know about the gen_anchors script in YOLOv2 and a similar script in YOLOv3, but we don't know whether they calculate 9 clusters and then order them by size, or whether they follow a procedure similar to ours.

Additionally, we don't fully understand why these boxes are divided by 416 (the image size). This would mean having anchors that are not integer pixel values, which was stated to be necessary for YOLOv3.

We would be really grateful if someone could provide us with some insight into these questions and help us better understand how YOLOv3 performs.

Thanks and regards Karen

andyrey commented 4 years ago

Why do you use 2 clusters for your dataset? In YOLO-3 you can prepare 9 anchors, regardless of the number of classes. Each scale of the net uses 3 of them (3x3=9). Look at the lines mask = 0,1,2, then mask = 3,4,5, and mask = 6,7,8 in the cfg file.

Sauraus commented 4 years ago

When you say small, can you quantify that? From experience I can say that YOLO v2/3 is not great on images below 35x35 pixels.

klopezlinar commented 4 years ago

Why do you use 2 clusters for your dataset? In YOLO-3 you can prepare 9 anchors, regardless class number. Each of the scale of net uses 3 of them (3x3=9). Look at line mask = 0,1,2 , then mask = 3,4,5, and mask = 6,7,8 in cfg file.

Thanks for your response. We use 2 because, if we look at our data, the sizes of our bounding boxes can be clustered into 2 groups (even one would be enough), so we don't need to use 3 of them. We did not set 2 anchor boxes because of the number of classes.

klopezlinar commented 4 years ago

When you say small can you quantify that? From experience I can say that YOLO V2/3 is not great on images below 35x35 pixels.

Hi Sauraus, thanks for your response. The original size of our images is roughly (2000-5000)x(4800-7000), and the average size of the object bounding boxes is 300x300. Do you think this is a problem?