pjreddie / darknet

Convolutional Neural Networks
http://pjreddie.com/darknet/

Can someone clarify the anchor box concept used in Yolo? #568

Open hbzhang opened 6 years ago

hbzhang commented 6 years ago

I know this might be too simple for many of you, but I cannot seem to find any literature that clearly and definitively illustrates the idea and concept of anchor boxes in YOLO (v1, v2, and v3). Thanks!

andyrey commented 4 years ago

As I understand it, your dataset objects differ only in size? Then you should detect all of them as one class and differentiate them with a simple size threshold.

klopezlinar commented 4 years ago

> As I understand it, your dataset objects differ only in size? Then you should detect all of them as one class and differentiate them with a simple size threshold.

No, they don't differ in size; they differ in content/appearance.

Sauraus commented 4 years ago

> As I understand it, your dataset objects differ only in size? Then you should detect all of them as one class and differentiate them with a simple size threshold.

> No, they don't differ in size; they differ in content/appearance.

Content = class (cat/dog/horse, etc.); appearance = variance within a class (black/red/brown cat).

Is that how you classify those words? :)

klopezlinar commented 4 years ago

> As I understand it, your dataset objects differ only in size? Then you should detect all of them as one class and differentiate them with a simple size threshold.

> No, they don't differ in size; they differ in content/appearance.

> Content = class (cat/dog/horse, etc.); appearance = variance within a class (black/red/brown cat).

> Is that how you classify those words? :)

We have breast masses, some of them malignant, some of them benign. Our classes are then "malignant" and "benign".

andyrey commented 4 years ago

Does that mean you are dealing with grayscale pictures, with the content occupying the whole picture area, so that you have to classify the structure of the tissue without detecting compact objects in it? Can you point to some such pictures?

klopezlinar commented 4 years ago

> Does that mean you are dealing with grayscale pictures, with the content occupying the whole picture area, so that you have to classify the structure of the tissue without detecting compact objects in it? Can you point to some such pictures?

Yes, they are grayscale images (we have already changed the code to use 1 channel). The content usually occupies half the image, so we are also trying to crop it in order to reduce the amount of background. The objects to detect are masses, sometimes compact, sometimes more diffuse. Then, from a clinical point of view, the masses are classified as malignant or benign according to some of their characteristics (borders, density, shape...). Here are some sample images (resized to 216×416):

[two sample images attached]

andyrey commented 4 years ago

These objects (tumors) can be of different sizes, so you shouldn't restrict yourself to 2 anchor sizes; use as many as possible, that is, 9 in this case. If that is redundant, the clustering program will simply yield 9 closely sized anchors, which is not a problem. More importantly, this channel is probably not 8-bit but deeper, and quantizing from 16 bits down to 8 may lose valuable information. Or maybe split the 16-bit data into two different channels. I don't know, but this is an issue to think about...
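On the 16-bit point, here is a minimal sketch of the "split into two channels" idea (NumPy only; the random array is just a stand-in for a real 16-bit grayscale image, and the loading step is only an example):

```python
import numpy as np

# Stand-in for a 16-bit grayscale image (e.g. loaded with cv2.imread(path, cv2.IMREAD_UNCHANGED)).
img16 = np.random.default_rng(0).integers(0, 2**16, size=(416, 256), dtype=np.uint16)

# Split into two 8-bit planes instead of quantizing 16 -> 8 bits.
high = (img16 >> 8).astype(np.uint8)   # most significant byte: coarse intensity
low = (img16 & 0xFF).astype(np.uint8)  # least significant byte: fine detail

# Stack as a 2-channel input for the network.
two_channel = np.stack([high, low], axis=-1)
print(two_channel.shape)  # (416, 256, 2)
```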

klopezlinar commented 4 years ago

> These objects (tumors) can be of different sizes, so you shouldn't restrict yourself to 2 anchor sizes; use as many as possible, that is, 9 in this case. If that is redundant, the clustering program will simply yield 9 closely sized anchors, which is not a problem. More importantly, this channel is probably not 8-bit but deeper, and quantizing from 16 bits down to 8 may lose valuable information. Or maybe split the 16-bit data into two different channels. I don't know, but this is an issue to think about...

OK, we will try with the 9 anchors. Regarding the 16-bit, we are using TF2, so I think that's not a problem...

Now we are able to detect some masses, but only when we lower the score_threshold in the detection.

ameeiyn commented 4 years ago

> So far, what we're doing to know the size of the boxes is:
> 1- We run a clustering method on the normalized ground-truth bounding boxes (relative to the original size of the image) and get the centroids of the clusters. In our case we have 2 clusters, and the centroids are roughly (0.087, 0.052) and (0.178, 0.099).
> 2- Then we rescale the values according to the resizing we are going to apply to the images during training. We are working with rectangular images of (256, 416), so we get bounding boxes of (22, 22) and (46, 42). Note that we have rounded the values, as we have read that YOLOv3 expects actual pixel values.
> 3- Since we compute anchors at 3 different scales (3 skip connections), the previous anchor values correspond to the largest scale (52). The anchors for the other two scales (13 and 26) are calculated by dividing the first anchors by 2 and by 4.

First of all, sorry to join the party late. From what I understand, you have two classes, malignant and benign, which are merely the output classes and don't necessarily have to be of the same size (in dimensions of the bounding boxes). Therefore (as @andyrey suggested) I would suggest either using the default number and sizes of anchors, or running k-means on your dataset to obtain the best sizes and number of anchors. I am not sure about the sizes, but you can at least increase the number of anchors, since the objects may have different aspect ratios (even if the tumours are of the same size, which again might not be the case), and I think that would be favourable for your application.

Are all the input images of fixed dimensions, i.e. (256x416)? You have also mentioned two bounding boxes of (22, 22) and (46, 42). Are the bounding boxes always of these dimensions? If so, there might be something wrong: they may start from those values, but they should be able to form a box around the tumours as tightly as possible. I need more clarification here.

Although there is a possibility you might get results, I am not quite sure YOLO is the ideal algorithm for non-RGB input. It has been quite some time since I worked with YOLO and went through the scripts and papers, so I am not completely sure, but I would suggest you first test it by training on your dataset without making a lot of changes, and then fine-tune to get more accuracy if you see promising results in the first case.
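As a rough illustration of the k-means step described in the quoted procedure and suggested above, here is a minimal sketch. It assumes scikit-learn's KMeans with plain Euclidean distance (the YOLOv2/v3 papers use an IoU-based distance instead), a stand-in array gt_wh of normalized ground-truth (width, height) pairs, and a 256x416 network input; all names and numbers are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for your own normalized ground-truth (width, height) pairs, shape (N, 2).
rng = np.random.default_rng(0)
gt_wh = rng.uniform(0.03, 0.25, size=(200, 2))

# Cluster the box shapes; 9 clusters is the usual YOLOv3 setup (3 anchors x 3 scales).
k = 9
centroids = KMeans(n_clusters=k, n_init=10, random_state=0).fit(gt_wh).cluster_centers_

# Convert normalized centroids to pixels of the resized network input
# (256x416 here, matching the quoted message; swap in your own cfg width/height).
net_w, net_h = 256, 416
anchors_px = np.round(centroids * np.array([net_w, net_h])).astype(int)

# Sort by area and assign 3 anchors per detection scale (small, medium, large objects).
anchors_px = anchors_px[np.argsort(anchors_px.prod(axis=1))]
print(anchors_px.reshape(3, 3, 2))
```

With anchors expressed this way (pixels of the network input), the stock darknet YOLOv3 cfg lists all 9 values once and each [yolo] layer picks its subset via the mask= entries, rather than the anchors being divided by 2 and 4 per scale.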

pra-dan commented 4 years ago

@ameeiyn @andyrey Thanks for clarifying how to get w and h from the predictions and anchor values. I think I have obtained the box w and h successfully using

box_w = anchor_sets[anchor_index] * exp(offset_w) * 32
box_h = anchor_sets[anchor_index+1] * exp(offset_h) * 32

where offset_w / offset_h are the predicted values for w and h. But for the x and y values of the bounding boxes, I am simply multiplying the predicted coordinates (x and y) by the image width and height. I am getting poor predictions as well as dislocated boxes: [screenshot attached]

Can you guys kindly help?
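For what it's worth, in YOLOv2/v3 the centre is not a direct image fraction: the raw x and y outputs are passed through a sigmoid and added to the column/row index of the grid cell that made the prediction, and only then normalized by the grid size. A minimal sketch of decoding one box this way (variable names are illustrative, and the anchor is assumed to be given in network-input pixels as in YOLOv3):

```python
import math

def decode_box(tx, ty, tw, th, col, row, anchor_w, anchor_h,
               grid_w, grid_h, img_w, img_h):
    """Decode one raw YOLO prediction into pixel coordinates of the resized input image."""
    sigmoid = lambda v: 1.0 / (1.0 + math.exp(-v))

    # Centre: sigmoid offset inside the cell, plus the cell indices, normalized by grid size.
    bx = (col + sigmoid(tx)) / grid_w * img_w
    by = (row + sigmoid(ty)) / grid_h * img_h

    # Size: anchor scaled by exp() of the raw output (anchor already in input pixels here).
    bw = anchor_w * math.exp(tw)
    bh = anchor_h * math.exp(th)

    # Convert centre/size to corner coordinates for drawing.
    return bx - bw / 2, by - bh / 2, bx + bw / 2, by + bh / 2

# Example: cell (col=6, row=4) of a 13x13 grid on a 416x416 input, anchor 116x90 px.
print(decode_box(0.2, -0.1, 0.0, 0.0, 6, 4, 116, 90, 13, 13, 416, 416))
```

The `* 32` in the snippet above only makes sense if the anchors are expressed in feature-map cells (the YOLOv2 convention); if they are already in network-input pixels (YOLOv3 cfg files), the stride multiplication should be dropped. Multiplying the raw x and y directly by the image size skips both the sigmoid and the cell offset, which would explain the dislocated boxes.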

mark198181 commented 4 years ago

I want to learn this, please.

easy-and-simple commented 4 years ago

Your explanations are useless, like your existence, obviously. Only real morons would explain pictures with words instead of drawing them. Are there any normal humans who can draw a few pictures of how anchors look and work?

apd888 commented 4 years ago

This may be fundamental: what if I train the network with an object at location (x, y), but then the same object appears at (x+10, y) in a picture? How can YOLO detect the physical location?

rajan780 commented 3 years ago

> Anchors are initial sizes (width, height), some of which (the closest to the object size) will be resized to the object size, using some outputs from the neural network (final feature map):
>
> https://github.com/pjreddie/darknet/blob/6f6e4754ba99e42ea0870eb6ec878e48a9f7e7ae/src/yolo_layer.c#L88-L89
>
>   • x[...] - outputs of the neural network
>   • biases[...] - anchors
>   • b.w and b.h - resulting width and height of the bounding box that is shown on the result image
>
> Thus, the network should not predict the final size of the object, but should only adjust the size of the nearest anchor to the size of the object.
>
> In YOLOv3, anchors (width, height) are sizes of objects on the image resized to the network size (width= and height= in the cfg file).
>
> In YOLOv2, anchors (width, height) are sizes of objects relative to the final feature map (32 times smaller than in YOLOv3 for the default cfg files).

Hi @AlexeyAB, I understand that YOLOv3 anchors are sizes of objects on the image resized to the network size, while in YOLOv2 anchors are sizes of objects relative to the final feature map. But my question is: the way of calculating width and height is the same for both YOLOv3 and YOLOv2, i.e. width = p_w * e^(t_w) and height = p_h * e^(t_h). Then why does YOLOv3 use anchors at network size while YOLOv2 uses anchors at final-feature-map size?
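One way to see it: the exp() formula is indeed the same, but the decoded size is then divided by the size of whatever grid the anchors are expressed in, so the final normalized box comes out the same either way. A small sketch contrasting the two conventions (loosely mirroring darknet's get_region_box and get_yolo_box; names and numbers are illustrative):

```python
import math

def decode_wh_v2(tw, th, anchor_w, anchor_h, grid_w, grid_h):
    # YOLOv2 (region layer): anchors are in feature-map cells,
    # so the result is normalized by the feature-map size.
    return (anchor_w * math.exp(tw) / grid_w,
            anchor_h * math.exp(th) / grid_h)

def decode_wh_v3(tw, th, anchor_w, anchor_h, net_w, net_h):
    # YOLOv3 (yolo layer): anchors are in pixels of the network input,
    # so the result is normalized by the network input size.
    return (anchor_w * math.exp(tw) / net_w,
            anchor_h * math.exp(th) / net_h)

# The same physical box written in the two conventions (stride 32):
print(decode_wh_v2(0.0, 0.0, 3.0, 2.0, 13, 13))      # anchor given as 3x2 grid cells
print(decode_wh_v3(0.0, 0.0, 96.0, 64.0, 416, 416))  # anchor given as 96x64 input pixels
```

In both versions the network output t_w only adjusts the chosen anchor; the difference is purely the units in which the anchor values are written in the cfg file, matched by the divisor used in the decode.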

Vedant1parekh commented 3 years ago

> [quotes the same explanation of anchors as above]

How many anchor boxes are needed in YOLOv4?

mathemaphysics commented 2 years ago

So far, reading through this whole three-year-long thread, I've concluded that it's probably best just to re-read the papers. There are diagrams in the papers. Both here and, for the most part, in the papers, it is not made clear whether the anchor boxes are (x, y, w, h) in the input image or in the output feature layers (plural because of the skip connections).

I'm seeing no connection made between the input and the output of the network at all. Literally everything else, from batch normalization to internal covariate shift, makes sense to me. The anchor boxes don't. It would really help to have a better summary.