ultralytics / yolov5

YOLOv5 🚀 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

Want to figure out critical algorithm of Detect layer #471

Closed TaoXieSZ closed 4 years ago

TaoXieSZ commented 4 years ago

❔Question

Hi, I want to figure out the intuition behind bbox detection. In yolov3, the output can be written as: [image: the YOLOv3 box equations, b_x = σ(t_x) + c_x, b_y = σ(t_y) + c_y, b_w = p_w·e^{t_w}, b_h = p_h·e^{t_h}]

So, in yolov5, I looked into the source code: https://github.com/ultralytics/yolov5/blob/1e95337f3aec4c12244802bb6e493b07b27aa795/models/yolo.py#L21-L38 and tried to formulate it: [image: proposed YOLOv5 box equations, b_xy = (2σ(t_xy) − 0.5 + c_xy)·stride, b_wh = p_wh·(2σ(t_wh))²]

Am I right?

glenn-jocher commented 4 years ago

@ChristopherSTAN yes this looks correct! Typically this would be written as 2sigma() rather than sigma() x 2 though.

TaoXieSZ commented 4 years ago

@glenn-jocher Awesome! How did you come up with this way of getting the prediction? It is so brilliant.

glenn-jocher commented 4 years ago

The original yolo/darknet box equations have a serious flaw. Width and Height are completely unbounded as they are simply out=exp(in), which is dangerous, as it can lead to runaway gradients, instabilities, NaN losses and ultimately a complete loss of training. yolov3 suffers from this problem as well as yolov4.

For yolov5 I made sure to patch this error by sigmoiding all model outputs, while also ensuring that the centerpoint remained unchanged 1=fcn(0), so nominal zero outputs from the model would cause the nominal anchor size to be used. The current eqn constrains anchor multiples from a minimum of 0 to a maximum of 4, and the anchor-target matching has also been updated to be width-height multiple based, with a nominal upper threshold hyperparameter of 4.0.

The original thread is https://github.com/ultralytics/yolov3/issues/168
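
For readers following along, here is a minimal sketch of the two schemes side by side (the unbounded yolov3/darknet wh decode versus the sigmoid-bounded form described above). The tensor names and helper functions are illustrative, not the exact repo code:

```python
import torch

def decode_v3_wh(t_wh, anchor_wh):
    # original yolo/darknet form: wh is unbounded, exp() can blow up for large inputs
    return torch.exp(t_wh) * anchor_wh

def decode_v5(t_xy, t_wh, grid_xy, anchor_wh, stride):
    # sigmoid-bounded form discussed above:
    #   xy: 2*sigmoid(t) - 0.5 + cell offset, so the center can move slightly past
    #       the cell edges without pushing the sigmoid into saturation
    #   wh: (2*sigmoid(t))**2 * anchor, bounded to 0-4x the anchor size,
    #       with a zero input giving exactly 1x the anchor (fcn(0) = 1)
    xy = (torch.sigmoid(t_xy) * 2.0 - 0.5 + grid_xy) * stride
    wh = (torch.sigmoid(t_wh) * 2.0) ** 2 * anchor_wh
    return xy, wh
```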

glenn-jocher commented 4 years ago

@ChristopherSTAN BTW, you mentioned you were experimenting with lowering hyp['anchor_t']: 4.0, # anchor-multiple threshold paired with an increase in anchor count. This is an interesting approach, but I just realized it would make sense to take this a step further and modify the actual wh function as well to reduce the range from 0-4 to 0-2, otherwise half of your output space is unused, which is a bad design decision, as your neuron outputs may lose up to half of their precision capability.

You can accomplish this by modifying the exponent in the equation to 1.0, which is mathematically equivalent to removing it altogether:

y[..., 2:4] = (y[..., 2:4] * 2) ** 1.0 * self.anchor_grid[i]  # wh
            = y[..., 2:4] * 2 * self.anchor_grid[i]  # wh 

This change would need to occur in two places: 1) Detect() module, 2) compute_loss() box calculation: https://github.com/ultralytics/yolov5/blob/1e95337f3aec4c12244802bb6e493b07b27aa795/utils/utils.py#L472-L475
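
For context, a rough sketch of how the same equations appear on the loss side (paraphrasing the compute_loss() lines linked above; names are approximate, not a verbatim copy):

```python
import torch

def loss_side_box(ps, anchors):
    """Rebuild predicted boxes inside the loss, mirroring the Detect() decode.

    ps      : (n, >=4) raw outputs for target-matched predictions (tx, ty, tw, th, ...)
    anchors : (n, 2) anchor wh for each matched prediction, in grid units
    """
    pxy = ps[:, :2].sigmoid() * 2.0 - 0.5               # xy offset around the grid cell
    pwh = (ps[:, 2:4].sigmoid() * 2.0) ** 2 * anchors   # wh as a 0-4x anchor multiple
    return torch.cat((pxy, pwh), 1)                     # (n, 4) xywh in grid units
```

If the wh exponent is changed in Detect(), the same change has to be made here so that training and inference stay consistent.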

TaoXieSZ commented 4 years ago

@glenn-jocher I am afraid I have not considered it that deeply, LOL. Maybe you are thinking of another DL pro.

(Apparently I am not, for now.)

But I will try! Thanks for your explanation.

TaoXieSZ commented 4 years ago

@glenn-jocher I followed your idea and set hyp['anchor_t'] = 3.0. Will it work?

glenn-jocher commented 4 years ago

@ChristopherSTAN don't worry, the idea is pretty simple. A neuron can control outputs in a certain range defined by the above equations, the default being 0-4. If you reduce the hyperparameter that controls the matching threshold to 2.0, then boxes are only matched to anchors when the box is less than 2x the anchor size and greater than 1/2x the anchor size. So if an anchor size is 10 pixels, then that neuron can match labels between 5-20 pixels in size, but it can output a box shape from 0-40 pixels in size. So it is wasting 5/8 of its output span. It has to fit all of its output between 5-20, which by definition gives it less fine control for tiny corrections, which will reduce mAP.

So for best results, you want the neuron to have output authority over the entire training space you want it to predict. Even with the default settings, I see I am wasting a bit of training space: with the defaults, the 10-pixel anchor neuron is matched to labels between 2.5 and 40 pixels but can output sizes from 0 to 40, so I am currently wasting about 6% of the output space.
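
To make that matching rule concrete, here is a small sketch of width-height-ratio matching as described above (a hypothetical helper, not the repo's build_targets code):

```python
import torch

def anchor_match(gt_wh, anchor_wh, anchor_t=4.0):
    """Width-height-ratio anchor matching rule (sketch).

    A target matches an anchor when neither its width nor its height is more than
    anchor_t times larger or smaller than the anchor's.
    gt_wh     : (n, 2) target widths/heights
    anchor_wh : (m, 2) anchor widths/heights (same units as gt_wh)
    returns   : (m, n) boolean match matrix
    """
    r = gt_wh[None, :, :] / anchor_wh[:, None, :]        # (m, n, 2) wh ratios
    return torch.max(r, 1.0 / r).max(2)[0] < anchor_t    # worst-case ratio under threshold

# Example: a 10-pixel anchor with anchor_t=2.0 only matches targets of 5-20 px,
# while the output equation can still produce boxes anywhere in 0-40 px.
print(anchor_match(torch.tensor([[7.0, 18.0]]), torch.tensor([[10.0, 10.0]]), anchor_t=2.0))
```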

glenn-jocher commented 4 years ago

> @glenn-jocher I followed your idea and set hyp['anchor_t'] = 3.0. Will it work?

Yes, any value will work here; you just need to experiment with what produces the best mAP. If you lower these values, though, it would also make sense to adjust the wh equations. For a 3.0 limit you might adjust the equation as follows to fully capture the output space (2**1.6 ≈ 3):

y[..., 2:4] = (y[..., 2:4] * 2) ** 1.6 * self.anchor_grid[i]  # wh
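
As a quick sanity check on that exponent choice (under the (2·sigmoid(x))**e form discussed above), the maximum anchor multiple is 2**e, so the exponent for any desired limit is just log2(limit):

```python
import math

def wh_exponent(max_multiple):
    # (2*sigmoid(x))**e peaks at 2**e, so e = log2(desired max anchor multiple)
    return math.log2(max_multiple)

print(wh_exponent(4.0))  # 2.0    -> default 0-4x range
print(wh_exponent(3.0))  # ~1.585 -> the ~1.6 suggested above for anchor_t = 3.0
print(wh_exponent(2.0))  # 1.0    -> the 0-2x variant discussed earlier
```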

TaoXieSZ commented 4 years ago

@glenn-jocher For now I am thinking about whether I can tune it to perform well on my datasets, where there are lots of overlapping and medium-sized objects: [image: sample from the dataset]

Can I consider that decreasing this parameter (2, 1.73, ...) also limits the size of the output bounding boxes?

glenn-jocher commented 4 years ago

You should look at your labels.png to see your size distribution. Yes, changing the exponent in the box equations from 2.0 to 1.6 will limit your output space from 0-4 to 0-3. This would presumably be paired with an increase in anchor count, otherwise recall would suffer.

TaoXieSZ commented 4 years ago

@glenn-jocher Here: [image: labels.png]

TaoXieSZ commented 4 years ago

And here's another dataset: [image: labels.png]

glenn-jocher commented 4 years ago

@ChristopherSTAN yes these look pretty typical. You have some very large class imbalances as well. Or wait, it looks like your bar chart is plotted incorrectly, as there are 15 bins but it only goes up to 13. Looks like a plotting bug.

TODO: Fix labels.png bar chart.

glenn-jocher commented 4 years ago

Pushed a commit https://github.com/ultralytics/yolov5/commit/4ffd9779d378392f51321bd41dc88df487d4069b for improved plotting. No bug found in current plotting.

TaoXieSZ commented 4 years ago

Hi, dear Glenn,

I think it is a good time for your team to formalize your work, write it up as a paper, and SHOCK the world. It is really interesting to read your code.

glenn-jocher commented 4 years ago

@ChristopherSTAN haha, yes we do need to produce a publication, but we are still exploring design changes. Hopefully around the end of the year we can send something to arXiv.

glenn-jocher commented 4 years ago

@ChristopherSTAN I have an idea: you could try modifying the L24 activation function in the Conv() layer from LeakyReLU(0.1) to Swish() or Mish() to see if this helps wheat training. I've never tried this, but it may be possible to still start from pretrained weights when you do this:

https://github.com/ultralytics/yolov5/blob/5e970d45c44fff11d1eb29bfc21bed9553abf986/models/common.py#L18-L31

EDIT: You'll have to reduce your batch size, as these will consume much more GPU RAM during training.

TaoXieSZ commented 4 years ago

@glenn-jocher Interesting! I will try it later. Now I am considering using the COCO dataset to increase training data by extracting intersecting classes. I think it will be a great trick for improving performance on custom datasets. If it works, I will open a PR to see if you are interested.

Edit: I plan to upload some scripts. I am not sure what to name this operation. Maybe we can call it "enriching data" or something else.

BTW, I am using EfficientDet for the Wheat competition. But I am using yolov5 on two different datasets.

TaoXieSZ commented 4 years ago

@glenn-jocher That's my approach: [image: dataset mixing diagram] Here I have a small dataset with 3,600 images, but by extracting data from COCO we can have more than 30K. I am curious how much it will affect results.

TaoXieSZ commented 4 years ago

@glenn-jocher Now I understand your feeling when training on COCO. I just used yolov5m with nearly 40K training images, and it takes me 35 min to run an epoch...

glenn-jocher commented 4 years ago

@ChristopherSTAN intersecting classes, that's a good term. Yes this would be very useful. OpenImages V5/6 have a lot of intersecting classes with coco.

Yes, COCO can be very slow to train on unfortunately.

dlawrences commented 4 years ago

> @glenn-jocher That's my approach: [image: dataset mixing diagram] Here I have a small dataset with 3,600 images, but by extracting data from COCO we can have more than 30K. I am curious how much it will affect results.

I would point out that this is not something you want to do in the long run, depending on the actual images of your own dataset. The COCO dataset may help the model generalise on the objects, but usually the test dataset, and the real world on which you are going to use your trained model, will have their own specifics.

For the problems I am solving, I have also used the COCO dataset for the specific classes I am training. However, I am also reducing the number of COCO images in my dataset once I have a new batch of real images annotated. And, obviously, one thing you need to make sure is not happening is having any COCO images in your val/test set if these do not match your actual real scenarios. This can screw up your model evaluation pretty badly.

TaoXieSZ commented 4 years ago

@dlawrences Thanks for your suggestions! It is my first time adding COCO images to my train set. And I have a similar thought about the test set as yours: I do not add extra images to the val set, because I still want the test set and dev set to have the same distribution.

Thanks again!

TaoXieSZ commented 4 years ago

@glenn-jocher I plan to try what this pro said: https://github.com/ultralytics/yolov3/issues/1098#issuecomment-663219984

Try Leaky ReLU first and then Mish.

glenn-jocher commented 4 years ago

@ChristopherSTAN ok! I think it's a good idea, because yolov3/4 demonstrated improved mAP with Mish and Swish; it's just that using them for COCO training was next to impossible. Fine-tuning smaller datasets with them may be much more within reach, however.

TaoXieSZ commented 4 years ago

@glenn-jocher And BTW, have you heard about Group Sampling?

Paper: https://openaccess.thecvf.com/content_CVPR_2019/papers/Ming_Group_Sampling_for_Scale_Invariant_Face_Detection_CVPR_2019_paper.pdf

It seems it can improve detection performance with a simple implementation.

TaoXieSZ commented 4 years ago

@glenn-jocher I do it like this:

class Conv(nn.Module):
    # Standard convolution
    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, act=True):  # ch_in, ch_out, kernel, stride, padding, groups
        super(Conv, self).__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, autopad(k, p), groups=g, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        #self.act = nn.LeakyReLU(0.1, inplace=True) if act else nn.Identity()
        self.act = Mish() if act else nn.Identity()

But it easily runs out of CUDA memory...

So I use your implementation:

self.act = MemoryEfficientMish() if act else nn.Identity()

Still, we need to cut the batch size in half (8 -> 4) with yolov5x.
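
For reference, Mish is just x·tanh(softplus(x)). If your copy of the repo doesn't already provide a Mish class to import, a minimal stand-in looks like this (the memory-efficient variant mentioned above trades extra compute for lower activation memory via a custom autograd Function):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Mish(nn.Module):
    # Mish activation: x * tanh(softplus(x))  (Misra, 2019)
    def forward(self, x):
        return x * torch.tanh(F.softplus(x))
```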

glenn-jocher commented 4 years ago

@ChristopherSTAN hmm yes, the memory requirements are pretty terrible. Swish might be a happy middle ground between ReLU and Mish, as it is also a smooth function like Mish yet requires a lot less memory for training.

Oh, BTW, I just released v2.0, which has a few model and training improvements. YOLOv5x now scores 49.0 on COCO, up from 48.4 before. I'm not sure if this change will affect wheat training positively or negatively though. v2.0 includes breaking changes, as the models are constructed in a simpler way now, so you would need to clone a fresh copy of the repo if you want to start using it.

TaoXieSZ commented 4 years ago

@glenn-jocher I will start trying it today! I haven't updated the repo in a long time. I think my ideas have reached a bottleneck. It is time to move on and learn more.

TaoXieSZ commented 4 years ago

@glenn-jocher Bravo! I first trained yolov5x on the mixed dataset (30K from COCO + 3K from a small dataset) for nearly 50 epochs, then trained 150 epochs on only the 3K dataset. It gives me 0.67 -> 0.71 mAP on the test set!

glenn-jocher commented 4 years ago

@ChristopherSTAN oh, that's a big jump! What was the increase due to? The COCO pretraining? Mish/Swish was also mentioned above, or perhaps you used your intersecting classes idea?

TaoXieSZ commented 4 years ago

I don't see much improvement with Mish/Swish. The story is that, as you know, I am only using Colab, and my notebook got disconnected a few days ago. So I was annoyed and resumed training without the COCO dataset, and observed a great improvement on the val set.

Then I thought about it a bit: because data are so important to deep learning models, the external data improved the model's modeling ability, so when training on the original custom dataset we see a great improvement.

Notably, this is the single model on fold 0, but it outperforms my ensemble of models on 5 folds.

With this observation, I will keep the pretrained model, resume it with k-fold, and then ensemble.

So, at last, thanks a lot for your great repo and hard work on the COCO dataset.

github-actions[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

abhiagwl4262 commented 3 years ago

@ChristopherSTAN @glenn-jocher Can anyone explain why setting the target confidence = 1.0 hurts the accuracy? And why does the equation below give better accuracy?

tobj[b, a, gj, gi] = (1.0 - model.gr) + model.gr * giou.detach().clamp(0).type(tobj.dtype) # giou ratio

glenn-jocher commented 3 years ago

@abhiagwl4262 you may want to experiment both ways. The current implementation sets object confidence to observed iou.
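
To spell out what that line does (a small sketch based on the equation quoted above): model.gr linearly blends the objectness target between a constant 1.0 (gr = 0) and the observed IoU (gr = 1):

```python
def objectness_target(iou, gr=1.0):
    # Mirrors: tobj[...] = (1.0 - model.gr) + model.gr * giou.clamp(0)
    return (1.0 - gr) + gr * max(iou, 0.0)

print(objectness_target(0.8, gr=0.0))  # 1.0 -> plain "confidence = 1" target
print(objectness_target(0.8, gr=1.0))  # 0.8 -> target equals the box quality (IoU)
print(objectness_target(0.8, gr=0.5))  # 0.9 -> halfway blend
```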

abhiagwl4262 commented 3 years ago

@glenn-jocher I actually experimented and found that setting the target confidence to 1.0 gives me a significant drop in accuracy. Do you have any intuition behind this observation?

glenn-jocher commented 3 years ago

@abhiagwl4262 the intention with the current implementation is to assist NMS in reducing lower quality boxes.

abhiagwl4262 commented 3 years ago

@glenn-jocher The predicted box can be 0-4.0 times the anchor. You basically have an upper bound of 4.0 and a lower bound of 1/4.0 on the anchor-GT ratio. Why are you applying a lower bound? What is the significance of that?

glenn-jocher commented 3 years ago

@abhiagwl4262 the matching algorithm is attempting to match targets with suitable anchors. The matches should be neither too large, nor too small, so we use upper and lower bounds on the ratio to achieve this. Without the lower bounds, all anchors would match with small objects (we only want the small anchors to match with small objects).

abhiagwl4262 commented 3 years ago

@glenn-jocher Can you give a little idea of how you chose the values for YOLO layer loss balancing: [4.0 for the small-object layer, 1.0 for the medium-object layer, 0.4 for the large-object layer]?

glenn-jocher commented 3 years ago

empirical results

violet17 commented 3 years ago

> The original yolo/darknet box equations have a serious flaw. Width and Height are completely unbounded as they are simply out=exp(in), which is dangerous, as it can lead to runaway gradients, instabilities, NaN losses and ultimately a complete loss of training. yolov3 suffers from this problem as well as yolov4.
>
> For yolov5 I made sure to patch this error by sigmoiding all model outputs, while also ensuring that the centerpoint remained unchanged 1=fcn(0), so nominal zero outputs from the model would cause the nominal anchor size to be used. The current eqn constrains anchor multiples from a minimum of 0 to a maximum of 4, and the anchor-target matching has also been updated to be width-height multiple based, with a nominal upper threshold hyperparameter of 4.0.

Thanks for the great repo! Can I ask some basic questions about the width-height method?

  1. Why multiply σ(x) by 2 and subtract 0.5 in the x,y coords of the bounding box? Should 2σ(x)-0.5 be in the range [-0.5, 0.5]?

  2. Why multiply σ(x) by 2 in the width/height of the bounding box?

  3. Why should the centerpoint be 1, i.e. f(0) = 1?

Thanks!

glenn-jocher commented 3 years ago

@violet17 the equation and intersection points were chosen for stability and for their suitability in replacing the unstable yolov3/v4 wh method.

joelcma commented 3 years ago

@glenn-jocher But what is the purpose of offsetting by -0.5? If you are expanding the output space from 0-1 to 0-2 and you offset by -0.5 the mid-point will be equal to 0.5 and not 0. So given x = 0 you get 0.5 output. Wouldn't it be more logical to offset by -1 to get 0 when given 0?

With the current formulation, if the network predicts t_x = 0 and the cell offset is 0.5, then the output will be 1, while intuitively it seems like it should perhaps be 0.5, the 0 value of the cell. Perhaps I misunderstand?

glenn-jocher commented 3 years ago

@joelcma you want a reference input to create a reference output for stability and ease of training. The average input (due to batchnorm) will be zero, and the average object will be in the middle (i.e. at 0.5) of a grid cell. We expand the output space to allow for predictions near 0 and 1 without stressing the sigmoid inputs to extremes.
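
A quick numeric illustration of that design (a sketch of the xy term discussed above): with 2·sigmoid(t) − 0.5, a zero input lands at the cell centre (0.5), and the cell edges 0 and 1 are reached at t = ∓ln 3 ≈ ∓1.1 rather than at sigmoid saturation:

```python
import torch

t = torch.tensor([-10.0, -1.0986, 0.0, 1.0986, 10.0])  # ln(3) ≈ 1.0986
print(t.sigmoid() * 2.0 - 0.5)  # ≈ [-0.5, 0.0, 0.5, 1.0, 1.5]
```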

joelcma commented 3 years ago

@glenn-jocher Thank you for taking the time to answer! And sorry because I have another question :D So what is the benefit of using a sigmoid over a bounded relu in this case?

glenn-jocher commented 3 years ago

@joelcma the benefit of any model architecture selection would be driven by empirical results, i.e. 'it works better'.

dhiman10 commented 1 year ago

I have tried exporting it with: !python export.py --weights /content/drive/MyDrive/best.pt --include "coreml"

Does anyone know how I can convert it correctly and get the bounding boxes, scores, and other outputs?

glenn-jocher commented 11 months ago

@dhiman10 after running export.py, you can get the bounding boxes, scores, etc. by using the CoreML framework to load the exported model and perform inference with it. You can refer to the CoreML documentation or examples for guidance on how to do this.
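
In case it helps, here is a minimal sketch of that post-export step with coremltools on a Mac. The model path and the "image" input name are assumptions; inspect model.get_spec() to see the names your export actually produced, and note that whether you get raw prediction tensors or decoded boxes/scores depends on whether NMS was baked into the exported model:

```python
import coremltools as ct
from PIL import Image

model = ct.models.MLModel("best.mlmodel")      # path is an example

# CoreML image inputs are passed as PIL images at the model's expected size
img = Image.open("test.jpg").resize((640, 640))

out = model.predict({"image": img})            # input name may differ in your export
print(out.keys())                              # raw tensors, or "confidence"/"coordinates" if NMS was added
```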