ultralytics / yolov3

YOLOv3 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

CSPResNeXt50-PANet-SPP #698

Closed LukeAI closed 4 years ago

LukeAI commented 4 years ago

Does this repo support CSPResNeXt50-PANet-SPP? (https://github.com/WongKinYiu/CrossStagePartialNetworks/)

AlexeyAB's support: https://github.com/AlexeyAB/darknet/issues/4406

My tests have found it to be a clear winner over yolov3-spp in terms of mAP and speed.

glenn-jocher commented 4 years ago

@clw5180 I'm not sure what the cause of the discrepancy is. It could be differences in the group convolutions as @WongKinYiu mentioned. Note that yolov3-spp.cfg trains to much higher mAP with this repo than with darknet, so actual technical problems are very unlikely. See https://github.com/ultralytics/yolov3#map

glenn-jocher commented 4 years ago

@isgursoy @WongKinYiu @clw5180 @hwijune @Spectra456 the current status as far as I know is that there is a slight difference in implementing some operations in the csresnext50-panet-spp.cfg file in this repo compared to darknet, such that simply running the training command below fails:

python3 train.py --cfg csresnext50-panet-spp.cfg

The fix is essentially described here: https://github.com/ultralytics/yolov3/issues/698#issuecomment-570441779, I just need to implement and push it. I'll try to get this done in the next couple days, and then the next step would be to verify the cfg functionality by comparing mAP here using test.py.

Once that's done we can try to train from scratch and perhaps look at balancing the 3 losses or evolving the hyperparameters for this particular cfg. But yes it's a bit frustrating and a mystery why the cfg trains so much higher on darknet at the moment.

AlexeyAB commented 4 years ago

Darknet implements grouped convolutions in the same way as the NVIDIA cuDNN library, so it should be the same as in PyTorch.

Hwijune commented 4 years ago

hi @WongKinYiu

original yolov3 mask order:
[yolo] 6,7,8
[yolo] 3,4,5
[yolo] 0,1,2

cspnet mask order:
[yolo] 0,1,2
[yolo] 3,4,5
[yolo] 6,7,8

Is there any difference?

WongKinYiu commented 4 years ago

No, there is no difference. It is because the order of the pyramid scales in FPN and PANet is different.

image

Hwijune commented 4 years ago

> No, there is no difference. It is because the order of the pyramid scales in FPN and PANet is different. image

can't change the order, right?

[yolo] 0,1,2 [yolo] 3,4,5 [yolo] 6,7,8 >>>>> [yolo] 6,7,8 [yolo] 3,4,5 [yolo] 0,1,2

WongKinYiu commented 4 years ago

Yes, because the anchor size should match the grid size.
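
The point about mask order can be sketched in a few lines of Python. This is only an illustration, assuming the standard YOLOv3 COCO anchors and output strides of 8, 16 and 32; `MASK_FOR_STRIDE`, `masks_for_heads` and `anchors_for_stride` are hypothetical names, not functions from either repo. Each head keeps the anchors that fit its grid, so only the order in which the heads appear in the cfg changes.

```python
# Standard YOLOv3 COCO anchors (width, height) in pixels, sorted small to large.
ANCHORS = [(10, 13), (16, 30), (33, 23),
           (30, 61), (62, 45), (59, 119),
           (116, 90), (156, 198), (373, 326)]

# Assumed pairing: a finer grid (smaller stride) takes the smaller anchors.
MASK_FOR_STRIDE = {8: (0, 1, 2), 16: (3, 4, 5), 32: (6, 7, 8)}


def masks_for_heads(head_strides):
    """Return the anchor mask of each [yolo] head, in cfg order."""
    return [MASK_FOR_STRIDE[s] for s in head_strides]


def anchors_for_stride(stride):
    """Return the anchors used by the head that predicts at this stride."""
    return [ANCHORS[i] for i in MASK_FOR_STRIDE[stride]]


fpn_order = masks_for_heads([32, 16, 8])    # FPN: coarse grid head comes first
panet_order = masks_for_heads([8, 16, 32])  # PANet: fine grid head comes first

print(fpn_order)    # [(6, 7, 8), (3, 4, 5), (0, 1, 2)]
print(panet_order)  # [(0, 1, 2), (3, 4, 5), (6, 7, 8)]
```

In both orders the stride-8 head still gets the three smallest anchors, which is why the masks cannot simply be swapped between heads.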

glenn-jocher commented 4 years ago

@WongKinYiu I see in the https://github.com/ultralytics/yolov3/issues/698#issuecomment-585209887 image YOLOv3 corresponds to the FPN architecture (with 4 output layers), with the last output for the smallest objects. There are basically two steps: downsample, then upsample (with crosslinks).

In the PANet example, are there 3 steps? downsample, upsample, downsample (with crosslinks from step 2 to 3)? Does this improve the mAP typically at the expense of more weights/computation?

WongKinYiu commented 4 years ago

@glenn-jocher Hello,

typically yes.

But there are many different methods that can be used to avoid that, for example BiFPN.

image image

glenn-jocher commented 4 years ago

@WongKinYiu ah very interesting! Figure 2 shows a good summary of the differences. Have you tried to create a *.cfg for efficientnet, or for a BiFPN type network? The results on COCO seem to show substantial improvement over what we are doing.

Screen Shot 2020-02-12 at 6 22 03 PM
WongKinYiu commented 4 years ago

@glenn-jocher Hello,

I have not built such a cfg file, but someone else has: https://github.com/AlexeyAB/darknet/issues/4662

glenn-jocher commented 4 years ago

@WongKinYiu I see. Have you tried the 'Simplified PANet' that they show with CSPResNeXt50-PANet-SPP?

I did a brief search online for EfficientDet implementations but I could not find any good ones. The paper does not supply code, and 3rd party implementations don't show very good or reliable mAPs.

Would you be interested in trying to implement a BiFPN network?

glenn-jocher commented 4 years ago

@WongKinYiu ah I had another question. Why are the group convolutions necessary in CSPResNeXt50-PANet-SPP?

Have you tried using the basic Conv2d() instead, and were you able to determine performance improvements when moving from the basic convolutions to the group convolutions?

AlexeyAB commented 4 years ago

CSPResNeXt50 has too many filters (outputs), so without groups it would take a very large amount of memory and you would have to decrease the mini_batch size significantly. So it is better to use groups=4...16: https://github.com/WongKinYiu/CrossStagePartialNetworks/issues/6#issuecomment-584406057

WongKinYiu commented 4 years ago

If there is no group convolution, it is a CSPResNet50-PANet-SPP.

WongKinYiu commented 4 years ago

@AlexeyAB @glenn-jocher

Hello, I think the BiFPN implemented in darknet is good enough. csdarknet53-panet-spp-bifpn.txt

| model | size | ap | ap50 | ap75 |
| --- | --- | --- | --- | --- |
| CSPDarknet53-BiFPN | 512x512 | 38.4 | 62.3 | 41.3 |

AlexeyAB commented 4 years ago

@WongKinYiu Hi,

But even BiFPN (optimal) is worse than PANet (not optimal), while the optimal settings should give about +4.4% extra AP: https://github.com/WongKinYiu/CrossStagePartialNetworks#gpu-real-time-models

| model | size | ap | ap50 | ap75 |
| --- | --- | --- | --- | --- |
| CSPDarknet53-BiFPN (optimal) | 512x512 | 38.4 | 62.3 | 41.3 |
| CSPDarknet53-PANet-SPP (not optimal) | 512x512 | 38.7 | 61.3 | 41.7 |

WongKinYiu commented 4 years ago

@AlexeyAB Hello,

The anchor size of CSPDarknet53-BiFPN is not optimized because my GPU RAM is insufficient to train with the same settings as CSPResNeXt50-PANet-SPP (optimal).

AlexeyAB commented 4 years ago

@WongKinYiu

What do you mean? Memory consumption doesn't depend on anchor size.

Do you mean that you trained?

Or did you train CSPResNeXt50-PANet-SPP (optimal) - with width=416 height=416 ?

WongKinYiu commented 4 years ago

the anchor size of CSPResNeXt50-PANet-SPP is designed for 416x416. (trained with width=416 height=416)

the anchor size of CSPResNeXt50-PANet-SPP (optimal) is optimized for 512x512. (trained with width=512 height=512)

https://github.com/ultralytics/yolov3/issues/698#issuecomment-586271292 (trained with width=416 height=416 because there is not enough memory to train with width=512 height=512)

AlexeyAB commented 4 years ago

@WongKinYiu Thanks! So you trained CSPDarknet53 with lower network resolution than CSPResNext50.

But the table compares two CSPDarknet53 models, not CSPResNext50:

| model | size | ap | ap50 | ap75 |
| --- | --- | --- | --- | --- |
| CSPDarknet53-BiFPN (optimal) | 512x512 | 38.4 | 62.3 | 41.3 |
| CSPDarknet53-PANet-SPP (not optimal) | 512x512 | 38.7 | 61.3 | 41.7 |

Are both these models trained with width=416 height=416 subdivisions=16 ?

Or as I see:

WongKinYiu commented 4 years ago

both of these two models are trained with width=416 height=416. the setting of CSPDarknet53-BiFPN (optimal) is as you see. i am not sure about the subdivision of CSPDarknet53-PANet-SPP (not optimal), but yes mosaic=0.

in https://github.com/WongKinYiu/CrossStagePartialNetworks#gpu-real-time-models CSPDarknet53-PANet-SPP (not optimal) and CSPResNet50-PANet-SPP (not optimal) are not trained by myself.

AlexeyAB commented 4 years ago

@WongKinYiu

> both of these two models are trained with width=416 height=416.

So from this table we can't say which is better, BiFPN or PAN?

| model | size | ap | ap50 | ap75 |
| --- | --- | --- | --- | --- |
| CSPDarknet53 BiFPN (optimal) trained 416x416 subdivisions=16 | 512x512 | 38.4 | 62.3 | 41.3 |
| CSPDarknet53 PANet-SPP (not optimal) trained 416x416 subdivisions=4 or 8 or 16 | 512x512 | 38.7 | 61.3 | 41.7 |

WongKinYiu commented 4 years ago

currently 245k epoch, 10.5 loss.

WongKinYiu commented 4 years ago

@glenn-jocher @AlexeyAB update

| Model | Size | AP | AP50 | AP75 |
| --- | --- | --- | --- | --- |
| CSPDarknet53 BiFPN (optimal) trained 416x416 subdivisions=16 | 512x512 | 38.4 | 62.3 | 41.3 |
| CSPDarknet53 PANet-SPP (optimal) trained 416x416 subdivisions=16 | 512x512 | 41.6 | 64.1 | 45.0 |

AlexeyAB commented 4 years ago

@WongKinYiu @glenn-jocher So the previous version of BiFPN is bad. Try the new BiFPN version: https://github.com/AlexeyAB/darknet/issues/4662#issuecomment-587490873

glenn-jocher commented 4 years ago

> @glenn-jocher @AlexeyAB update
>
> | Model | Size | AP | AP50 | AP75 |
> | --- | --- | --- | --- | --- |
> | CSPDarknet53 BiFPN (optimal) trained 416x416 subdivisions=16 | 512x512 | 38.4 | 62.3 | 41.3 |
> | CSPDarknet53 PANet-SPP (optimal) trained 416x416 subdivisions=16 | 512x512 | 41.6 | 64.1 | 45.0 |

@WongKinYiu wow great! What's the difference between the not-optimal and optimal versions of CSPDarknet53 PANet-SPP? The optimal version shows +3 mAP improvement, what differences did you make to get this?

WongKinYiu commented 4 years ago

not-optimal: all hyper-parameters are the same as the default yolov3.

optimal: with ciou and your genetic algorithm, mosaic augmentation, scale sensitivity, iou threshold. (see [net] and [yolo] in the cfg file https://github.com/ultralytics/yolov3/issues/698#issuecomment-586271292)

AlexeyAB commented 4 years ago

@glenn-jocher @WongKinYiu

Why does CSPDarknet53s-PANet-SPP (Ultralytics) have lower AP than CSPDarknet53 PANet-SPP (Darknet)?

| Model | Size | AP | AP50 | AP75 | URL | cfg |
| --- | --- | --- | --- | --- | --- | --- |
| YOLOv3-SPP (baseline) Ultralytics (optimal) trained 416x416 -batch=16 | 512x512 | 39.7 | 60.5 | 42.2 | url | cfg |
| CSPDarknet53s-PANet-SPP Ultralytics (optimal) trained 416x416 -batch=16 | 512x512 | 40.0 | 60.4 | 42.9 | url | cfg |
| CSPDarknet53 PANet-SPP Darknet (optimal) trained 416x416 subdivisions=16 | 512x512 | 41.6 | 64.1 | 45.0 | url | cfg |

Both use:

The only differences are:

  1. Darknet uses pre-trained classifier weights, while Ultralytics doesn't
  2. Darknet uses CIoU loss while Ultralytics uses GIoU loss?

What am I missing?

glenn-jocher commented 4 years ago

@AlexeyAB I don't know, this is a very good question. The gap is very large in mAP. I think what I should do is try to test mAP with CSPDarknet53 PANet-SPP Darknet first, to establish that the cfg loads the model correctly. I'll do that today.

Yes it is true I don't use any pretrained weights (I saw slightly worse results with darknet53.conv.74). I tried CIoU loss and did not see any added benefit compared to GIoU.

glenn-jocher commented 4 years ago

I used the linked urls and weights, and tested at 512 on my own with the following commands. Results are slightly higher than the earlier table. I was not able to test the last one, as there were new cfg entries it did not recognize. I will comment these and try again.

git clone https://github.com/ultralytics/yolov3
cd yolov3
python3 test.py --img 512 --weights ... --cfg ...

| Model | Size | AP | AP50 | AP75 | URL | cfg |
| --- | --- | --- | --- | --- | --- | --- |
| YOLOv3-SPP (baseline) Ultralytics (optimal) trained 416x416 -batch=16 | 512x512 | 40.2 | 61.3 | - | url | cfg |
| CSPDarknet53s-PANet-SPP Ultralytics (optimal) trained 416x416 -batch=16 | 512x512 | 40.7 | 60.7 | - | url | cfg |
| CSPDarknet53 PANet-SPP Darknet (optimal) trained 416x416 subdivisions=16 | 512x512 | - | - | - | url | cfg |

WongKinYiu commented 4 years ago

I am on a business trip; I will provide some training info for YOLOv3-SPP (baseline) Ultralytics and CSPDarknet53s-PANet-SPP Ultralytics after I am back in the office.

glenn-jocher commented 4 years ago

@WongKinYiu ok great! I got the last darknet model to run, but mAPs came back as 0.0. Note that I modified my default test nms --iou-thres from 0.5 to 0.6, as this produces a better balance of mAP@0.5:0.95 (best at --iou-thres 0.7) and mAP@0.5 (best at --iou-thres 0.5).
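
To make the role of the NMS threshold concrete, here is a minimal pure-Python sketch of greedy, single-class NMS. It is a simplified illustration, not the repo's implementation (which works on batched tensors and, as discussed in this thread, on multiple labels); the `iou` and `nms` helpers and the two example boxes are made up for the example.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(dets, iou_thres):
    """Greedy NMS over detections given as (x1, y1, x2, y2, score)."""
    keep = []
    # Visit boxes from highest to lowest score; drop any box that overlaps
    # an already kept box by more than iou_thres.
    for d in sorted(dets, key=lambda d: d[4], reverse=True):
        if all(iou(d[:4], k[:4]) <= iou_thres for k in keep):
            keep.append(d)
    return keep


# Two overlapping boxes with IoU = 81 / 119, roughly 0.68.
dets = [(0, 0, 10, 10, 0.9), (1, 1, 11, 11, 0.8)]
print(len(nms(dets, 0.6)))  # 1: the lower-scoring box is suppressed
print(len(nms(dets, 0.7)))  # 2: both boxes survive
```

A higher threshold keeps more overlapping boxes, which is the trade-off behind the different best values for mAP@0.5 and mAP@0.5:0.95 mentioned above.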

Also note the latest yolov3-spp.cfg baseline trains to 41.9/61.8 at 608 with the default settings. The training commands to reproduce this are here. The two separate --img-size values are the train img-size and the test img-size. Multi-scale train img sizes using this command will be 288 - 640.

python3 train.py --data coco2014.data --img-size 416 608 --epochs 273 --batch 16 --accum 4 --weights '' --device 0 --cfg yolov3-spp.cfg --multi
WongKinYiu commented 4 years ago

@glenn-jocher

> Note that I modified my default test nms --iou-thres from 0.5 to 0.6, as this produces a better balance of mAP@0.5:0.95 (best at --iou-thres 0.7) and mAP@0.5 (best at --iou-thres 0.5).

Yes, I know. However, for the competition, we should use the same IoU threshold for both mAP@0.5:0.95 and mAP@0.5.

> Also note the latest yolov3-spp.cfg baseline trains to 41.9/61.8 with the default settings. The training commands to reproduce this are here. The two separate --img-size values are the train img-size and the test img-size. Multi-scale train img sizes using this command will be 288 - 640.

Thanks, I just used the default settings of the repo version that I trained the model with. As I remember, that version gets about 40.9 mAP@0.5:0.95 in your report. By the way, all of my results are obtained on the test-dev set and your results are obtained on the min-val set.

glenn-jocher commented 4 years ago

@WongKinYiu ah test-dev set could be a difference too then!

Well it seems some differences remain as the ultralytics repo can't load the best performing darknet CSPDarknet53s-PANet-SPP model then. These differences must be the source of the problem I think.

AlexeyAB commented 4 years ago

@glenn-jocher

> Also note the latest yolov3-spp.cfg baseline trains to 41.9/61.8 at 608 with the default settings.

What is the difference between your training and this yolov3-spp.cfg https://github.com/WongKinYiu/CrossStagePartialNetworks/tree/pytorch#ms-coco ? Why is there such a difference?

WongKinYiu commented 4 years ago

@AlexeyAB

I used this version of the repo to train: https://github.com/ultralytics/yolov3/tree/a6f87a28e7595e71752583fb41340f9d1105d75f

There have been many improvements to ultralytics since then.

AlexeyAB commented 4 years ago

@WongKinYiu @glenn-jocher So, I want to know what improvements have been made?

glenn-jocher commented 4 years ago

Hmmm well lots of small day to day changes. If I use the github /compare it doesn't show the date of that commit, but it shows that there are 400 commits since then, with many modifications: https://github.com/ultralytics/yolov3/compare/a6f87a28e7595e71752583fb41340f9d1105d75f...master#diff-04c6e90faac2675aa89e2176d2eec7d8

The README from then was showing 40.0/60.9 mAP, which is similar to what @WongKinYiu was seeing, vs today's README which shows 41.9/61.8.

The improvements are over many different parts, such as the NMS, which now uses multi-label, the augmentation, which has been set to zero, the loss function reduction, which I returned to mean() instead of sum(), the cosine scheduler implementation, the increase in the LR to 0.01 after cos was implemented, and maybe a few other tiny things. The architecture itself is the same (yolov3-spp.cfg).
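
For readers unfamiliar with the cosine scheduler mentioned here, a generic cosine decay can be sketched as follows. This is the textbook form, assuming the initial LR of 0.01 and the 273 epochs quoted in this thread; the exact expression used in train.py may differ (for example by keeping a nonzero final LR), and `cosine_lr` is a hypothetical name.

```python
import math


def cosine_lr(epoch, epochs=273, lr0=0.01):
    """Generic cosine decay from lr0 at epoch 0 down to 0 at the final epoch."""
    return lr0 * 0.5 * (1.0 + math.cos(math.pi * epoch / epochs))


for e in (0, 68, 136, 205, 273):
    print(e, round(cosine_lr(e), 5))
```

The LR falls slowly at the start and end of training and fastest in the middle, unlike step schedules that drop it abruptly at fixed epochs.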

Actually this is an important point. A lot of papers today are showing very outdated comparisons to YOLOv3, i.e. showing 33 mAP@0.5:0.95 like the EfficientDet paper, with a GPU latency of 51ms. The reality is the most recent YOLOv3-SPP model I trained is at 42.1 mAP@0.5:0.95, with a GPU latency of 12.8ms https://github.com/ultralytics/yolov3/issues/679#issuecomment-597219021, which puts it far better than their own D0-D2 models in both speed and mAP. I'm not sure how best to get that message out.

Screen Shot 2020-03-10 at 4 27 33 PM
AlexeyAB commented 4 years ago

@glenn-jocher So the main difference:

  1. NMS uses multi-label
  2. the augmentation, which has been set to zero - what does it mean, did you disable data augmentation?
  3. the loss function reduction, which I returned to mean() instead of sum() - are all the true-positive loss values averaged new_loss = sum_for_i( loss_obj, loss_cls, loss_bbox) / count ?
WongKinYiu commented 4 years ago

image

glenn-jocher commented 4 years ago

@AlexeyAB

Yes NMS uses multi-label now, which bumped up mAP about +0.3. Yes spatial augmentation seemed to hurt training, so I set it to zero, but left HSV augmentation on:

       'hsv_h': 0.0138,  # image HSV-Hue augmentation (fraction)
       'hsv_s': 0.678,  # image HSV-Saturation augmentation (fraction)
       'hsv_v': 0.36,  # image HSV-Value augmentation (fraction)
       'degrees': 1.98 * 0,  # image rotation (+/- deg)
       'translate': 0.05 * 0,  # image translation (+/- fraction)
       'scale': 0.05 * 0,  # image scale (+/- gain)
       'shear': 0.641 * 0}  # image shear (+/- deg)
  1. The loss is back to its original form, using the PyTorch defaults, which is for example for the 3 yolo layers: loss_giou = (giou_1.mean() + giou_2.mean() + giou_3.mean()).sum()

I'm really hoping we might be able to merge the YOLO outputs some day so I can do away with this uncertainty in how to combine the losses from the different layers. ASFF seems to be an interesting step in that direction.
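
The difference between the two reductions can be illustrated with plain Python lists. This is a toy sketch with made-up per-target loss values, not the repo's loss code: with mean() each yolo layer contributes equally regardless of how many targets it matched, while with sum() layers with more targets dominate.

```python
def mean(values):
    return sum(values) / len(values)


# Made-up per-target GIoU losses for the 3 yolo layers (2, 4 and 1 targets).
layer_losses = [[0.2, 0.4], [0.1, 0.3, 0.5, 0.7], [0.6]]

# mean() reduction per layer, then add the layers: each layer counts once.
loss_mean = sum(mean(layer) for layer in layer_losses)

# sum() reduction: every target counts once, so busy layers weigh more.
loss_sum = sum(sum(layer) for layer in layer_losses)

print(round(loss_mean, 3))  # 1.3
print(round(loss_sum, 3))   # 2.8
```

How the three per-layer terms should be weighted against each other is exactly the uncertainty mentioned above.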

glenn-jocher commented 4 years ago

@AlexeyAB ah also another change I forgot to mention was I changed multi-scale to change the resolution every batch now, instead of every 10 batches before. This seemed to smooth the results a bit, epoch to epoch.
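
As a rough sketch of per-batch multi-scale training, the snippet below draws a new square training size for every batch. It assumes sizes are multiples of 32 (the largest output stride) within the 288 - 640 range quoted earlier in this thread; `pick_img_size` is a hypothetical helper, not the code in train.py.

```python
import random


def pick_img_size(rng, lo=288, hi=640, stride=32):
    """Draw a random training size in [lo, hi] that is a multiple of stride."""
    return rng.randrange(lo // stride, hi // stride + 1) * stride


rng = random.Random(0)  # seeded only to make the sketch reproducible
sizes = [pick_img_size(rng) for _ in range(8)]  # one new size per batch
print(sizes)
```

Drawing a size every batch instead of every 10 batches means each epoch sees a more even mix of resolutions, which is one plausible reason for the smoother epoch-to-epoch results.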

glenn-jocher commented 4 years ago

@WongKinYiu yes they look super similar to each other unfortunately. I'm not sure why we aren't seeing the same gains as the darknet training. It must have to do with the grouped convolutions I think.

AlexeyAB commented 4 years ago

@glenn-jocher

> Yes NMS uses multi-label now, which bumped up mAP about +0.3.

Does it currently work like this? If there are 2 bboxes with IoU > iou_nms:

  1. class1_prob = 0.5, class2_prob = 0.7
  2. class1_prob = 0.7, class2_prob = 0.5

Then it will remove class1_prob = 0.5 and class2_prob = 0.5, and will leave:

  1. class2_prob = 0.7
  2. class1_prob = 0.7

> The loss is back to its original form, using the PyTorch defaults, which is for example for the 3 yolo layers: loss_giou = (giou_1.mean() + giou_2.mean() + giou_3.mean()).sum()

Do you know how this changes the Delta during auto-differentiation in Pytorch? Do you apply it only for x,y,w,h and not for probs and obj?


> Yes spatial augmentation seemed to hurt training, so I set it to zero, but left HSV augmentation on:

Yes, it may help to win the competition, but it may hurt cross-domain accuracy when the test images/videos are not similar to MS COCO.

It seems it works well because Ultralytics uses letter_box image resizing by default, so it keeps the aspect ratio and doesn't require large spatial image transformations. In Darknet we can try jitter=0.1 letter_box=1 instead of jitter=0.3 letter_box=0. I think the higher the network resolution, the more preferable it is to use jitter=0.1 letter_box=1.
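
Letterbox resizing can be sketched as follows: scale the image so that its longer side fits the network size, then pad the shorter side. This is a minimal illustration that assumes padding to a full square; `letterbox_shape` is a hypothetical helper and neither repo's actual function.

```python
def letterbox_shape(width, height, target=416):
    """Return (new_w, new_h, pad_x, pad_y) for an aspect-preserving resize.

    pad_x and pad_y are the padding added on each side (rounded down).
    """
    scale = min(target / width, target / height)
    new_w, new_h = round(width * scale), round(height * scale)
    pad_x = (target - new_w) // 2
    pad_y = (target - new_h) // 2
    return new_w, new_h, pad_x, pad_y


print(letterbox_shape(640, 480))  # (416, 312, 0, 52)
```

Because objects keep their aspect ratio, the network sees less geometric distortion than with a plain stretch to 416x416, which is consistent with needing less jitter.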

> I'm really hoping we might be able to merge the YOLO outputs some day so I can do away with this uncertainty in how to combine the losses from the different layers.

What do you mean?

> I changed multi-scale to change the resolution every batch now, instead of every 10 batches before. This seemed to smooth the results a bit, epoch to epoch.

Does it decrease training speed, since changing the network size takes time?

If we use dynamic_minibatch=1 in Darknet, we change width, height and mini_batch dynamically and must reallocate the GPU arrays for each layer, so it can decrease training speed by 2x-3x if we do it after each iteration.

AlexeyAB commented 4 years ago

@WongKinYiu

Have you checked if scale_x_y=1.1 increases AP95 accuracy, while it decreases AP50 and AP75 but keeps the same AP50...95? https://github.com/WongKinYiu/CrossStagePartialNetworks/blob/master/coco/results.md#mscoco


EfficientNetB0-Yolo was added to the OpenCV-dnn module

So it only remains to implement scale_x_y=1.1 in order to use csresnext50-panet-spp-original-optimal.cfg with OpenCV-dnn.
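
For context, scale_x_y is usually described as stretching the sigmoid output of the box-centre offsets around 0.5, so that centres on a cell border become reachable. The sketch below follows that description as an assumption rather than quoting the Darknet source; `decode_xy` is a hypothetical name.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def decode_xy(t, scale_x_y=1.0):
    """Centre offset inside a grid cell, stretched around 0.5 by scale_x_y."""
    return sigmoid(t) * scale_x_y - 0.5 * (scale_x_y - 1.0)


for t in (-50.0, 0.0, 50.0):
    print(t, round(decode_xy(t, 1.0), 3), round(decode_xy(t, 1.1), 3))
```

With scale_x_y=1.1 the offset range widens from (0, 1) to roughly (-0.05, 1.05), while an offset of 0.5 stays where it is.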

WongKinYiu commented 4 years ago

i have only done experiments for scale_x_y=1.05, scale_x_y=1.1, and scale_x_y=1.2 of different feature pyramids.

have u tested the inference speed of enetb0-yolo using opencv-dnn?

AlexeyAB commented 4 years ago

> have u tested the inference speed of enetb0-yolo using opencv-dnn?

Not yet. I will test it on an Intel CPU and the Intel Myriad X neurochip.

glenn-jocher commented 4 years ago

@AlexeyAB @WongKinYiu I made a simple Colab notebook to see the time effects of group/mix convolutions.

It times a tensor passing forward and backward (to mimic training) through a Conv2d() op. The speeds stay about the same even as the parameter count drops by >10X. So similar sized models using these ops may be much slower.

b=m(x), x=[16, 128, 38, 38], b=[16, 256, 38, 38]

    groups  time(ms)    params  shape m             
         1       5.1    294912  [256, 128, 3, 3]    
         2       4.2    147456  [256, 64, 3, 3]     
         4       4.2     73728  [256, 32, 3, 3]     
         8       4.9     36864  [256, 16, 3, 3]     
        16       6.9     18432  [256, 8, 3, 3]      
        32       6.1      9216  [256, 4, 3, 3]      
        64       2.6      4608  [256, 2, 3, 3]      
       128       2.0      2304  [256, 1, 3, 3]   
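
The params column above follows directly from the weight shape of a grouped convolution. A small sketch, assuming no bias term as in the table; `conv2d_weight_params` is a hypothetical helper:

```python
def conv2d_weight_params(c_in, c_out, k, groups=1):
    """Number of weights in a k x k conv: c_out * (c_in / groups) * k * k."""
    if c_in % groups or c_out % groups:
        raise ValueError("channels must be divisible by groups")
    return c_out * (c_in // groups) * k * k


for g in (1, 2, 4, 8, 16, 32, 64, 128):
    print(g, conv2d_weight_params(128, 256, 3, g))
```

The parameter count halves each time groups doubles, from 294912 at groups=1 down to 2304 at groups=128, while the measured times in the table change far less.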
AlexeyAB commented 4 years ago

@glenn-jocher Yes, NVIDIA cuDNN works in the same way. Also, the Google Coral Edge TPU neurochip doesn't use grouped convolutions, despite the fact that they advertise EfficientDet/Net with grouped convolutions. https://ai.googleblog.com/2019/08/efficientnet-edgetpu-creating.html