rwightman / efficientdet-pytorch

A PyTorch impl of EfficientDet faithful to the original Google impl w/ ported weights
Apache License 2.0

Training code #5

Closed betterhalfwzm closed 4 years ago

betterhalfwzm commented 4 years ago

Thank you very much for your project. When will the training code be released?

rwightman commented 4 years ago

Working on it, have some progress on a branch but have to take a break for a while. The goal is to have training that matches or beats the reference.

glenn-jocher commented 4 years ago

@rwightman I was reviewing the EfficientDet paper and was surprised by a training value I'd missed before: their weight decay is 4e-5 at batch size 128.

In ultralytics/yolov3 (and darknet as well), weight decay is about 10x larger, and I believe it's also applied more frequently since we use batch size 64 (weight decay is applied once per optimizer update, i.e. once every 64 images). This seems like quite a large discrepancy, so I thought I'd ping you to see if you had any ideas.
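For a rough sense of scale, here's my own back-of-the-envelope arithmetic (not from either paper; the COCO train size is approximate):

```python
# How much L2 decay each model "sees" per epoch, assuming decay is
# applied once per optimizer step (i.e. once per batch).
effdet_wd, effdet_bs = 4e-5, 128  # EfficientDet paper defaults
yolo_wd, yolo_bs = 4e-4, 64       # ultralytics/yolov3 defaults

n_images = 117_000  # ~COCO train2017, approximate

effdet_decay_per_epoch = effdet_wd * n_images / effdet_bs
yolo_decay_per_epoch = yolo_wd * n_images / yolo_bs
print(yolo_decay_per_epoch / effdet_decay_per_epoch)  # -> 20.0
```

One caveat: in plain SGD the decay term is also scaled by the learning rate, and EfficientDet's much higher LR (0.16 vs 0.01) pulls in the opposite direction.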

rwightman commented 4 years ago

@glenn-jocher I haven't had much time to run experiments on this yet; I've gotten busy recently. I did manage to get one decent D0 training result so far. It was without a proper mAP eval, so I was just selecting checkpoints from the loss. Managed .324 with a 10-checkpoint average. I was using the official training defaults, but limited to a batch size of 32 (16 x 2) with AMP. Training a D1 model on better cards right now, but still only 20 x 2 w/ AMP.
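For reference, checkpoint averaging can be done in a few lines of torch. A minimal sketch (not the exact script used here; assumes each file holds a plain, compatible state_dict):

```python
import torch

def average_checkpoints(paths):
    """Average model weights across several checkpoint files.
    Integer buffers (e.g. BN's num_batches_tracked) come out as floats
    here and would need special handling in real use."""
    avg = None
    for p in paths:
        sd = torch.load(p, map_location='cpu')
        if avg is None:
            avg = {k: v.clone().float() for k, v in sd.items()}
        else:
            for k, v in sd.items():
                avg[k] += v.float()
    return {k: v / len(paths) for k, v in avg.items()}
```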

I will say, the default hparams are right on the edge of stability. Especially with AMP enabled, the loss scaling takes a big dive before it recovers. In my first trials I was using SGD w/ nesterov and PyTorch's default BN momentum/eps. I had to change the BN momentum to match the TF impl and turn nesterov off for stability at the LR they used.
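For anyone hitting the same issue: PyTorch and TF define BN momentum in opposite directions, so matching the TF defaults looks roughly like this (a sketch; `convert_bn_to_tf_defaults` is an illustrative helper, not part of this repo):

```python
import torch.nn as nn

# TF:      running = momentum * running + (1 - momentum) * new  (momentum=0.99)
# PyTorch: running = (1 - momentum) * running + momentum * new
# => pytorch_momentum = 1 - tf_momentum
TF_BN_MOMENTUM, TF_BN_EPS = 0.99, 1e-3

def convert_bn_to_tf_defaults(model: nn.Module):
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            m.momentum = 1 - TF_BN_MOMENTUM  # 0.01 vs PyTorch's default 0.1
            m.eps = TF_BN_EPS                # 1e-3 vs PyTorch's default 1e-5
```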

I haven't played with the weight decay yet. I did notice that a lot of their defaults were based on the TPU RetinaNet impl, which uses 1e-4 and batch size 64 as the defaults, so I'm not sure how much effort they spent searching for something better.

Curious, with yolov4 out, are you going to bring that over to PT too? I was thinking of throwing some of the extras (CIoU, etc.) from there at this eventually...
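For concreteness, the CIoU term mentioned above (Zheng et al., 2020) can be sketched like this (illustrative only, not this repo's code; boxes assumed in xyxy format):

```python
import math
import torch

def ciou_loss(box1, box2, eps=1e-7):
    """Complete-IoU loss = 1 - IoU + center-distance penalty + aspect term."""
    # intersection area
    iw = (torch.min(box1[..., 2], box2[..., 2]) -
          torch.max(box1[..., 0], box2[..., 0])).clamp(0)
    ih = (torch.min(box1[..., 3], box2[..., 3]) -
          torch.max(box1[..., 1], box2[..., 1])).clamp(0)
    inter = iw * ih

    w1, h1 = box1[..., 2] - box1[..., 0], box1[..., 3] - box1[..., 1]
    w2, h2 = box2[..., 2] - box2[..., 0], box2[..., 3] - box2[..., 1]
    union = w1 * h1 + w2 * h2 - inter + eps
    iou = inter / union

    # squared distance between box centers
    rho2 = ((box1[..., 0] + box1[..., 2] - box2[..., 0] - box2[..., 2]) ** 2 +
            (box1[..., 1] + box1[..., 3] - box2[..., 1] - box2[..., 3]) ** 2) / 4

    # squared diagonal of the smallest enclosing box
    cw = torch.max(box1[..., 2], box2[..., 2]) - torch.min(box1[..., 0], box2[..., 0])
    ch = torch.max(box1[..., 3], box2[..., 3]) - torch.min(box1[..., 1], box2[..., 1])
    c2 = cw ** 2 + ch ** 2 + eps

    # aspect-ratio consistency term and its trade-off weight
    v = (4 / math.pi ** 2) * (torch.atan(w2 / (h2 + eps)) -
                              torch.atan(w1 / (h1 + eps))) ** 2
    with torch.no_grad():
        alpha = v / (1 - iou + v + eps)

    return 1 - iou + rho2 / c2 + alpha * v
```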

rwightman commented 4 years ago

@glenn-jocher Oh, I forgot to mention: the official TF impl looks like it uses EMA weight averaging by default. As you noticed when you tried it with yolo, it doesn't necessarily seem to be an improvement for object detection + cosine LR schedule; I observed the same for my D0 run. D1 is heading in that direction...
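For readers unfamiliar with the technique: EMA weight averaging keeps a shadow copy of the weights, updated after every optimizer step. A minimal sketch (illustrative only; decay values per the table below):

```python
import copy
import torch

class ModelEma:
    """Exponential moving average of model weights."""
    def __init__(self, model, decay=0.9998):
        self.ema = copy.deepcopy(model).eval()
        self.decay = decay
        for p in self.ema.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model):
        # blend each EMA tensor toward the current weights
        for ema_v, model_v in zip(self.ema.state_dict().values(),
                                  model.state_dict().values()):
            if ema_v.dtype.is_floating_point:
                ema_v.mul_(self.decay).add_(model_v, alpha=1 - self.decay)
            else:
                ema_v.copy_(model_v)  # e.g. BN's num_batches_tracked
```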

glenn-jocher commented 4 years ago

@rwightman ah yes, I'd also forgotten to mention their insanely high LR. I'll do my best to make a little side-by-side comparison here. The high LR must go hand in hand with focal loss; it's the main differentiator, I believe. I think focal loss may be instrumental for detection schemes that treat background as an additional class (like efficientdet). In yolov3 it does not seem to help, probably because we have separate objectness and classification BCE losses, which reduces the imbalance (a focal loss sketch follows the table below).

|  | efficientdet/D0 | ultralytics/yolov3 |
| --- | --- | --- |
| epochs | 300 | 300 |
| batch_size | 128 | 64 |
| optim | SGD | SGD+nesterov |
| initial_lr | 0.16 | 0.01 |
| momentum | 0.90 | 0.95 |
| weight_decay | 4e-5 | 4e-4 |
| EMA decay | 0.9998 | 0.9999 |
| BN momentum, eps | 0.99, 1e-3 | 0.97, 1e-4 |
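For reference, the focal loss under discussion (Lin et al., 2017), as a minimal sketch (illustrative; alpha=0.25 and gamma=1.5 are EfficientDet's published defaults):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=1.5):
    """Binary focal loss: down-weights easy, well-classified examples so the
    rare positives aren't swamped by easy background anchors."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction='none')
    p_t = p * targets + (1 - p) * (1 - targets)              # prob of true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # class balance
    loss = alpha_t * (1 - p_t) ** gamma * ce
    return loss.sum()  # usually normalized by the number of positive anchors
```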
glenn-jocher commented 4 years ago

@rwightman about yolov4, I'm pretty confused in general. They apply several methods in yolov4 that may be beneficial, a couple of which came from me (mosaic loader, scale_x_y) and which they were nice enough to mention in the acknowledgements section, but I'm not convinced by the new architecture. They report 43.5 AP in the paper, whereas I'm reporting 43.1 AP using the original yolov3-spp, and have trained yolov3-spp up to about 44.6 AP (unpublished, single-scale inference) using different anchors and larger multi-scale training.

The current ultralytics/yolov3 repo can train the new yolov4.cfg files (omitting a few of the extras I have not implemented). To get an apples-to-apples comparison of the architecture change alone, I ran yolov4.cfg without the extras (orange below), using the same yolov3-spp anchors and the same training settings, and I actually get worse performance than yolov3-spp (blue):

(results plot: training curves for yolov4.cfg without extras, orange, vs yolov3-spp, blue)

Before yolov4 was published I was working on incorporating a few changes into a new repo. I'm aiming to release it publicly by the end of this month, and hopefully it will exceed yolov4 in both AP and GPU latency, but as of now I'm seeing overtraining effects on my models halfway through training. I'm not sure how much of that is EMA-related. It's frustrating because some changes show positive effects for the first half of training, but then overfit around epochs 100-150 and end up with a lower final mAP. So I don't have an easy way to iterate quickly other than to parallelize trainings, as each one takes a week+.