rwightman / efficientdet-pytorch

A PyTorch impl of EfficientDet faithful to the original Google impl w/ ported weights
Apache License 2.0

Training time differences #193

Closed · Asifzm closed this issue 3 years ago

Asifzm commented 3 years ago

Hi, thank you for your great repo, I highly appreciate it. I have updated the efficientdet-pytorch repo from a previous (August 2020) version, and updated timm and torch (1.4 to 1.7). When running efficientdet_d1, I noticed the training loop takes about twice as long; more specifically, the backbone forward pass takes twice the time. I also noticed efficientnet_b1 now uses the SiLU activation instead of SwishMe as in the previous version. Could that be the main reason for the time difference, or is there something else? I work with one GPU, without parallelism, model EMA, or mixed precision.

Thank you!
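For reference, this is roughly how I time the backbone forward in isolation (a rough sketch; `tf_efficientnet_b1` from timm stands in for whatever backbone the config actually resolves to, and the batch/input sizes are illustrative):

```python
# Rough sketch: time only the backbone forward pass on GPU.
import time
import torch
import timm

backbone = timm.create_model('tf_efficientnet_b1', features_only=True).cuda().eval()
x = torch.randn(8, 3, 640, 640, device='cuda')

with torch.no_grad():
    for _ in range(10):                      # warmup iterations
        backbone(x)
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(50):
        backbone(x)
    torch.cuda.synchronize()
    print(f'avg forward: {(time.perf_counter() - t0) / 50 * 1e3:.1f} ms')
```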

rwightman commented 3 years ago

@Asifzm I'm moving this to discussions because it's not a bug

There have been quite a few changes to effdet and PyTorch in that timespan. I've noticed a number of hardware- and version-specific performance regressions in PyTorch, especially 1.7. I'd try different releases: the CUDA 10.2 variants of 1.7 will be closer to 1.4, while the 11.x builds may have performance issues specific to your exact card. You can also try 1.8 or the NGC containers; I usually train on NGC containers, and 20.12 and 21.02 both seem pretty good.
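As a quick sanity check of which build you're actually running (a small sketch, nothing repo-specific; the printed values will differ per install):

```python
# Print the PyTorch / CUDA / cuDNN combination in use, since regressions are
# often tied to the specific wheel and toolkit build.
import torch

print(torch.__version__)                 # e.g. 1.7.1+cu102 vs 1.7.1+cu110
print(torch.version.cuda)                # CUDA toolkit the wheel was built against
print(torch.backends.cudnn.version())    # cuDNN build
print(torch.cuda.get_device_name(0))     # the card being used
```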

In terms of this codebase, I made a number of changes over the summer that impact performance; some gained speed, while others traded a bit of speed for better loss stability and results.

You can try experimenting with --jit-loss to jit-script the loss fn for some speed gain, but it can blow up memory usage on the GPU.
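Conceptually that flag just wraps the loss module with torch.jit.script; a minimal sketch of the mechanism (the toy loss below is a stand-in, not the effdet loss):

```python
# Sketch: scripting a loss module so its graph is compiled by TorchScript.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyLoss(nn.Module):                    # stand-in, not the repo's loss
    def forward(self, pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        return F.smooth_l1_loss(pred, target)

loss_fn = ToyLoss()
scripted_loss_fn = torch.jit.script(loss_fn)     # same call interface, scripted graph
print(scripted_loss_fn(torch.randn(4, 10), torch.randn(4, 10)))
```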

You can also revert to the older loss fn with --legacy-focal. It has different throughput/memory behaviour, usually a bit faster, but it's a bit less numerically stable than the current one.
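For illustration only, here is a generic sigmoid focal loss written against logits (not the repo's implementation); computing the cross-entropy term with binary_cross_entropy_with_logits is the usual way to keep it numerically stable, and that kind of choice is where the stability/throughput trade-off comes from:

```python
# Generic sigmoid focal loss sketch (torchvision-style), shown only to
# illustrate the stability-vs-speed trade-off in focal loss implementations.
import torch
import torch.nn.functional as F

def sigmoid_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    prob = torch.sigmoid(logits)
    # log-sum-exp trick inside BCE-with-logits keeps this numerically stable
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction='none')
    p_t = prob * targets + (1 - prob) * (1 - targets)
    loss = ce * (1 - p_t) ** gamma
    if alpha >= 0:
        alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
        loss = alpha_t * loss
    return loss.sum()
```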

And finally, you can try --torchscript to train with the whole model + bench torchscripted; I often find this improves overall throughput.
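At a high level that means scripting the full module and calling it exactly as before; a minimal sketch (a torchvision resnet18 stands in here, the repo wraps its own model + bench):

```python
# Sketch: script an entire model and use it as a drop-in replacement for eager calls.
import torch
import torchvision

model = torchvision.models.resnet18()
scripted = torch.jit.script(model)               # compile the whole module
out = scripted(torch.randn(2, 3, 224, 224))      # called like the eager model
print(out.shape)                                 # torch.Size([2, 1000])
```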

The SiLU activation change should be an overall performance gain for PyTorch 1.7/1.8.
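Both activations compute the same function, x * sigmoid(x); the difference is that nn.SiLU dispatches to PyTorch's native op (available since 1.7), whereas older timm versions used a hand-written Swish (with a memory-efficient autograd variant). A small sketch showing they match numerically:

```python
# SiLU and Swish are the same function; only the implementation path differs.
import torch
import torch.nn as nn

x = torch.randn(4, 8)
silu = nn.SiLU()                          # native op in PyTorch >= 1.7
swish = x * torch.sigmoid(x)              # the equivalent "Swish" formulation
print(torch.allclose(silu(x), swish))     # True
```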