mapillary / inplace_abn

In-Place Activated BatchNorm for Memory-Optimized Training of DNNs
BSD 3-Clause "New" or "Revised" License

Slow Training Time for Segmentation Model #99

Open brucemuller opened 5 years ago

brucemuller commented 5 years ago

Dear Mapillary Developers,

I'm integrating your segmentation model into my PyTorch code base for my PhD research. The problem: it takes around 6 hours to train a single epoch on Vistas from scratch (on a single GTX 1080). Is this normal?

My feeling is that it shouldn't take this long, since other models (e.g. U-Net) train an epoch in well under an hour (about 15 min). With normal BN the code also seems to stall when sending tensors to the GPU; with InPlaceABN the stall moves from the transfer to the backward call instead. Either way, it also appears to stall when printing the loss or calling .item() on it.

Can anyone suggest where it might be going wrong or how I could figure this kind of problem out? I've tried using your SingleGPU class but it didn't seem to help...
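As context for the stalls described above: CUDA kernels are launched asynchronously, so the first call that forces a host-device synchronization (such as `.item()`, a tensor transfer, or printing the loss) absorbs the time of all previously queued work. A minimal timing sketch with explicit synchronization points can make the real bottleneck visible; `model`, `criterion`, and `optimizer` below are placeholders, not names from this repo:

```python
import time
import torch

def timed_step(model, images, targets, criterion, optimizer):
    # Synchronize before and after each phase so the cost shows up in the
    # right place instead of piling up on the next .item() or print().
    torch.cuda.synchronize()
    t0 = time.time()

    logits = model(images)
    loss = criterion(logits, targets)
    torch.cuda.synchronize()
    t1 = time.time()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    torch.cuda.synchronize()
    t2 = time.time()

    print(f"forward: {t1 - t0:.3f}s  backward+step: {t2 - t1:.3f}s")
    return loss.item()
```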

geekboood commented 5 years ago

Hi! I'm encountering the same issue: my training is about 3.5x slower too. Did you figure it out? It seems that an end-to-end model uses more BatchNorm layers, which would explain roughly a 2x overhead, but 3.5x slower is still a surprisingly large slowdown.

brucemuller commented 5 years ago

Hi @geekboood

Since then I've found I can actually train a single epoch in under an hour by using a much more powerful GPU, or multiple GPUs. Can you try this?

The problem I have now is that the results after training for many epochs don't look good, and I only achieve a mean IoU of about 8% on the Vistas validation set. I've attached some example predictions and the training curves. Do you have an idea of what the problem could be? I'd be really interested to know how you are training or what hyperparameters you found worked :) I'm not sure whether the original paper used pre-training for the segmentation task, so maybe that's where the improvement lies... Attachments: abn_vistas_loss.pdf, abn_vistas_IoU.pdf, abn_vistas_epoch200_ex6, abn_vistas_epoch200_ex7, abn_vistas_epoch200_ex8, abn_vistas_epoch200_ex9, abn_vistas_epoch200_ex10

geekboood commented 5 years ago

@brucemuller247 I rolled back to the version compatible with PyTorch 0.4.1 and my model trains normally. As for your problem, what is the backbone of your end-to-end network, and what loss function do you use? For semantic segmentation tasks I usually use CrossEntropyLoss, and it converges to 0.0x.
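For reference, a typical per-pixel CrossEntropyLoss setup for semantic segmentation looks like the sketch below; the shapes, the class count, and the 255 ignore index are illustrative assumptions, not settings taken from this repo:

```python
import torch
import torch.nn as nn

# Assumed shapes: logits (N, C, H, W) from the network, labels (N, H, W)
# with integer class ids; 255 marks ignored/unlabeled pixels (a common
# convention, not necessarily the one used here).
criterion = nn.CrossEntropyLoss(ignore_index=255)

logits = torch.randn(2, 65, 128, 128)          # e.g. ~65 Vistas classes
labels = torch.randint(0, 65, (2, 128, 128))
labels[:, :8, :8] = 255                        # some ignored pixels

loss = criterion(logits, labels)
print(loss.item())
```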

brucemuller commented 5 years ago

@geekboood Thanks for your reply, good idea to roll back. Are you using multiple GPUs when it trains normally?

For the backbone I use the WiderResNet (https://github.com/mapillary/inplace_abn/blob/master/models/wider_resnet.py) they state in the paper. For the loss I also use CrossEntropyLoss, but I'm not getting that kind of convergence. What order of magnitude learning rate and batch size are you using? Are you using standard SGD and a fixed learning rate? Are you using any pre-training or class weighting? Are you using the same backbone and head Mapillary use in their paper? Are you using many augmentations? There are so many parameters/options to consider, so I'd very much appreciate your help/insight!
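These are exactly the knobs that usually matter. As one point of comparison only, a common semantic segmentation baseline (not the configuration from the In-Place ABN paper) is SGD with momentum and a polynomial learning-rate decay, for example:

```python
import torch

# Placeholder 1x1-conv "model"; the hyperparameters are a common
# segmentation baseline (lr 0.01, momentum 0.9, weight decay 1e-4,
# poly decay with power 0.9), assumed here for illustration only.
model = torch.nn.Conv2d(3, 65, kernel_size=1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=1e-4)

max_iter = 90_000
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda it: (1.0 - it / max_iter) ** 0.9)

# In the training loop, after every optimizer.step():
#     scheduler.step()
```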

AmeetR commented 5 years ago

@geekboood I'd also appreciate answers to these same questions; I'm working on a comparison study for my internship and it would be very helpful.

ducksoup commented 5 years ago

Can you try again with the latest version of the library (v1.0.2)?

vakkov commented 5 years ago

I can also confirm that I get slow training times (on a 4-GPU DGX-1 system...) and rather poor results (with the older version of inplace_abn, though). My network is based on TernausNetV2 (which includes inplace_abn and its WideResNet-38 implementation as a decoder). I will later adapt the code for DistributedDataParallel and give the new version a go (I use InPlaceABNSync).
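For anyone attempting the same switch, a minimal DistributedDataParallel setup might look like the sketch below. It assumes inplace_abn >= 1.0, where InPlaceABNSync reduces batch statistics over a torch.distributed process group, and a launch with one process per GPU (e.g. via torch.distributed.launch); the function and variable names are placeholders:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

# Assumption: inplace_abn >= 1.0 exposes InPlaceABNSync, which synchronizes
# statistics over a torch.distributed process group (one process per GPU).
from inplace_abn import InPlaceABNSync  # used inside the model definition

def wrap_distributed(model):
    # Expects the script to be launched with one process per GPU, e.g. via
    # torch.distributed.launch, which sets the env:// rendezvous variables.
    dist.init_process_group(backend="nccl", init_method="env://")
    local_rank = dist.get_rank() % torch.cuda.device_count()
    torch.cuda.set_device(local_rank)
    model = model.cuda(local_rank)
    return DistributedDataParallel(model, device_ids=[local_rank])
```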

ducksoup commented 5 years ago

@vakkov thank you, please let us know about your findings!

Note: reproducing the results in "In-Place Activated BatchNorm for Memory-Optimized Training of DNNs" by training your own models is beyond the scope of this issue tracker. Unfortunately, we don't have the resources necessary to assist GitHub users with this task.

In order to keep the discussion focused on the inplace_abn library (which is the sole focus of this repo), and to streamline the resolution process, we will from now on immediately close off-topic issues.

Thank you for your understanding!

brucemuller commented 5 years ago

I'll try to run more experiments again soon. What kind of training times are other people getting? @vakkov