open-mmlab / mmcv

OpenMMLab Computer Vision Foundation
https://mmcv.readthedocs.io/en/latest/
Apache License 2.0

Using gradient accumulation #1952

Open lorinczszabolcs opened 2 years ago

lorinczszabolcs commented 2 years ago

Checklist

  1. I have searched related issues but cannot get the expected help. ✅
  2. I have read the FAQ documentation but cannot get the expected help. ✅

Hi!

Let's say there is a model that trains with samples_per_gpu=16 for 40k iterations. If the model does not fit in memory, one would use gradient accumulation: samples_per_gpu=4 and cumulative_iters=4. Does this version need to run for 160k iterations (4x the original) to take the same number of optimization steps as the original? My current understanding is that if we run only for 40k iterations with cumulative_iters=4, we end up with only 40k/4 = 10k optimizer steps at an effective batch size of 16, which is not the same as 40k optimizer steps at the original batch size of 16. A sketch of the two setups is below.
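For reference, a rough sketch of the two setups I mean (field names follow common mmcv-style configs; my actual config is longer, so treat this as illustrative only):

```python
# (a) Original: effective batch size 16, one optimizer step per iteration,
#     so max_iters=40000 gives 40k optimizer steps.
data = dict(samples_per_gpu=16)
runner = dict(type='IterBasedRunner', max_iters=40000)
optimizer_config = dict(grad_clip=None)

# (b) With gradient accumulation: same effective batch size (4 * 4 = 16),
#     but the optimizer only steps once every cumulative_iters iterations,
#     so max_iters=40000 yields only 40000 / 4 = 10000 optimizer steps.
#     Matching (a) would seemingly require max_iters=160000.
data = dict(samples_per_gpu=4)
runner = dict(type='IterBasedRunner', max_iters=40000)  # -> 160000 to match?
optimizer_config = dict(
    type='GradientCumulativeOptimizerHook', cumulative_iters=4)
```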

Thanks for your help in advance!

All the best, Szabi

imabackstabber commented 2 years ago

Yes, I think so. As you said, it comes down to the number of optimizer steps. But for an epoch-based method like EpochBasedRunner in mmcv, I don't think extra epochs are needed: with samples_per_gpu=4 one epoch contains 4x as many iterations as with samples_per_gpu=16, so with cumulative_iters=4 the number of optimizer steps per epoch stays the same (see the sketch below). Hope it helps.
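A back-of-the-envelope sketch of that per-epoch arithmetic (the dataset size here is made up, purely for illustration):

```python
# Illustrative arithmetic only; dataset size is arbitrary.
dataset_size = 64000

# samples_per_gpu=16, no accumulation: one optimizer step per iteration.
iters_per_epoch_16 = dataset_size // 16          # 4000 iterations
steps_per_epoch_16 = iters_per_epoch_16          # 4000 optimizer steps

# samples_per_gpu=4 with cumulative_iters=4: 4x more iterations per epoch,
# but the optimizer only steps once every 4 iterations.
iters_per_epoch_4 = dataset_size // 4            # 16000 iterations
steps_per_epoch_4 = iters_per_epoch_4 // 4       # 4000 optimizer steps

assert steps_per_epoch_16 == steps_per_epoch_4   # same steps per epoch
```

So for EpochBasedRunner the per-epoch step count is unchanged, whereas for an iteration-based schedule the total must be scaled by cumulative_iters.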

lorinczszabolcs commented 2 years ago

I see, thank you for the quick help.

Maybe a warning / note could be added about this behavior. I first assumed that setting samples_per_gpu=4 and cumulative_iters=4 would be essentially equivalent to samples_per_gpu=16. Even though the docstring says "almost equals", it gave the impression that, apart from the issues caused by combining gradient accumulation with BN, the training would be the same. Alternatively, the implementation could be changed so that the trainings are actually equivalent.
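For anyone landing here later, a minimal, framework-agnostic sketch of the usual accumulation pattern (not mmcv's actual hook code), which also shows why it is only "almost" equivalent to a larger batch when BN is involved:

```python
def train_epoch(model, optimizer, criterion, data_loader, cumulative_iters=4):
    """Generic gradient accumulation loop (a sketch, not mmcv's actual hook)."""
    optimizer.zero_grad()
    for i, (inputs, targets) in enumerate(data_loader):
        loss = criterion(model(inputs), targets)
        # Scale the loss so the accumulated gradients match the average over
        # the larger effective batch.
        (loss / cumulative_iters).backward()
        if (i + 1) % cumulative_iters == 0:
            optimizer.step()       # one optimizer step every cumulative_iters
            optimizer.zero_grad()
    # Caveat: BatchNorm statistics are still computed on the small
    # per-iteration batch, which is why accumulation only "almost equals"
    # a genuinely larger batch size.
```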