petuum / adaptdl

Resource-adaptive cluster scheduler for deep learning training.
https://adaptdl.readthedocs.io/
Apache License 2.0
422 stars 76 forks

Adaptive Batch Size for Single-GPU training #98

Closed gaow0007 closed 3 years ago

gaow0007 commented 3 years ago

It seems that single-GPU training does not support adaptive batch size and utilizes the default init_batch_size. Any special consideration here? https://github.com/petuum/adaptdl/blob/d83e4ceef0cf3863d842bb2744181379cc3cd0e7/adaptdl/adaptdl/goodput.py#L128

aurickq commented 3 years ago

@gaow0007 you are right. On a single GPU, the approximation for the gradient noise scale doesn't work very well, so we avoid scaling the batch size and learning rate to mitigate convergence issues.

gaow0007 commented 3 years ago

Thanks a lot!

jaywonchung commented 2 years ago

@aurickq Hi, could you elaborate on why the approximation for GNS is not very good for single GPU training? Is it because adaptdl is using the approximation for multi-GPU training from Appendix A.1 in the original GNS paper?

aurickq commented 2 years ago

> @aurickq Hi, could you elaborate on why the approximation for GNS is not very good for single GPU training? Is it because adaptdl is using the approximation for multi-GPU training from Appendix A.1 in the original GNS paper?

Sure. You are right that it is because we are using the formula from Appendix A.1 of the GNS paper, which depends on having two or more gradients evaluated using the current model parameters. When there are two or more GPUs, then the per-GPU gradients can be used. When there is only a single GPU, then the formula results in a division by zero.
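For concreteness, the Appendix A.1 estimators look roughly like the sketch below (an illustration only, not the actual AdaptDL code; `b_small` is the per-replica batch size and `b_big` the total batch size, names chosen here for clarity):

```python
import numpy as np

def gradient_noise_scale(per_replica_grads, b_small, b_big):
    """Unbiased GNS estimate from per-replica gradients (GNS paper, App. A.1).

    per_replica_grads: flattened gradients, one per replica, all evaluated
    at the same model parameters.
    """
    g_big = np.mean(per_replica_grads, axis=0)    # gradient of the full batch
    sqr_small = np.mean([np.sum(g ** 2) for g in per_replica_grads])
    sqr_big = np.sum(g_big ** 2)

    # Both denominators vanish when b_big == b_small, i.e. a single replica
    # with no gradient accumulation -- hence the division by zero.
    grad_sqr = (b_big * sqr_big - b_small * sqr_small) / (b_big - b_small)
    grad_var = (sqr_small - sqr_big) / (1.0 / b_small - 1.0 / b_big)
    return grad_var / grad_sqr                    # the "simple" noise scale
```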

With that said, AdaptDL can still compute the GNS on a single GPU in one of the following ways:

  1. When gradient accumulation is used, AdaptDL can use the per-step gradients rather than the per-GPU gradients to achieve the same thing. However, gradient accumulation on a single GPU is unlikely to speed up training, so this is rarely done.
  2. Otherwise, AdaptDL tries to use the gradient from the current step together with the gradient from the previous step. However, since these two gradients are evaluated using different model parameters, this is only a biased estimate. We do not use this estimate for scaling the learning rate, since it can noticeably degrade validation accuracy. A rough sketch of this fallback is given below.
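To illustrate the second fallback (again only a sketch under the assumptions above, not the AdaptDL source; the class name is made up for this example), the previous step's gradient is paired with the current one and fed through the same two-sample estimator, which is where the bias comes from:

```python
import numpy as np

class BiasedSingleGPUGNS:
    """Single-GPU fallback: pair the current gradient with the gradient from
    the previous optimizer step.  The two gradients were evaluated at
    different model parameters, so the resulting estimate is biased."""

    def __init__(self, batch_size):
        self.batch_size = batch_size
        self._prev_grad = None

    def update(self, grad):
        prev, self._prev_grad = self._prev_grad, grad
        if prev is None:
            return None                    # need two steps before estimating
        g_avg = (grad + prev) / 2.0
        sqr_small = (np.sum(grad ** 2) + np.sum(prev ** 2)) / 2.0
        sqr_big = np.sum(g_avg ** 2)
        b_small, b_big = self.batch_size, 2 * self.batch_size
        grad_sqr = (b_big * sqr_big - b_small * sqr_small) / (b_big - b_small)
        grad_var = (sqr_small - sqr_big) / (1.0 / b_small - 1.0 / b_big)
        return grad_var / grad_sqr
```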

jaywonchung commented 2 years ago

Thanks a lot for the detailed explanation!

I blindly tried running adaptdl with one replica after removing the num_replicas == 1 condition (taking the False branch in np.where), and it actually computes a non-NaN number for efficiency. It seems that _grad_params.sqr and _grad_params.var are used - are these from the first approach you described, or am I just computing some random value?

aurickq commented 2 years ago

> Thanks a lot for the detailed explanation!
>
> I blindly tried running adaptdl with one replica after removing the num_replicas == 1 condition (taking the False branch in np.where), and it actually computes a non-NaN number for efficiency. It seems that _grad_params.sqr and _grad_params.var are used - are these from the first approach you described, or am I just computing some random value?

Assuming accumulation == False (otherwise that line of code would not run in the first place), it should be taking the second approach where it approximates the GNS based on gradients from the previous step.
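As a side note on the efficiency number you saw: once `grad_sqr` and `grad_var` are available (the quantities behind `_grad_params.sqr` and `_grad_params.var`), a statistical efficiency can be derived from the GNS paper's progress-per-step relation. The sketch below is only a rough rendering of that relation, not the exact code path you hit; the function name and the normalization against `init_batch_size` are assumed for illustration:

```python
def statistical_efficiency(grad_sqr, grad_var, batch_size, init_batch_size):
    """Per-example training progress at `batch_size` relative to
    `init_batch_size`, using the GNS paper's 1 / (1 + B_noise / B)
    progress-per-step relation.  Values near 1.0 mean the larger batch
    wastes almost no examples."""
    gns = grad_var / grad_sqr            # "simple" gradient noise scale
    return (init_batch_size + gns) / (batch_size + gns)
```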

jaywonchung commented 2 years ago

I see. Thanks for the reply! 👍