Closed: gaow0007 closed this issue 3 years ago.

It seems that single-GPU training does not support adaptive batch size and uses the default `init_batch_size`. Any special consideration here? https://github.com/petuum/adaptdl/blob/d83e4ceef0cf3863d842bb2744181379cc3cd0e7/adaptdl/adaptdl/goodput.py#L128
@gaow0007 you are right. On a single GPU, the approximation for the gradient noise scale doesn't work very well, so we avoid scaling the batch size and learning rate to mitigate convergence issues.
Thanks a lot!
@aurickq Hi, could you elaborate on why the approximation for GNS is not very good for single-GPU training? Is it because `adaptdl` is using the approximation for multi-GPU training from Appendix A.1 in the original GNS paper?
Sure. You are right that it is because we are using the formula from Appendix A.1 of the GNS paper, which depends on having two or more gradients evaluated using the current model parameters. When there are two or more GPUs, then the per-GPU gradients can be used. When there is only a single GPU, then the formula results in a division by zero.
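For concreteness, here is a minimal sketch of those Appendix A.1 estimators (my own illustration with made-up names, not AdaptDL's actual code), which makes the division by zero visible:

```python
import numpy as np

def gns_estimate(grads, b_small):
    """Gradient noise scale from several gradients evaluated at the same
    model parameters, following Appendix A.1 of the GNS paper. A sketch
    for illustration only, not AdaptDL's implementation."""
    k = len(grads)                  # e.g. number of GPUs
    b_big = k * b_small             # batch size of the averaged gradient
    g_big = np.mean(grads, axis=0)  # average of the per-GPU gradients
    sqr_small = np.mean([np.sum(g ** 2) for g in grads])
    sqr_big = np.sum(g_big ** 2)
    # Unbiased estimates of |G|^2 and tr(Sigma). Both denominators are
    # zero when k == 1 (b_big == b_small), the single-GPU failure mode:
    grad_sqr = (b_big * sqr_big - b_small * sqr_small) / (b_big - b_small)
    grad_var = (sqr_small - sqr_big) / (1.0 / b_small - 1.0 / b_big)
    return grad_var / grad_sqr      # simple noise scale
```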
With that said, AdaptDL can still compute the GNS on a single GPU in one of the following ways:

1. Use gradient accumulation, so that the gradients from the individual accumulation steps, which are all evaluated with the same model parameters, can play the role of the per-GPU gradients (see the toy example after this list).
2. Approximate the GNS using the gradient from the previous step, treating gradients from consecutive steps as if they were evaluated with the same model parameters.
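For example, a toy run of the first option, reusing the hypothetical `gns_estimate` sketch above: the accumulated micro-batch gradients simply stand in for the per-GPU gradients.

```python
# Toy example of option 1 (gradient accumulation on a single GPU),
# reusing the gns_estimate sketch above. Simulated gradients, not real
# training.
rng = np.random.default_rng(0)
true_grad = np.ones(1000)
micro_batch_size = 32
# Four micro-batch gradients: the true gradient plus per-sample noise
# scaled down by sqrt(batch size).
micro_grads = [
    true_grad + rng.normal(size=1000) / np.sqrt(micro_batch_size)
    for _ in range(4)
]
print(gns_estimate(micro_grads, b_small=micro_batch_size))
```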
Thanks a lot for the detailed explanation!
I blindly tried running `adaptdl` with one replica after removing the `num_replicas == 1` condition (taking the `False` branch in `np.where`), and some non-NaN number for efficiency actually gets computed. It seems that `_grad_params.sqr` and `_grad_params.var` are used - are these the first approach you described, or am I just computing some random value?
Assuming `accumulation == False` (otherwise that line of code would not run in the first place), it should be taking the second approach, where it approximates the GNS based on gradients from the previous step.
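To illustrate (again a rough sketch, not the actual AdaptDL code), that second approach amounts to treating the current and previous step's gradients as two samples evaluated at roughly the same parameters, and feeding them to the same estimator:

```python
# Rough sketch of the second approach (not AdaptDL's implementation):
# treat gradients from two consecutive steps as if they were evaluated
# at the same model parameters. Only an approximation, since the
# parameters change between steps, but it avoids the division by zero.
class SingleGpuGns:
    def __init__(self):
        self.prev_grad = None

    def update(self, grad, batch_size):
        if self.prev_grad is None:
            self.prev_grad = grad
            return None  # need two gradients before estimating
        estimate = gns_estimate([self.prev_grad, grad], b_small=batch_size)
        self.prev_grad = grad
        return estimate
```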
I see. Thanks for the reply! 👍