microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] inconsistent optimizer naming and defaults #5214

Open stillmatic opened 8 months ago

stillmatic commented 8 months ago

Describe the bug

I've noticed a couple of minor inconsistencies with the DeepSpeed-provided optimizers.

  1. I expect CPU Adam and GPU Adam to have identical argument naming. However, CPU Adam has modelparams and adamw_mode while GPU Adam has params and adam_w_mode. It's a bit annoying that the same instantiation code doesn't work for both optimizers (see the sketch under To Reproduce below).
  2. The default weight decay on both optimizers is 0, which is confusing given that both default to AdamW mode (which I think is also what most practitioners use). My understanding is that AdamW with 0 weight decay is equivalent to plain Adam, which is probably not what practitioners intended to use. The PyTorch AdamW default uses 0.01 for weight decay, and the docs claim:

> apex.optimizers.FusedAdam may be used as a drop-in replacement for torch.optim.AdamW, or torch.optim.Adam with adam_w_mode=False.

One option that maintains backwards compatibility is to create new AdamW optimizers for CPU and GPU, each of which just sets weight decay to 0.01 by default. This would also match the torch implementation, which has Adam with weight_decay=0 and AdamW with weight_decay=0.01.
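A rough sketch of what that could look like, assuming subclassing the existing optimizers is acceptable. DeepSpeedCPUAdamW and FusedAdamW are hypothetical names (they do not exist in DeepSpeed), and the constructor keywords are taken from the docs linked below, so they may differ by version:

```python
# Hypothetical sketch of the proposal, not existing DeepSpeed API: thin AdamW
# wrappers that only change the defaults (AdamW mode on, weight_decay=0.01),
# mirroring how torch.optim.AdamW differs from torch.optim.Adam.
from deepspeed.ops.adam import DeepSpeedCPUAdam, FusedAdam


class DeepSpeedCPUAdamW(DeepSpeedCPUAdam):  # hypothetical name
    def __init__(self, params, weight_decay=0.01, **kwargs):
        # Force AdamW behavior and the torch-style 0.01 decay default.
        super().__init__(params, weight_decay=weight_decay, adamw_mode=True, **kwargs)


class FusedAdamW(FusedAdam):  # hypothetical name
    def __init__(self, params, weight_decay=0.01, **kwargs):
        # Same defaults, but note the differently spelled adam_w_mode keyword.
        super().__init__(params, weight_decay=weight_decay, adam_w_mode=True, **kwargs)
```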

To Reproduce

DeepSpeed optimizer docs: https://deepspeed.readthedocs.io/en/latest/optimizers.html
Torch AdamW docs: https://pytorch.org/docs/stable/generated/torch.optim.AdamW.html
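A minimal sketch of the naming mismatch, assuming a DeepSpeed install with the CPU and fused CUDA Adam ops built; keyword names as I read them in the docs above:

```python
# Minimal repro sketch: the same AdamW switch is spelled differently on the
# two optimizers, so shared instantiation code can't pass it by keyword.
import torch
from deepspeed.ops.adam import DeepSpeedCPUAdam, FusedAdam

model = torch.nn.Linear(8, 8)

# CPU Adam: the switch is `adamw_mode`; weight_decay defaults to 0.
cpu_opt = DeepSpeedCPUAdam(model.parameters(), adamw_mode=True, weight_decay=0.01)

# GPU (fused) Adam: the same switch is spelled `adam_w_mode`.
gpu_opt = FusedAdam(model.parameters(), adam_w_mode=True, weight_decay=0.01)

# Passing the CPU spelling to the GPU optimizer (or vice versa) raises a
# TypeError for an unexpected keyword argument, so a single factory/config
# path has to special-case the two optimizers.
```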

Expected behavior

Consistent argument names between the CPU and GPU Adam optimizers, and AdamW defaults that match torch (weight_decay=0.01), or separate AdamW variants that do.

loadams commented 7 months ago

@stillmatic - thanks for pointing this out. For the first part, we have an existing bug that I still need to address; can you confirm #3906 is the same ask?

For the second part, I'm not certain and will look into that.

stillmatic commented 7 months ago

> can you confirm https://github.com/microsoft/DeepSpeed/issues/3906 is the same ask?

It is very closely related - that issue points out the same modelparams -> params inconsistency in the CPU Adam optimizer, but I also note an adamw_mode -> adam_w_mode inconsistency.

loadams commented 7 months ago

Thanks, that makes sense. I'll take a look at updating all of those to match.