microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] inconsistent optimizer naming and defaults #5214

Open stillmatic opened 8 months ago

stillmatic commented 8 months ago

Describe the bug

I've noticed a couple of minor inconsistencies with the DeepSpeed-provided optimizers.

  1. I expect CPU Adam and GPU Adam to have identical argument naming. However, CPU Adam has modelparams and adamw_mode while GPU Adam has params and adam_w_mode. It's a bit annoying that the same instantiation code doesn't work for both optimizers (see the sketch under To Reproduce below).
  2. The default weight decay on both optimizers is 0, which is confusing given that both default to AdamW mode (which I think is also what most practitioners use). My understanding is that AdamW with 0 weight decay is equivalent to plain Adam, which is probably not what practitioners intended to use. The PyTorch AdamW default uses 0.01 for weight decay, and the docs claim:

> apex.optimizers.FusedAdam may be used as a drop-in replacement for torch.optim.AdamW, or torch.optim.Adam with adam_w_mode=False.

One option that maintains backwards compatibility is to create new AdamW optimizers for CPU and GPU, each of which just sets weight decay to 0.01 by default. This would also match the torch implementation, which has Adam with weight_decay=0 and AdamW with weight_decay=0.01.
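A rough sketch of what that could look like, assuming subclassing the existing optimizers is acceptable. DeepSpeedCPUAdamW and FusedAdamW are hypothetical names (they do not exist in DeepSpeed), and the constructor keywords are taken from the docs linked below, so they may differ by version:

```python
# Hypothetical sketch of the proposal, not existing DeepSpeed API: thin AdamW
# wrappers that only change the defaults (AdamW mode on, weight_decay=0.01),
# mirroring how torch.optim.AdamW differs from torch.optim.Adam.
from deepspeed.ops.adam import DeepSpeedCPUAdam, FusedAdam


class DeepSpeedCPUAdamW(DeepSpeedCPUAdam):  # hypothetical name
    def __init__(self, params, weight_decay=0.01, **kwargs):
        # Force AdamW behavior and the torch-style 0.01 decay default.
        super().__init__(params, weight_decay=weight_decay, adamw_mode=True, **kwargs)


class FusedAdamW(FusedAdam):  # hypothetical name
    def __init__(self, params, weight_decay=0.01, **kwargs):
        # Same defaults, but note the differently spelled adam_w_mode keyword.
        super().__init__(params, weight_decay=weight_decay, adam_w_mode=True, **kwargs)
```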

To Reproduce

DeepSpeed optimizer docs: https://deepspeed.readthedocs.io/en/latest/optimizers.html
Torch AdamW docs: https://pytorch.org/docs/stable/generated/torch.optim.AdamW.html
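A minimal sketch of the naming mismatch, assuming a DeepSpeed install with the CPU and fused CUDA Adam ops built; keyword names as I read them in the docs above:

```python
# Minimal repro sketch: the same AdamW switch is spelled differently on the
# two optimizers, so shared instantiation code can't pass it by keyword.
import torch
from deepspeed.ops.adam import DeepSpeedCPUAdam, FusedAdam

model = torch.nn.Linear(8, 8)

# CPU Adam: the switch is `adamw_mode`; weight_decay defaults to 0.
cpu_opt = DeepSpeedCPUAdam(model.parameters(), adamw_mode=True, weight_decay=0.01)

# GPU (fused) Adam: the same switch is spelled `adam_w_mode`.
gpu_opt = FusedAdam(model.parameters(), adam_w_mode=True, weight_decay=0.01)

# Passing the CPU spelling to the GPU optimizer (or vice versa) raises a
# TypeError for an unexpected keyword argument, so a single factory/config
# path has to special-case the two optimizers.
```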

Expected behavior

Consistent argument names between the CPU and GPU Adam optimizers, and AdamW defaults that match torch (weight_decay=0.01), or separate AdamW variants that do.

loadams commented 7 months ago

@stillmatic - thanks for pointing this out. For the first part, we have an existing bug that I still need to address; can you confirm #3906 is the same ask?

For the second part, I'm not certain and will look into that.

stillmatic commented 7 months ago

> can you confirm https://github.com/microsoft/DeepSpeed/issues/3906 is the same ask?

It is very closely related - that issue points out the same modelparams -> params inconsistency in the CPU Adam optimizer, but I also note an adamw_mode -> adam_w_mode inconsistency.

loadams commented 7 months ago

Thanks, that makes sense. I'll take a look at updating all of those to match.