stillmatic opened this issue 8 months ago (Open)
@stillmatic - thanks for pointing this out. For the first part, there is a known bug that I still need to address; can you confirm #3906 is the same ask?
For the second part, I'm not certain and will look into that.
can you confirm https://github.com/microsoft/DeepSpeed/issues/3906 is the same ask?
It is very closely related - that one points out the same model_params -> params inconsistency in the CPU Adam optimizer, but I also note an adamw_mode -> adam_w_mode inconsistency.
Thanks, that makes sense. I'll take a look at updating all of those to match.
Describe the bug
I've noticed a couple of minor inconsistencies with the DeepSpeed-provided optimizers. CPU Adam takes model_params and adamw_mode, while GPU Adam takes params and adam_w_mode. It's a bit annoying that the same instantiation code doesn't work for both optimizers.

One option to maintain backwards compatibility is to create new AdamW optimizers for CPU and GPU, each of which just sets weight decay to 0.01 by default. This would also match the torch implementation, which has Adam with weight decay = 0 and AdamW with weight decay = 0.01.
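Something like the following minimal sketch of that option (the DeepSpeedCPUAdamW / FusedAdamW names here are hypothetical, and the wrappers assume the current DeepSpeedCPUAdam / FusedAdam keyword names):

```python
# Hypothetical thin wrappers, not existing DeepSpeed classes: they only change
# the defaults to mirror torch.optim.AdamW (decoupled decay, weight_decay=0.01).
from deepspeed.ops.adam import DeepSpeedCPUAdam, FusedAdam


class DeepSpeedCPUAdamW(DeepSpeedCPUAdam):
    """CPU AdamW: DeepSpeedCPUAdam with torch.optim.AdamW-style defaults."""

    def __init__(self, model_params, lr=1e-3, weight_decay=0.01, **kwargs):
        kwargs.setdefault("adamw_mode", True)  # decoupled (AdamW-style) weight decay
        super().__init__(model_params, lr=lr, weight_decay=weight_decay, **kwargs)


class FusedAdamW(FusedAdam):
    """GPU AdamW: FusedAdam with torch.optim.AdamW-style defaults."""

    def __init__(self, params, lr=1e-3, weight_decay=0.01, **kwargs):
        kwargs.setdefault("adam_w_mode", True)  # decoupled (AdamW-style) weight decay
        super().__init__(params, lr=lr, weight_decay=weight_decay, **kwargs)
```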
To Reproduce
DeepSpeed optimizer docs: https://deepspeed.readthedocs.io/en/latest/optimizers.html
Torch AdamW docs: https://pytorch.org/docs/stable/generated/torch.optim.AdamW.html
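A minimal repro sketch of the keyword mismatch (assumes a DeepSpeed install where the CPU/fused Adam ops can build; the model is just a placeholder):

```python
import torch
from deepspeed.ops.adam import DeepSpeedCPUAdam, FusedAdam

model = torch.nn.Linear(8, 8)

# CPU Adam: first argument is `model_params`, AdamW behaviour is `adamw_mode`.
cpu_opt = DeepSpeedCPUAdam(model_params=model.parameters(), lr=1e-3, adamw_mode=True)

# The same keywords do not work for the GPU (fused) optimizer, which expects
# `params` and `adam_w_mode` instead -- this raises a TypeError.
gpu_opt = FusedAdam(model_params=model.parameters(), lr=1e-3, adamw_mode=True)
```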
Expected behavior