microsoft / Megatron-DeepSpeed

Ongoing research training transformer language models at scale, including: BERT & GPT-2

about the optimizer param group #387

Open L-hongbin opened 6 months ago

L-hongbin commented 6 months ago

In `optimizer/__init__.py`, in the `get_param_groups` function, why is the `"wd_mult"` key used instead of `"weight_decay"`? Will the `"wd_mult"` parameter actually take effect in the optimizer?
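For context, a common reason for storing a multiplier rather than a fixed `weight_decay` value is that the base weight decay may change over training (e.g. under a decay schedule), and each group's effective decay is then recomputed as `base_wd * wd_mult` at each step. The sketch below is a minimal illustration of that pattern with plain dicts; the function names, the bias-based grouping rule, and the scheduling helper are assumptions for illustration, not the repository's actual implementation:

```python
def get_param_groups(named_params):
    """Split params into decay / no-decay groups tagged with multipliers.

    Hypothetical sketch: biases are conventionally exempt from weight
    decay, so they get wd_mult=0.0 while other weights get wd_mult=1.0.
    """
    decay, no_decay = [], []
    for name, p in named_params:
        (no_decay if name.endswith("bias") else decay).append(p)
    return [
        {"params": decay, "wd_mult": 1.0},
        {"params": no_decay, "wd_mult": 0.0},
    ]


def apply_weight_decay(param_groups, base_wd):
    """Resolve each group's effective weight_decay from its multiplier.

    In a real training loop this would run whenever the scheduled base
    weight decay changes, so "wd_mult" takes effect indirectly: it is
    consumed here to set the "weight_decay" value the optimizer reads.
    """
    for group in param_groups:
        group["weight_decay"] = base_wd * group["wd_mult"]


groups = get_param_groups([("layer.weight", [1.0]), ("layer.bias", [0.0])])
apply_weight_decay(groups, base_wd=0.1)
print([g["weight_decay"] for g in groups])  # -> [0.1, 0.0]
```

Under this reading, `"wd_mult"` is not passed to the optimizer directly; it only takes effect if some scheduler or setup code translates it into each group's `"weight_decay"` as above.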