microsoft / Megatron-DeepSpeed

Ongoing research training transformer language models at scale, including: BERT & GPT-2

about the optimizer param group #387

Open L-hongbin opened 6 months ago

L-hongbin commented 6 months ago

In `optimizer/__init__.py`, in the `get_param_groups` function, why is the `"wd_mult"` key used instead of `"weight_decay"`? Will the `"wd_mult"` parameter actually take effect in the optimizer?
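For context, a common reason for storing a multiplier rather than a fixed `weight_decay` value is that the base weight decay may change over training (e.g. under a decay schedule), and each group's effective decay is then recomputed as `base_wd * wd_mult` at each step. The sketch below is a minimal illustration of that pattern with plain dicts; the function names, the bias-based grouping rule, and the scheduling helper are assumptions for illustration, not the repository's actual implementation:

```python
def get_param_groups(named_params):
    """Split params into decay / no-decay groups tagged with multipliers.

    Hypothetical sketch: biases are conventionally exempt from weight
    decay, so they get wd_mult=0.0 while other weights get wd_mult=1.0.
    """
    decay, no_decay = [], []
    for name, p in named_params:
        (no_decay if name.endswith("bias") else decay).append(p)
    return [
        {"params": decay, "wd_mult": 1.0},
        {"params": no_decay, "wd_mult": 0.0},
    ]


def apply_weight_decay(param_groups, base_wd):
    """Resolve each group's effective weight_decay from its multiplier.

    In a real training loop this would run whenever the scheduled base
    weight decay changes, so "wd_mult" takes effect indirectly: it is
    consumed here to set the "weight_decay" value the optimizer reads.
    """
    for group in param_groups:
        group["weight_decay"] = base_wd * group["wd_mult"]


groups = get_param_groups([("layer.weight", [1.0]), ("layer.bias", [0.0])])
apply_weight_decay(groups, base_wd=0.1)
print([g["weight_decay"] for g in groups])  # -> [0.1, 0.0]
```

Under this reading, `"wd_mult"` is not passed to the optimizer directly; it only takes effect if some scheduler or setup code translates it into each group's `"weight_decay"` as above.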