microsoft / Megatron-DeepSpeed

Ongoing research training transformer language models at scale, including: BERT & GPT-2

Add layer norm weight plus 1 #378

Open Yejing-Lai opened 7 months ago

Yejing-Lai commented 7 months ago

This PR implements the apply_layernorm_1p flag. When it is set to True, we need to compute layernorm.weight + 1 before applying the layer norm scale.
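A minimal sketch of the idea, assuming the common weight-plus-one convention (the learnable scale is stored as weight - 1, zero-initialized, and 1 is added back in the forward pass); the NumPy implementation here is illustrative, not the PR's actual code:

```python
import numpy as np

def layernorm_1p(x, stored_weight, bias, eps=1e-5):
    # stored_weight holds (weight - 1); the effective scale is stored_weight + 1.
    # Zero-initializing stored_weight then corresponds to an identity scale.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return x_hat * (stored_weight + 1.0) + bias

x = np.random.randn(2, 8).astype(np.float32)
# Zero-initialized stored weight and bias: output equals plain normalization.
out = layernorm_1p(x, np.zeros(8, np.float32), np.zeros(8, np.float32))
```

With this convention, forgetting the "+ 1" silently zeroes the layer norm scale, which matches the kind of accuracy issue the PR describes.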

Yejing-Lai commented 5 months ago

Hi @tjruwase, please kindly review~ This PR fixes the layernorm accuracy issue on non-CUDA accelerators. Thanks!