microsoft / Megatron-DeepSpeed

Ongoing research training transformer language models at scale, including: BERT & GPT-2

Add layer norm weight plus 1 #378

Open Yejing-Lai opened 7 months ago

Yejing-Lai commented 7 months ago

This PR implements the apply_layernorm_1p flag. When it is set to True, we need to compute layernorm.weight + 1 before applying the layer norm scale.
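A minimal sketch of the idea, assuming the common weight-plus-one convention (the learnable scale is stored as weight - 1, zero-initialized, and 1 is added back in the forward pass); the NumPy implementation here is illustrative, not the PR's actual code:

```python
import numpy as np

def layernorm_1p(x, stored_weight, bias, eps=1e-5):
    # stored_weight holds (weight - 1); the effective scale is stored_weight + 1.
    # Zero-initializing stored_weight then corresponds to an identity scale.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return x_hat * (stored_weight + 1.0) + bias

x = np.random.randn(2, 8).astype(np.float32)
# Zero-initialized stored weight and bias: output equals plain normalization.
out = layernorm_1p(x, np.zeros(8, np.float32), np.zeros(8, np.float32))
```

With this convention, forgetting the "+ 1" silently zeroes the layer norm scale, which matches the kind of accuracy issue the PR describes.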

Yejing-Lai commented 5 months ago

Hi @tjruwase, please kindly review~ This PR fixes the layernorm accuracy issue on non-CUDA accelerators. Thanks!