microsoft / Megatron-DeepSpeed

Ongoing research training transformer language models at scale, including: BERT & GPT-2

use split/squeeze instead of slice for performance #409

Closed: polisettyvarma closed this 4 months ago

polisettyvarma commented 4 months ago

This may make no performance difference on GPU, but it improves HPU performance by 13.8%.
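
For illustration, the kind of change named in the title can be sketched as follows. This is a minimal PyTorch sketch, not the actual diff of this PR: the tensor name `mixed`, its shape, and the split sizes are hypothetical. It only shows that `torch.split` followed by `squeeze` yields the same values as negative-index slicing along one dimension.

```python
import torch

# Hypothetical fused tensor: the last-but-one dimension holds
# q_per_group query entries followed by one key and one value entry.
seq, batch, groups, q_per_group, head_dim = 4, 2, 3, 2, 8
mixed = torch.randn(seq, batch, groups, q_per_group + 2, head_dim)

# Slice-based version: three separate indexing operations.
q_slice = mixed[:, :, :, :-2, :]
k_slice = mixed[:, :, :, -2, :]
v_slice = mixed[:, :, :, -1, :]

# split/squeeze version: one split along dim=-2, then drop the
# size-1 dimension left on the key and value chunks.
q_split, k_split, v_split = torch.split(mixed, [q_per_group, 1, 1], dim=-2)
k_split = k_split.squeeze(-2)
v_split = v_split.squeeze(-2)

# Both versions produce identical tensors.
same = (
    torch.equal(q_slice, q_split)
    and torch.equal(k_slice, k_split)
    and torch.equal(v_slice, v_split)
)
print(same)
```

Whether the second form is faster depends on the backend. The 13.8% figure is the author's HPU measurement and is not reproduced by this sketch.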

tjruwase commented 4 months ago

@polisettyvarma, can you share some intuition behind this optimization for HPUs?

polisettyvarma commented 4 months ago

@tjruwase, it is mainly due to the in-place operations caused by slice, which split/squeeze avoids.