use split/squeeze instead of slice for performance

microsoft / Megatron-DeepSpeed

Ongoing research training transformer language models at scale, including: BERT & GPT-2

Closed: polisettyvarma closed this 4 months ago

polisettyvarma commented 4 months ago

There may be no performance difference on GPU, but HPU performance improves by 13.8% with this change.

tjruwase commented 4 months ago

@polisettyvarma, can you share some intuition behind this optimization for HPUs?

polisettyvarma commented 4 months ago

@tjruwase, it is mainly due to the in-place operations caused by slicing.
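
For readers following along, here is a minimal sketch of the two patterns being compared. It does not reproduce the actual diff in this PR: the tensor layout `[seq, batch, heads, 3, head_dim]` and the helper names `qkv_by_slice` and `qkv_by_split` are assumptions made for illustration only. It only shows that the two approaches return the same values; it does not measure performance on any device.

```python
import torch


def qkv_by_slice(mixed):
    # Hypothetical layout: [seq, batch, heads, 3, head_dim].
    # Integer indexing along dim 3 selects one of the three projections.
    q = mixed[:, :, :, 0, :]
    k = mixed[:, :, :, 1, :]
    v = mixed[:, :, :, 2, :]
    return q, k, v


def qkv_by_split(mixed):
    # Split dim 3 into three chunks of size 1, then drop that singleton dim.
    q, k, v = torch.split(mixed, 1, dim=3)
    return q.squeeze(3), k.squeeze(3), v.squeeze(3)


torch.manual_seed(0)
mixed = torch.randn(4, 2, 8, 3, 16)

sliced = qkv_by_slice(mixed)
split = qkv_by_split(mixed)

# Both approaches should yield identical tensors of shape [4, 2, 8, 16].
same = all(torch.equal(a, b) for a, b in zip(sliced, split))
print(same, tuple(split[0].shape))
```

Because the outputs match element for element, swapping one form for the other should be a behavior-preserving change; any speedup would depend on how a given backend lowers each operation.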