intel / torch-xpu-ops

Apache License 2.0

Performance: LayerNorm: Worse host overhead due to additional copies introduced #977

Closed fengyuan14 closed 1 month ago

fengyuan14 commented 1 month ago

🐛 Describe the bug

In 2.5, aten::layernorm introduces three additional aten::copy calls, which increases latency from 150 us to 401 us.

Versions

Latest torch-xpu-ops

fengyuan14 commented 1 month ago

The three additional copies are introduced by Autocast. torch-xpu-ops aligns its Autocast policy with PyTorch CUDA, where LayerNorm must be computed in FP32. In IPEX, LayerNorm can stay in BF16 under IPEX's custom Autocast policy.
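To illustrate where such copies can come from, here is a minimal sketch that emulates an FP32 Autocast policy by hand on CPU. It is not the torch-xpu-ops or Autocast implementation. The helper `layer_norm_fp32_policy` is hypothetical, and the idea that the three copies correspond to the input, weight, and bias is an assumption for illustration only (it holds only when all three tensors are BF16).

```python
import torch
import torch.nn.functional as F


def layer_norm_fp32_policy(x, weight, bias, eps=1e-5):
    """Hypothetical helper: emulate an FP32 policy by casting every
    non-FP32 argument to FP32 before calling layer_norm."""
    casts = 0

    def to_fp32(t):
        nonlocal casts
        if t.dtype != torch.float32:
            casts += 1  # each cast materializes a new FP32 tensor (a copy)
            return t.to(torch.float32)
        return t

    x32, w32, b32 = to_fp32(x), to_fp32(weight), to_fp32(bias)
    out = F.layer_norm(x32, x32.shape[-1:], w32, b32, eps)
    return out, casts


# Assumption: input, weight, and bias are all BF16.
x = torch.randn(4, 8, dtype=torch.bfloat16)
w = torch.ones(8, dtype=torch.bfloat16)
b = torch.zeros(8, dtype=torch.bfloat16)

# FP32 policy: three casts, FP32 result.
out_fp32, n_casts = layer_norm_fp32_policy(x, w, b)

# BF16 policy: no casts, the op runs directly on BF16 tensors.
out_bf16 = F.layer_norm(x, x.shape[-1:], w, b)

print(n_casts, out_fp32.dtype, out_bf16.dtype)
```

Under these assumptions the FP32 path pays for one extra copy per BF16 argument, while the BF16 path issues none. That is the trade-off between accuracy and host overhead described above.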

fengyuan14 commented 1 month ago

Closing as won't fix: we currently follow the rule of aligning with the CUDA implementation to guarantee accuracy. If performance becomes a concern in the future, we can file a new issue.