Closed: fengyuan14 closed this issue 1 month ago
The three additional copies are introduced by Autocast. torch-xpu-ops aligns its Autocast policy with PyTorch CUDA, where LayerNorm must compute in FP32. In IPEX, by contrast, LayerNorm can stay in BF16 under IPEX's custom Autocast policy.
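For illustration, a minimal sketch of the policy, assuming an XPU-enabled PyTorch 2.5 build (the device string, shapes, and BF16 module dtype are illustrative, not from the report). Because layer_norm is on autocast's FP32 list, its BF16 arguments are up-cast before the kernel runs:

```python
import torch

device = "xpu"  # assumes a build with torch-xpu-ops; "cuda" follows the same policy
x = torch.randn(32, 1024, device=device, dtype=torch.bfloat16)
ln = torch.nn.LayerNorm(1024).to(device, torch.bfloat16)

with torch.autocast(device_type=device, dtype=torch.bfloat16):
    y = ln(x)

# layer_norm is on autocast's FP32 list, so input, weight, and bias are each
# up-cast to FP32 (the three extra copies) and the output comes back in FP32.
print(y.dtype)  # torch.float32
```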
Closing as won't-fix: we currently follow the rule of aligning with the CUDA implementation to guarantee accuracy. If performance becomes a concern in the future, we can file a new issue.
🐛 Describe the bug
In PyTorch 2.5, aten::layer_norm introduces 3 aten::copy calls, which increase the latency from 150 us to 401 us.
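A hedged profiling sketch to observe the extra copies (the device string, shapes, and profiler activities are assumptions, not from the report):

```python
import torch
from torch.profiler import profile, ProfilerActivity

device = "xpu"  # assumes an XPU-enabled PyTorch 2.5 build with torch-xpu-ops
x = torch.randn(32, 1024, device=device, dtype=torch.bfloat16)
ln = torch.nn.LayerNorm(1024).to(device, torch.bfloat16)

with torch.autocast(device_type=device, dtype=torch.bfloat16):
    with profile(activities=[ProfilerActivity.CPU]) as prof:
        ln(x)

# The table is expected to show aten::layer_norm accompanied by three copy
# ops (input, weight, and bias up-cast from BF16 to FP32 by autocast).
print(prof.key_averages().table(sort_by="self_cpu_time_total"))
```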
Versions
Latest torch-xpu-ops