Closed fengyuan14 closed 4 days ago
https://github.com/intel/torch-xpu-ops/pull/933
Merged
🚀 The feature, motivation and pitch
For accuracy, we followed the PyTorch CUDA implementation, using the Welford algorithm and a similar kernel template. We will improve the kernel template with vectorized load/store.
Alternatives
No response
Additional context
No response