intel / torch-xpu-ops

Apache License 2.0

Performance: Nonzero: Worse host overhead compared with IPEX #969

Open fengyuan14 opened 3 weeks ago

fengyuan14 commented 3 weeks ago

🐛 Describe the bug

Variant       Name           Self CPU %  Self CPU  CPU total %  CPU total  CPU time avg  Self XPU  Self XPU %  XPU total  XPU time avg  # of Calls
non override  aten::nonzero  5.50%       63.456ms  54.60%       630.302ms  489.365us     5.160ms   4.35%       34.468ms   26.761us      1288
override      aten::nonzero  5.40%       58.551ms  52.64%       570.870ms  443.222us     6.688ms   5.55%       34.737ms   26.970us      1288
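
To make the gap in the table concrete, the short sketch below recomputes the per-call CPU and XPU times from the totals and call counts and reports the difference between the two rows. The numbers are copied from the profiler output; the dictionary layout and the helper name `per_call_us` are illustrative only, and the row labels are used exactly as they appear in the table.

```python
# Totals and call counts copied from the profiler table (times in ms).
runs = {
    "non override": {"cpu_total_ms": 630.302, "xpu_total_ms": 34.468, "calls": 1288},
    "override": {"cpu_total_ms": 570.870, "xpu_total_ms": 34.737, "calls": 1288},
}


def per_call_us(total_ms, calls):
    """Convert a total time in milliseconds to an average per call in microseconds."""
    return total_ms * 1000.0 / calls


# Per-call averages for each row of the table.
summary = {
    label: {
        "cpu_us": per_call_us(r["cpu_total_ms"], r["calls"]),
        "xpu_us": per_call_us(r["xpu_total_ms"], r["calls"]),
    }
    for label, r in runs.items()
}

# Difference between the two rows, per aten::nonzero call.
cpu_gap_us = summary["non override"]["cpu_us"] - summary["override"]["cpu_us"]
xpu_gap_us = summary["non override"]["xpu_us"] - summary["override"]["xpu_us"]
cpu_ratio = runs["non override"]["cpu_total_ms"] / runs["override"]["cpu_total_ms"]

if __name__ == "__main__":
    for label, s in summary.items():
        print(f"{label:>12}: CPU {s['cpu_us']:.3f} us/call, XPU {s['xpu_us']:.3f} us/call")
    print(f"CPU gap per call: {cpu_gap_us:.3f} us ({(cpu_ratio - 1) * 100:.1f}% more CPU total)")
    print(f"XPU gap per call: {xpu_gap_us:.3f} us")
```

The device-side averages differ by well under a microsecond per call, while the CPU-side averages differ by roughly 46 us per call, which is consistent with the issue title attributing the difference to host overhead.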

Versions

Latest torch-xpu-ops compared against the IPEX 2.3 implementation.

majing921201 commented 2 weeks ago

The extra host overhead is caused by the SYCL API that we use to query the kernel-specific maximum work-group size. We have filed an issue with the compiler team to track it: https://github.com/intel/llvm/issues/15824
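
As a rough illustration of why a query issued on every launch shows up as host overhead, the sketch below counts how often a slow lookup runs with and without memoization. Everything here is hypothetical: `query_max_work_group_size` is a stand-in rather than a real SYCL or torch-xpu-ops call, the returned value is a placeholder, and caching is only one generic way such a cost might be amortized, not something proposed in this thread.

```python
import functools

# Counts how many times the (simulated) slow runtime query actually runs.
STATS = {"queries": 0}


def query_max_work_group_size(kernel_name, device_id):
    """Hypothetical stand-in for a slow per-kernel runtime query."""
    STATS["queries"] += 1
    return 256  # placeholder value, not a real device limit


@functools.lru_cache(maxsize=None)
def cached_max_work_group_size(kernel_name, device_id):
    """Memoized wrapper: the slow query runs once per (kernel, device) pair."""
    return query_max_work_group_size(kernel_name, device_id)


def launch(kernel_name, device_id, use_cache):
    """Simulate the host-side part of a launch that needs the work-group limit."""
    query = cached_max_work_group_size if use_cache else query_max_work_group_size
    return query(kernel_name, device_id)


if __name__ == "__main__":
    for _ in range(1288):
        launch("nonzero_kernel", 0, use_cache=False)
    print("queries without cache:", STATS["queries"])

    STATS["queries"] = 0
    for _ in range(1288):
        launch("nonzero_kernel", 0, use_cache=True)
    print("queries with cache:", STATS["queries"])
```

With the uncached path the query count grows with the number of launches, whereas the memoized path pays for it once per kernel and device.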