intel / torch-xpu-ops

Apache License 2.0

Evaluate configurations of SYCL global and local range for kernel launch #135

Open fengyuan14 opened 5 months ago

fengyuan14 commented 5 months ago

🚀 The feature, motivation and pitch

We followed stock CUDA's grid and block configurations. Those configurations bake in assumptions about NVIDIA GPU architecture, so even though we adopted similar ones, it is not clear what behavior we actually get on the Xe architecture.

  1. syclMaxWorkItemsPerEU: It is unclear what we get on the Xe architecture when using it.
  2. syclMaxWorkItemsPerTile: We use the maximum sub-group size to deduce the maximum number of work items per tile. This is not accurate: when the runtime chooses a kernel compiled with a smaller sub-group size (an IGC optimization), we may end up with insufficient occupancy.

https://github.com/intel/torch-xpu-ops/blob/e914ada988343c0515753360de68812ea42d0ec3/src/aten/sycl/Loops.h#L330

  int wg_sz = syclMaxWorkItemsPerEU();                      // work-group size
  int num_wg = ceil_div<int>(N, wg_sz);                     // groups needed to cover N
  int hw_max_num_wg = syclMaxWorkItemsPerTile() / wg_sz;    // groups a tile can host
  num_wg = num_wg > hw_max_num_wg ? hw_max_num_wg : num_wg; // clamp to tile capacity
  sycl_kernel_submit(wg_sz * num_wg, wg_sz, getCurrentSYCLQueue(), ker);

Alternatives

We won't treat this as the highest priority. We will revisit it when we need to contribute SYCL kernels in-tree, because:

  1. GPU hardware availability is limited.
  2. No performance issues have been observed on IPEX so far.

Additional context

No response

fengyuan14 commented 4 months ago

Using syclMaxWorkItemsPerTile assumes explicit scaling of GPU resources (targeting a single tile); otherwise we should consider all resources of the device.