intel / torch-xpu-ops

Apache License 2.0

Evaluate configurations of SYCL global and local range for kernel launch #135

Open fengyuan14 opened 5 months ago

fengyuan14 commented 5 months ago

🚀 The feature, motivation and pitch

We followed stock CUDA's grid and block configurations. Those configurations bake in assumptions about NVIDIA GPU architecture, so even though we adopted similar ones, it is not clear what behavior we actually get on the Xe architecture.

  1. syclMaxWorkItemsPerEU: It is unclear what we get on the Xe architecture when using it.
  2. syclMaxWorkItemsPerTile: We use the maximum sub-group size to deduce the maximum number of work items per tile. This is not accurate: when the runtime chooses a kernel compiled with a smaller sub-group size (an IGC optimization), we may end up with insufficient occupancy.

https://github.com/intel/torch-xpu-ops/blob/e914ada988343c0515753360de68812ea42d0ec3/src/aten/sycl/Loops.h#L330

  int wg_sz = syclMaxWorkItemsPerEU();                      // work-group size
  int num_wg = ceil_div<int>(N, wg_sz);                     // groups needed to cover N
  int hw_max_num_wg = syclMaxWorkItemsPerTile() / wg_sz;    // groups a tile can host
  num_wg = num_wg > hw_max_num_wg ? hw_max_num_wg : num_wg; // clamp to tile capacity
  sycl_kernel_submit(wg_sz * num_wg, wg_sz, getCurrentSYCLQueue(), ker);

Alternatives

We won't treat this as the highest priority. We will revisit it when we need to contribute SYCL kernels in-tree, because:

  1. GPU hardware availability is limited.
  2. No performance issues have been observed on IPEX so far.

Additional context

No response

fengyuan14 commented 4 months ago

Using syclMaxWorkItemsPerTile assumes explicit scaling of GPU resources (targeting a single tile); otherwise we should consider all resources of the device.