🚀 The feature, motivation and pitch
We followed stock CUDA for grid and block configurations. Those configurations bake in NVIDIA GPU architecture assumptions, so even though we adopted similar configurations, it is not clear what behavior we actually get on the Xe architecture. Two device queries are affected:
- syclMaxWorkItemsPerEU: it is not clear what this yields on the Xe architecture, or how it should be used.
- syclMaxWorkItemsPerTile: we deduce the maximum number of work items per tile from the maximum sub-group size, which is not accurate. When the runtime chooses a kernel built with a non-maximum sub-group size (an IGC optimization), the deduced value does not match the hardware and we may end up with insufficient occupancy.
https://github.com/intel/torch-xpu-ops/blob/e914ada988343c0515753360de68812ea42d0ec3/src/aten/sycl/Loops.h#L330
Alternatives
We don't regard this as the highest priority. We will discuss it when we need to contribute SYCL kernels to in-tree PyTorch.
Additional context
No response