Closed maleadt closed 2 months ago
521 is a prime number, and LWS must divide GWS, so the only divisors are 521 and 1. If your device's maximum LWS is 256 or 512, then 1x1x1 is the only option, so the result is expected.
Same story with 1000001: the largest divisor that fits is 101, which ends up giving the best occupancy.
The driver works as expected; can you share what your expectation is here?
can you share what is your expectation here?
A way to query a "good" launch configuration that maximizes occupancy of a kernel without oversubscribing resources. I didn't realize that the driver returns a configuration that exactly covers the input size; I'm doing a bounds check at the start of my kernel anyway, so I don't care about exact coverage. I guess ZE_extension_kernel_max_group_size_properties is what I need then, although in the case of CUDA, the block size returned by cuOccupancyMaxPotentialBlockSize can be lower than the maximum thread count for a kernel (when the calculator determines that launching more threads wouldn't improve occupancy).
I see; unfortunately this function will not give you that information, as it strictly needs to obey the global work size passed as input.
ZE_extension_kernel_max_group_size_properties will tell you the maximum workgroup size that the kernel can support. You can then use it to compute a properly aligned global work size and, via bounds checking in the kernel, drop the unnecessary work items.
Thank you.
use it to compute properly aligned global work size
By ensuring the group size is a multiple of ze_kernel_preferred_group_size_properties_t's preferredMultiple?
Your GWS must be a multiple of LWS, so for the 521 case you may align GWS up to 544 (32-element alignment), which you can execute with 17 workgroups of 32 work items each. You would need bounds checking in the kernel so that work items with a global id greater than 520 do nothing.
I work on oneAPI.jl, a package for programming Intel GPUs in Julia. To launch kernels, we often use zeKernelSuggestGroupSize to determine a good workgroup size. However, this function often gives strange or suboptimal results. For example, for a dummy (completely empty) kernel on a device with maxTotalGroupSize=512, prime-sized inputs always return a single-element group (suggest_groupsize is a simple wrapper around zeKernelSuggestGroupSize). Am I misunderstanding how zeKernelSuggestGroupSize works, or what purpose it serves? Or is this a broken implementation here in the driver? I assumed that it was similar to cuOccupancyMaxPotentialBlockSize. For other input sizes, the returned configuration also seems very suboptimal.
It doesn't seem optimal to launch 9901 groups where I could have used far fewer, larger groups.
I also came across ZE_extension_kernel_max_group_size_properties, but there hasn't been a driver release with that extension yet. As a workaround, I'm now using maxTotalGroupSize/2 (to account for the case where a kernel may use lots of registers).