intel / compute-runtime

Intel® Graphics Compute Runtime for oneAPI Level Zero and OpenCL™ Driver
MIT License

Weird result with (or misunderstanding of) zeKernelSuggestGroupSize #725

Closed: maleadt closed this 2 months ago

maleadt commented 2 months ago

I work on oneAPI.jl, a package for programming Intel GPUs in Julia. To launch kernels, we often use zeKernelSuggestGroupSize to determine a good workgroup size. However, this function often gives strange or suboptimal results. For example, for this dummy (completely empty) kernel on a device with maxTotalGroupSize=512, prime-sized inputs always yield a single-element group:

julia> k = @oneapi launch=false identity(nothing)

julia> oneL0.suggest_groupsize(k.fun, (521,1,1))
oneAPI.oneL0.ZeDim3(1, 1, 1)

suggest_groupsize is a simple wrapper around zeKernelSuggestGroupSize:

julia> x, y, z = Ref{UInt32}(), Ref{UInt32}(), Ref{UInt32}();

julia> oneL0.zeKernelSuggestGroupSize(k.fun, 521, 1, 1, x, y, z)

julia> x[]
0x00000001

julia> y[]
0x00000001

julia> z[]
0x00000001

Am I misunderstanding how zeKernelSuggestGroupSize works or what purpose it serves? Or is this a broken implementation here in the driver? I assumed that it was similar to cuOccupancyMaxPotentialBlockSize.

For other input sizes, the returned configuration seems very suboptimal:

julia> oneL0.suggest_groupsize(k.fun, (1000001,1,1))
oneAPI.oneL0.ZeDim3(101, 1, 1)

julia> cld(1000001, 101)
9901

It doesn't seem optimal to launch 9901 groups when I could have used far fewer, larger groups.

I also came across ZE_extension_kernel_max_group_size_properties, but there hasn't been a driver release yet with that extension. As a workaround, I'm now using maxTotalGroupSize/2 (to account for kernels that may use many registers).

MichalMrozek commented 2 months ago

521 is a prime number. LWS must evenly divide GWS, so the only possible divisors are 521 and 1. If your device's max LWS is 256/512, then 1x1x1 is the only option, so the result is expected.

Same story with 1000001: the largest divisor that fits here is 101, which gives the best occupancy.

The driver works as expected. Can you share what your expectation is here?

maleadt commented 2 months ago

Can you share what your expectation is here?

A way to query a "good" launch configuration that maximizes occupancy of a kernel without oversubscribing resources. I didn't realize that the driver returns a configuration that exactly covers the input size; I'm doing a bounds check at the start of my kernel anyway, so I don't care about exact coverage. I guess ZE_extension_kernel_max_group_size_properties is what I need then, although in the case of CUDA the block size returned by cuOccupancyMaxPotentialBlockSize can be lower than the maximum thread count for a kernel (when the calculator determines that launching more threads wouldn't improve occupancy).

MichalMrozek commented 2 months ago

I see. Unfortunately, this function will not give you that information, as it strictly needs to obey the global work size passed as input.

ZE_extension_kernel_max_group_size_properties will tell you the max workgroup size that the kernel can support. You can then use it to compute a properly aligned global work size and, via bounds checking in the kernel, drop the unnecessary work items.

maleadt commented 2 months ago

Thank you.

use it to compute properly aligned global work size

By ensuring the group size is a multiple of ze_kernel_preferred_group_size_properties_t's preferredMultiple?

MichalMrozek commented 2 months ago

Your GWS must be a multiple of LWS. So for the 521 case, you may align GWS to 544 (32-element alignment), which you can execute with 17 workgroups of 32 work items each. You would need to do bounds checking in the kernel so that the global id is not greater than 520.