oneapi-src / unified-runtime

https://oneapi-src.github.io/unified-runtime/

Clarify semantics of `urKernelSuggestMaxCooperativeGroupCountExp` #1687

Open JackAKirk opened 1 month ago

JackAKirk commented 1 month ago

CC @0x12CC @nrspruit

In the discussion from here: https://github.com/oneapi-src/unified-runtime/pull/1246#issuecomment-1894446658

it was described that urKernelSuggestMaxCooperativeGroupCountExp maps to cudaOccupancyMaxActiveBlocksPerMultiprocessor, which takes a kernel and other parameters and returns the maximum number of blocks that can be simultaneously executed on a streaming multiprocessor (SM).

However, I found this in the L0 documentation:

"Use zeKernelSuggestMaxCooperativeGroupCount to recommend max group count for device for cooperative functions that device supports."

The "device" word implies that the semantics of of urKernelSuggestMaxCooperativeGroupCountExp is the maximum number of blocks that can be simultaneously executed in a device. A device consists of multiple streaming multiprocessors. In such a case you need to multiply the max number of blocks that can be simultanously executed in a SM by the number of SMs in a device.

The number of SMs can only be retrieved by querying the device the kernel is to be run on. This information (the device to be run on) is not passed to urKernelSuggestMaxCooperativeGroupCountExp, nor can it be inferred from any of the other parameters. Therefore, there are two possibilities:

JackAKirk commented 1 month ago

For Intel devices, they seem to call SMs "Xe cores" or "sub-slices":

https://www.intel.com/content/www/us/en/docs/oneapi/programming-guide/2024-1/gpu-offload-flow.html

0x12CC commented 1 month ago

@JackAKirk, thanks for pointing this out. I agree that what we have now is confusing.

For CUDA, the documentation says the following:

The total number of blocks launched cannot exceed the maximum number of blocks per multiprocessor as returned by cudaOccupancyMaxActiveBlocksPerMultiprocessor (or cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFlags) times the number of multiprocessors as specified by the device attribute cudaDevAttrMultiProcessorCount.

The maximum number of groups depends on the maximum number of groups per multiprocessor and the number of multiprocessors. I think it's possible to implement urKernelSuggestMaxCooperativeGroupCountExp for CUDA by returning the product of these two queries.

The number of SMs can only be retrieved by querying the device the kernel is to be run on. This information (the device to be run on) is not passed to urKernelSuggestMaxCooperativeGroupCountExp, nor can it be inferred from any of the other parameters.

I understood that both query results are device-dependent. The call to cudaDeviceGetAttribute for the cudaDevAttrMultiProcessorCount attribute requires an explicit device parameter. The result of cudaOccupancyMaxActiveBlocksPerMultiprocessor should depend on the current context's device. Couldn't we use cudaGetDevice to get the device index and run both queries? I don't think this would require any change to the UR API, but I may not understand how the CUDA adapter works.
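
For illustration, here is a minimal sketch of that approach using the CUDA runtime API; the kernel pointer, block size, and dynamic shared-memory size are placeholders for the real launch configuration:

```cpp
#include <cuda_runtime.h>

// Sketch: device-wide maximum cooperative group count computed as
// (max active blocks per SM) * (number of SMs), using the current device.
cudaError_t suggestMaxCooperativeGroupCount(const void *kernel, int blockSize,
                                            size_t dynSmemBytes,
                                            int *groupCount) {
  // Device index from the current context.
  int device = 0;
  cudaError_t err = cudaGetDevice(&device);
  if (err != cudaSuccess)
    return err;

  // Max simultaneously resident blocks of this kernel per SM.
  int blocksPerSm = 0;
  err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSm, kernel,
                                                      blockSize, dynSmemBytes);
  if (err != cudaSuccess)
    return err;

  // Number of SMs on that device.
  int numSms = 0;
  err = cudaDeviceGetAttribute(&numSms, cudaDevAttrMultiProcessorCount, device);
  if (err != cudaSuccess)
    return err;

  *groupCount = blocksPerSm * numSms;
  return cudaSuccess;
}
```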

if the semantics is the max number of blocks per SM, the documentation should be clarified IMO

I don't think these are the intended semantics. I'm not sure if L0 has any query for getting the occupancy information at that granularity.

JackAKirk commented 1 month ago

I understood that both query results are device-dependent. The call to cudaDeviceGetAttribute for the cudaDevAttrMultiProcessorCount attribute requires an explicit device parameter. The result of cudaOccupancyMaxActiveBlocksPerMultiprocessor should depend on the current context's device. Couldn't we use cudaGetDevice to get the device index and run both queries? I don't think this would require any change to the UR API, but I may not understand how the CUDA adapter works.

Sure, we could work out the device that was last used from the CUcontext that is currently set, but is this really the semantics of the query? This assumes that the device the user wants to execute the kernel on is the last-set cuContext/cuDevice, but the user could (and it seems reasonable to expect that they generally will) choose to execute the kernel on a different device. It isn't how the corresponding query from the CUDA runtime would be used. This is why there would be a device argument for the user to provide: they are intentionally saying "if I execute this kernel on this device, what is the max number of blocks for the device-wide sync to work?".

If you really want to, we could implement it as you suggest, but you'd have to make it clear that the query only makes sense if the program has executed such that the last-set cuContext/cuDevice corresponds to the SYCL device on which the user actually wants to execute the kernel. They would then have to make sure that some kind of GPU operation that uses the desired GPU immediately precedes the query for the device-wide kernel sync. This seems extremely awkward and undesirable to me.

JackAKirk commented 1 month ago

This documentation might be useful to you: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#grid-synchronization

0x12CC commented 3 weeks ago

Sure, we could work out the device that was last used from the CUcontext that is currently set, but is this really the semantics of the query? This assumes that the device the user wants to execute the kernel on is the last-set cuContext/cuDevice, but the user could (and it seems reasonable to expect that they generally will) choose to execute the kernel on a different device. It isn't how the corresponding query from the CUDA runtime would be used.

How does the cudaOccupancyMaxActiveBlocksPerMultiprocessor query work without an explicit device parameter? Does it use the device from the current context or is it device independent?

This is why there would be a device argument for the user to provide: they are intentionally saying "if I execute this kernel on this device, what is the max number of blocks for the device-wide sync to work?".

I agree that this makes more sense, but I didn't think that's how cudaOccupancyMaxActiveBlocksPerMultiprocessor works in CUDA since there's no device parameter.

If you really want to, we could implement it as you suggest, but you'd have to make it clear that the query only makes sense if the program has executed such that the last-set cuContext/cuDevice corresponds to the SYCL device on which the user actually wants to execute the kernel. They would then have to make sure that some kind of GPU operation that uses the desired GPU immediately precedes the query for the device-wide kernel sync. This seems extremely awkward and undesirable to me.

I don't have a strong preference here. I agree that using the device from the current context is awkward for the user, but I'm not sure how to fix that. If cudaOccupancyMaxActiveBlocksPerMultiprocessor is using the device from the current context, it seems like the alternative would be to add an explicit device parameter to urKernelSuggestMaxCooperativeGroupCountExp and then set it as the current device before calling cudaOccupancyMaxActiveBlocksPerMultiprocessor in the function body.
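
As a minimal sketch of that alternative with the CUDA runtime API (kernel, blockSize, and dynSmemBytes are placeholders, and error handling is omitted):

```cpp
#include <cuda_runtime.h>

// Sketch: make the explicitly passed device current for the occupancy query,
// then restore whichever device was current before.
int queryBlocksPerSmOnDevice(int device, const void *kernel, int blockSize,
                             size_t dynSmemBytes) {
  int prevDevice = 0;
  cudaGetDevice(&prevDevice);
  cudaSetDevice(device);

  int blocksPerSm = 0;
  cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSm, kernel,
                                                blockSize, dynSmemBytes);

  cudaSetDevice(prevDevice);
  return blocksPerSm;
}
```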

JackAKirk commented 3 weeks ago

Sure, we could work out the device that was last used from the CUcontext that is currently set, but is this really the semantics of the query? This assumes that the device the user wants to execute the kernel on is the last-set cuContext/cuDevice, but the user could (and it seems reasonable to expect that they generally will) choose to execute the kernel on a different device. It isn't how the corresponding query from the CUDA runtime would be used.

How does the cudaOccupancyMaxActiveBlocksPerMultiprocessor query work without an explicit device parameter? Does it use the device from the current context or is it device independent?

Good point: apparently it also uses the last set device: https://stackoverflow.com/questions/68982996/why-is-cudaoccupancymaxactiveblockspermultiprocessor-independent-of-device

Because SYCL doesn't expose a means of setting the device in a manner similar to the CUDA runtime, I think you are right that, irrespective of whether the semantics is "max blocks per SM" or "max blocks per device", we need to pass a device parameter to this interface.

it seems like the alternative would be to add an explicit device parameter to urKernelSuggestMaxCooperativeGroupCountExp and then set it as the current device before calling cudaOccupancyMaxActiveBlocksPerMultiprocessor in the function body.

Yeah I think this is the way to go.

hdelan commented 2 weeks ago

In the CUDA and HIP adapters a ur_kernel_handle_t is unique to a device. So you should be able to get all the info you need from the device using

Kernel->getProgram()->getDevice();
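
If that is the case, a rough sketch of how the CUDA adapter could compute the device-wide count from the kernel's own device might look like the following; the accessors hKernel->get() (CUfunction) and ->get() on the device handle (CUdevice) are assumed names for illustration, and error handling is omitted:

```cpp
#include <cuda.h>
#include <cstdint>

// Sketch only (not the actual adapter code): everything is derived from the
// kernel's own device, so no extra device parameter is needed in the UR API.
static uint32_t suggestMaxCooperativeGroupCount(ur_kernel_handle_t hKernel,
                                                int localWorkSize,
                                                size_t dynamicSharedMemSize) {
  CUdevice Device = hKernel->getProgram()->getDevice()->get();

  // Max simultaneously resident blocks of this kernel per SM.
  int BlocksPerSm = 0;
  cuOccupancyMaxActiveBlocksPerMultiprocessor(&BlocksPerSm, hKernel->get(),
                                              localWorkSize,
                                              dynamicSharedMemSize);

  // Number of SMs on the kernel's device.
  int NumSms = 0;
  cuDeviceGetAttribute(&NumSms, CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT,
                       Device);

  // Device-wide count = per-SM occupancy * number of SMs.
  return static_cast<uint32_t>(BlocksPerSm) * static_cast<uint32_t>(NumSms);
}
```
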
JackAKirk commented 2 weeks ago

In the CUDA and HIP adapters a ur_kernel_handle_t is unique to a device. So you should be able to get all the info you need from the device using

Kernel->getProgram()->getDevice();

OK great, in that case the interface is fine as it is. Closing the issue.

GeorgeWeb commented 1 week ago

CC @0x12CC @nrspruit @JackAKirk I am reusing this issue as it seems that we still need to clarify the appropriate mapping to CUDA.

Result granularity of urKernelSuggestMaxCooperativeGroupCountExp

  1. Semantics at the moment (per device): urKernelSuggestMaxCooperativeGroupCountExp returns the max number of active blocks per device. In that case, for CUDA we ought to multiply the result of cuOccupancyMaxActiveBlocksPerMultiprocessor by the number of SMs. For actual applications that use this functionality, such as CUTLASS (see the usage of SM occupancy in CUTLASS), this means that the user code (in this case DPC++ SYCL) has to query the number of SMs again and divide the result of urKernelSuggestMaxCooperativeGroupCountExp by that number. The CUDA interface really needs the code calling the API not only to pass the actual local (work-group) size and dynamic local (work-group) memory size that the kernel is going to be executed with, but also to get the result back at compute-unit (SM) granularity. I don't want this to look like I am pointing out an issue; rather, I am trying to understand how useful the per-device semantics is, what applications are going to use it like this, and whether it is designed this way just because of a Level Zero constraint.

  2. Proposed semantics (per compute unit): urKernelSuggestMaxCooperativeGroupCountExp returns the max number of active blocks per the fundamental hardware unit that processes work-groups, i.e. an SM for NVIDIA CUDA and a Compute Unit for AMD; I am unsure whether it is an Xe core for Intel's current hardware.

However, if the 2nd semantic proposal is impossible to implement for Level Zero, then we have to do something about solving the current issue. One way would be to try to generalise the current API (urKernelSuggestMaxCooperativeGroupCountExp) and switch on a granularity enum with values such as OCCUPANCY_PER_COMPUTE_UNIT and OCCUPANCY_PER_DEVICE, so that Level Zero can freely return Unsupported for OCCUPANCY_PER_COMPUTE_UNIT. That may be a little too much overengineering though.
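
Purely to illustrate that option (all names below are hypothetical, not an actual UR proposal):

```cpp
// Hypothetical sketch only: a granularity selector for the occupancy query.
typedef enum ur_exp_occupancy_granularity_t {
  UR_EXP_OCCUPANCY_GRANULARITY_PER_COMPUTE_UNIT = 0,
  UR_EXP_OCCUPANCY_GRANULARITY_PER_DEVICE = 1,
} ur_exp_occupancy_granularity_t;

// A backend with no per-compute-unit query (e.g. Level Zero, if none exists)
// could return UR_RESULT_ERROR_UNSUPPORTED_FEATURE when asked for
// UR_EXP_OCCUPANCY_GRANULARITY_PER_COMPUTE_UNIT.
```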

If the above sounds like too much overhead and an unnecessary complication of the interface, I think it will still be okay for us to handle the CUDA requirements correctly in the SYCL runtime, wherever this is called for CUDA, by dividing the per-device granularity result by the number of SMs. I haven't experimented with that though, and I am not entirely confident regarding correctness/reliability. I am also inclined to implement it that way, which would be the least disruptive change as the current interface stays the same. So the ur* query keeps the current per-device semantics and we calculate per-SM inside ext_oneapi_get_info from DPC++.
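
As a rough illustration of that division on the SYCL side, assuming the per-device value comes from the existing UR query and that max_compute_units corresponds to the SM count on the CUDA backend:

```cpp
#include <sycl/sycl.hpp>
#include <cstdint>

// Sketch only: convert the per-device group count returned by
// urKernelSuggestMaxCooperativeGroupCountExp into a per-SM occupancy figure.
uint32_t groupsPerComputeUnit(const sycl::device &dev,
                              uint32_t perDeviceGroupCount) {
  // Assumes max_compute_units maps to the number of SMs on CUDA devices.
  uint32_t numComputeUnits =
      dev.get_info<sycl::info::device::max_compute_units>();
  return perDeviceGroupCount / numComputeUnits;
}
```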

Another suggestion would be to have a different entry-point extension for per-compute-unit granularity and leave urKernelSuggestMaxCooperativeGroupCountExp as is. The new extension would be implemented only for the backends that support it, which at the moment are CUDA and HIP, starting with CUDA initially.

Update:

The term Compute Unit seems to be more OpenCL- and AMD-focused terminology, but it is also the wording used across Unified Runtime and DPC++. There are certain Intel-specific device info properties that say per_eu (as in per Execution Unit) instead of per_cu, but since those are marked as Intel-specific at the moment, I think this divergence in terminology is fine.


~Flags parameter for urKernelSuggestMaxCooperativeGroupCountExp~

~The second point I wanted to raise (currently very important for enabling CUTLASS applications with DPC++) is that CUDA-targeted applications need the mapping to cuOccupancyMaxActiveBlocksPerMultiprocessorWithFlags with a non-default flag (for example, to disable caching of globals, related to Nvidia's Unified L1/Texture Cache, for the query), taken from a real use case in CUTLASS.~ ~This means that if we generalise the current urKernelSuggestMaxCooperativeGroupCountExp across all backends, it should take a flags parameter that can be ignored, considering not every backend API has a mapping for it (CUDA and HIP do, though HIP actually still ignores the flags). This may require a UR enum type defining a set of meaningful flags across APIs, starting with what CUDA and HIP have at the moment.~

Update:

Looks like the CUTLASS porting team has just found out that newer interfaces of the library do not require the WithFlags version of the API, hence that point is irrelevant for now.

JackAKirk commented 1 week ago

A question for the L0 developers is whether the L0 driver API for this function has taken into account the use cases of this functionality (e.g. use cases similar to the ones using sm_occupancy in CUTLASS). Does it support them as fully as L0 devices are inherently capable of supporting them?

GeorgeWeb commented 1 week ago

@0x12CC @nrspruit I have updated my comment now that recent findings invalidated one part of it, but the semantics of the query, whether it is per compute unit or per device, is the important matter; the other point was a trivial one. I would appreciate feedback. Thanks!

After thinking about this more, this may be a different use case, to be honest. It is correct that the maximum cooperative group count is device-wide, which for CUDA means the per-SM occupancy multiplied by the number of SMs, so it is valid that retrieving the max actively executing group count per SM would have to be implemented separately from this API. However, this API can aid that implementation by giving us the total number, which we just have to divide by the number of SMs on CUDA to get the SM occupancy.