JackAKirk opened this issue 1 month ago
For Intel devices they seem to call SMs "Xe cores" or "sub-slices":
https://www.intel.com/content/www/us/en/docs/oneapi/programming-guide/2024-1/gpu-offload-flow.html
@JackAKirk, thanks for pointing this out. I agree that what we have now is confusing.
For CUDA, the documentation says the following:

> The total number of blocks launched cannot exceed the maximum number of blocks per multiprocessor as returned by `cudaOccupancyMaxActiveBlocksPerMultiprocessor` (or `cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFlags`) times the number of multiprocessors as specified by the device attribute `cudaDevAttrMultiProcessorCount`.
The maximum number of groups depends on the maximum number of groups per multiprocessor and the number of multiprocessors. I think it's possible to implement `urKernelSuggestMaxCooperativeGroupCountExp` for CUDA by returning the product of these two queries.
> The number of SMs can only be retrieved by querying the device the kernel is to be run on. This information (the device to be run on) is not passed to `urKernelSuggestMaxCooperativeGroupCountExp`, nor can it be inferred from any of the other parameters.
I understood that both query results are device-dependent. The call to `cudaDeviceGetAttribute` for the `cudaDevAttrMultiProcessorCount` attribute requires an explicit `device` parameter. The result of `cudaOccupancyMaxActiveBlocksPerMultiprocessor` should depend on the current context's device. Couldn't we use `cudaGetDevice` to get the device index and run both queries? I don't think this would require any change to the UR API, but I may not understand how the CUDA adapter works.
> if the semantics is the max number of blocks per SM, the documentation should be clarified IMO
I don't think these are the intended semantics. I'm not sure if L0 has any query for getting the occupancy information at that granularity.
> I understood that both query results are device-dependent. The call to `cudaDeviceGetAttribute` for the `cudaDevAttrMultiProcessorCount` attribute requires an explicit `device` parameter. The result of `cudaOccupancyMaxActiveBlocksPerMultiprocessor` should depend on the current context's device. Couldn't we use `cudaGetDevice` to get the device index and run both queries? I don't think this would require any change to the UR API, but I may not understand how the CUDA adapter works.
Sure, we could work out the device that was last used from the `CUcontext` that is currently set, but is that really the semantics of the query? This assumes that the kernel the user wants to execute is on the last set cuContext/cuDevice, but the user could (and it seems reasonable to expect that they generally will) choose to execute the kernel on a different device. It isn't how the corresponding query from the CUDA runtime would be used. This is why there would be a device argument for the user to provide: they are intentionally saying "if I execute this kernel on this device, what is the max block size for the device-wide sync to work?".

If you really want to, we could implement it as you suggest, but you'd have to make it clear that the query only makes sense if the program has executed such that the last set cuContext/cuDevice corresponds to the SYCL device on which they actually want to execute the kernel. They would then have to make sure that they do some kind of GPU operation immediately preceding this query that uses the desired GPU for the device-wide kernel sync. This seems extremely awkward and undesirable to me.
This documentation might be useful to you: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#grid-synchronization
> Sure we could work out the device that has been last used from the CUcontext that is currently set, but is this really the semantics of the query? This assumes that the kernel that the user wants to execute on is the last set cuContext/cuDevice, but the user could (and it seems reasonable to expect that they generally will) choose to execute the kernel on a different device? It isn't how the corresponding query from cuda runtime would be used.
How does the `cudaOccupancyMaxActiveBlocksPerMultiprocessor` query work without an explicit device parameter? Does it use the device from the current context, or is it device-independent?
> This is why there would be a device argument for the user to provide, because they are intentionally saying "if I execute on this device for this kernel then what is the max block size for the device wide sync to work".
I agree that this makes more sense, but I didn't think that's how `cudaOccupancyMaxActiveBlocksPerMultiprocessor` works in CUDA, since there's no device parameter.
> If you really want to we could implement it as you suggest, but you'd have to make it clear that the query will only make sense if they ensure that their program has executed such that the last set cuContext/cuDevice corresponds to the sycl device that they actually want to execute the kernel on. They would then have to make sure that they do some kind of gpu operation immediately preceding this that uses the desired gpu for the device wide kernel sync. This seems extremely awkward and non desirable to me.
I don't have a strong preference here. I agree that using the device from the current context is awkward for the user, but I'm not sure how to fix that. If `cudaOccupancyMaxActiveBlocksPerMultiprocessor` is using the device from the current context, it seems like the alternative would be to add an explicit device parameter to `urKernelSuggestMaxCooperativeGroupCountExp` and then set it as the current device before calling `cudaOccupancyMaxActiveBlocksPerMultiprocessor` in the function body.
> Sure we could work out the device that has been last used from the CUcontext that is currently set, but is this really the semantics of the query? This assumes that the kernel that the user wants to execute on is the last set cuContext/cuDevice, but the user could (and it seems reasonable to expect that they generally will) choose to execute the kernel on a different device? It isn't how the corresponding query from cuda runtime would be used.

> How does the `cudaOccupancyMaxActiveBlocksPerMultiprocessor` query work without an explicit device parameter? Does it use the device from the current context or is it device independent?
Good point: apparently it also uses the last set device: https://stackoverflow.com/questions/68982996/why-is-cudaoccupancymaxactiveblockspermultiprocessor-independent-of-device

Because SYCL doesn't expose a means of setting the device in a manner similar to the CUDA runtime, I think you are right that, irrespective of whether the semantics is "max blocks per SM" or "max blocks per device", we need to pass a device parameter to this interface.
> it seems like the alternative would be to add an explicit device parameter to `urKernelSuggestMaxCooperativeGroupCountExp` and then set it as the current device before calling `cudaOccupancyMaxActiveBlocksPerMultiprocessor` in the function body.
Yeah I think this is the way to go.
In the CUDA and HIP adapters a `ur_kernel_handle_t` is unique to a device, so you should be able to get all the info you need from the device using `Kernel->getProgram()->getDevice();`.
> In the CUDA and HIP adapters a `ur_kernel_handle_t` is unique to a device. So you should be able to get all the info you need from the device using `Kernel->getProgram()->getDevice();`
OK great, in that case the interface is fine as it is. Closing the issue.
CC @0x12CC @nrspruit @JackAKirk. I am reusing this issue, as it seems that we still need clarification on an appropriate mapping of `urKernelSuggestMaxCooperativeGroupCountExp` to CUDA.
Semantics at the moment (per device): `urKernelSuggestMaxCooperativeGroupCountExp` returns the max number of active blocks per device. In that case, for CUDA we ought to multiply the result of `cuOccupancyMaxActiveBlocksPerMultiprocessor` by the number of SMs.

For actual applications that use this functionality, such as CUTLASS (see the usage of SM occupancy in CUTLASS), this means that the user code (in this case DPC++ SYCL) would have to query the number of SMs again and divide the result of `urKernelSuggestMaxCooperativeGroupCountExp` by that number.
The CUDA interface really requires the calling code not only to pass the actual local (work-group) size and dynamic local (work-group) memory size that the kernel is going to be executed with, but also to return the result at compute unit (SM) granularity.

I don't want this to look like I am pointing out an issue; rather, I am trying to understand how useful the per-device semantics is, what applications are going to use it like this, and whether it is designed this way only because of a Level Zero constraint.
Proposed semantics (per compute unit): `urKernelSuggestMaxCooperativeGroupCountExp` returns the max number of active blocks per the fundamental hardware unit that processes work-groups, i.e. an SM for Nvidia CUDA and a Compute Unit for AMD; I am unsure whether it is an Xe core for Intel's current hardware.
However, if the second semantics proposal is impossible to implement for Level Zero, then we have to do something about solving the current issue. One way would be to try to generalise the current API (`urKernelSuggestMaxCooperativeGroupCountExp`) and switch on a granularity enum with values such as `OCCUPANCY_PER_COMPUTE_UNIT` and `OCCUPANCY_PER_DEVICE`, so that Level Zero can freely return `Unsupported` for `OCCUPANCY_PER_COMPUTE_UNIT`. That may be a little too much overengineering, though.
If the above sounds like too much overhead and an unnecessary complication of the interface, I think it will still be okay for us to handle the CUDA requirements correctly in the SYCL runtime wherever this is called for CUDA, by dividing the per-device granularity result by the number of SMs. I haven't experimented with that though, and I am not entirely confident regarding correctness/reliability. I am also inclined to implement it that way, which would be the least disruptive change, as the current interface stays the same. So the `ur*` query remains with the current per-device semantics, and we calculate per-SM inside `ext_oneapi_get_info` in DPC++.
Another suggestion would be to have a different entry-point extension for per-compute-unit granularity and leave `urKernelSuggestMaxCooperativeGroupCountExp` as is. The new extension would be implemented only for the backends that support it, which at the moment are CUDA and HIP, starting with CUDA initially.
The term `Compute Unit` seems to be more OpenCL- and AMD-focused terminology, but it is also the wording used across Unified Runtime and DPC++. There are certain Intel-specific device info properties that say `per_eu` (as in per `Execution Unit`) instead of `per_cu`, but since they are marked as Intel-specific at the moment, I think this divergence in terminology is fine.
~~The second point I wanted to raise (currently very important for enabling CUTLASS applications with DPC++) is that CUDA-targeted applications require the mapping to `cuOccupancyMaxActiveBlocksPerMultiprocessorWithFlags` with a non-default flag (for example, to disable caching of globals, related to Nvidia's unified L1/texture cache, for the query), taken from a real use case in CUTLASS.~~
~~This means that if we generalise the current `urKernelSuggestMaxCooperativeGroupCountExp` across all backends, it should take a `flags` parameter that can be ignored, considering that not every backend API has a mapping for it (CUDA and HIP do, though HIP actually still ignores the flags). This may require a UR enum type to define a set of meaningful flags across APIs, starting with what CUDA and HIP have at the moment.~~
Looks like the CUTLASS porting team has just found out that newer interfaces of the library do not require the `WithFlags` version of the API, hence that point is irrelevant for now.
A question for Level Zero developers: has the Level Zero driver API for this function taken into account the use cases of this functionality (e.g. ones similar to the uses of `sm_occupancy` in CUTLASS)? Does it support them as much as Level Zero devices are inherently capable of supporting them?
@0x12CC @nrspruit I have updated my comment now, after recent findings invalidated one part of it, but the semantics of the query (whether it is per compute unit or per device) is the important matter; the other point was trivial. I would appreciate feedback. Thanks!
After thinking about this more, this may be a different use case, to be honest. It is correct that the maximum cooperative group count is device-wide, which for CUDA means per-SM occupancy multiplied by the number of SMs, so retrieving the max actively executing group count per SM would have to be implemented separately from this API. However, this API can aid that implementation by giving us the total number, which we just have to divide by the number of SMs for CUDA to get the SM occupancy.
CC @0x12CC @nrspruit
In the discussion from here: https://github.com/oneapi-src/unified-runtime/pull/1246#issuecomment-1894446658 it was described that `urKernelSuggestMaxCooperativeGroupCountExp` maps to `cudaOccupancyMaxActiveBlocksPerMultiprocessor`, which takes a kernel and other params and returns the maximum number of blocks that can be simultaneously executed on a streaming multiprocessor (SM). However, I found this in the L0 documentation:
> "Use zeKernelSuggestMaxCooperativeGroupCount to recommend max group count for device for cooperative functions that device supports."
The word "device" implies that the semantics of `urKernelSuggestMaxCooperativeGroupCountExp` is the maximum number of blocks that can be simultaneously executed on a device. A device consists of multiple streaming multiprocessors, so in that case you need to multiply the max number of blocks that can be simultaneously executed on an SM by the number of SMs in a device.

The number of SMs can only be retrieved by querying the device the kernel is to be run on. This information (the device to be run on) is not passed to `urKernelSuggestMaxCooperativeGroupCountExp`, nor can it be inferred from any of the other parameters. Therefore, there are two possibilities: