kokkos / kokkos-tools

Kokkos C++ Performance Portability Programming Ecosystem: Profiling and Debugging Tools

Creating tool to extract kernel launch configuration (block, grid, launch mechanism, ...) #238

Open maartenarnst opened 8 months ago

maartenarnst commented 8 months ago

It can be of interest to developers of Kokkos applications to have some insight into the configuration that Kokkos uses to launch kernels (block, grid, launch mechanism, ...).

However, Kokkos currently determines such a launch configuration typically inside the body of an execute() function, so it cannot be accessed directly. To inspect launch configurations, developers therefore seem to be led to tools like ncu and rocprof. Another option (not sustainable) is to copy-paste pieces of the execute() bodies into custom functions.

One option may be to extract this functionality from the execute() functions in Kokkos and put it into dedicated functions that could become part of the Kokkos API. However, implementing the launch configurations inside the bodies of execute() functions, and thus choosing not to expose them, may have been a deliberate design decision in Kokkos (?).

Hence the question: how best to proceed to make it possible to extract launch configuration properties?

If it's not an option to expose them in Kokkos itself, it would be interesting to explore whether gaining insight into launch configurations could become part of Kokkos Tools, i.e., whether it would be of interest to define new callbacks that provide the launch configuration and to develop a new Kokkos Tools connector to collect such information.

@romintomasetti

@dalg24, @masterleinad, @vlkale

vlkale commented 8 months ago

@maartenarnst

Thanks for this. It is a good point.

I don't know how easy it would be to make modifications to Kokkos core and extract out launch configuration code from the execute() function.

I think the solution should involve a new Kokkos Tools callback. I haven't sketched it out in detail but you would need to make changes in profiling/all/ to add this new callback.
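As a rough sketch of what the tool side of such a callback could look like: the struct `LaunchConfigSketch` and the callback name `kokkosp_sketch_launch_config` below are hypothetical and are not part of the current Kokkos Tools interface. They only illustrate one possible shape, assuming Core would hand the tool a kernel ID together with the configuration it chose.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// HYPOTHETICAL: this struct does not exist in Kokkos or Kokkos Tools.
// It only sketches the kind of data a launch-configuration callback could carry.
struct LaunchConfigSketch {
  uint32_t grid[3];       // number of blocks per dimension
  uint32_t block[3];      // number of threads per block per dimension
  uint32_t shmem_bytes;   // dynamic shared memory requested for the launch
  const char* mechanism;  // free-form description of the launch mechanism
};

// What this sketch of a connector stores for every reported launch.
struct RecordedLaunch {
  uint64_t kernel_id;
  LaunchConfigSketch config;
  std::string mechanism;  // owned copy of the mechanism string
};

static std::vector<RecordedLaunch> recorded_launches;

// HYPOTHETICAL callback: the name and signature are assumptions, chosen to
// resemble the existing kokkosp_* callbacks that take a kernel ID.
extern "C" void kokkosp_sketch_launch_config(const uint64_t kID,
                                             const LaunchConfigSketch* config) {
  assert(config != nullptr);
  RecordedLaunch rec;
  rec.kernel_id = kID;
  rec.config    = *config;
  rec.mechanism = config->mechanism ? config->mechanism : "unknown";
  rec.config.mechanism = nullptr;  // do not keep a pointer the tool does not own
  recorded_launches.push_back(rec);
}
```

A connector along these lines could dump `recorded_launches` at finalization, in the same way existing connectors report their timings.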

cwpearson commented 8 months ago

@maartenarnst what do tools like ncu and rocprof need as inputs to extract this information (e.g. a pointer to the kernel function)? If it's runtime information like that, my first thought would be that Kokkos should pass that information to Tools through an appropriate interface and then Tools can use it as needed.

I guess one issue I see is that Core decides things like the launch mechanism and parameters before actually launching the kernel, so we'd have to resolve how to give Tools enough information to correlate that with the following kernel launch and plumb that through Core.
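One way the correlation could work is through the kernel ID that a tool already hands back in `kokkosp_begin_parallel_for`. The sketch below assumes that this ID is available at the point where Core determines the configuration. The callback `kokkosp_sketch_correlated_config` is hypothetical; only the begin/end callbacks are existing Kokkos Tools hooks.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Everything the tool knows about one kernel launch.
struct KernelRecord {
  std::string label;
  bool has_config  = false;
  uint32_t grid_x  = 0;
  uint32_t block_x = 0;
  bool finished    = false;
};

static uint64_t next_kernel_id = 0;
static std::map<uint64_t, KernelRecord> kernels;

// Existing Kokkos Tools hook: the tool assigns the kernel ID here.
extern "C" void kokkosp_begin_parallel_for(const char* name,
                                           const uint32_t /*devID*/,
                                           uint64_t* kID) {
  assert(kID != nullptr);
  *kID = next_kernel_id++;
  kernels[*kID].label = name ? name : "(unnamed)";
}

// HYPOTHETICAL hook: Core would call this once it has decided the launch
// configuration, passing back the ID obtained from the begin callback.
extern "C" void kokkosp_sketch_correlated_config(const uint64_t kID,
                                                 const uint32_t grid_x,
                                                 const uint32_t block_x) {
  auto it = kernels.find(kID);
  if (it == kernels.end()) return;  // configuration for an unknown kernel
  it->second.has_config = true;
  it->second.grid_x     = grid_x;
  it->second.block_x    = block_x;
}

// Existing Kokkos Tools hook: the launch with this ID has completed.
extern "C" void kokkosp_end_parallel_for(const uint64_t kID) {
  auto it = kernels.find(kID);
  if (it != kernels.end()) it->second.finished = true;
}
```

Keying everything on the kernel ID keeps the label, the configuration and the completion event attached to the same launch, even if several kernels are in flight.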

If it's static information, perhaps it can be integrated with the PR @dalg24 referenced above.

romintomasetti commented 7 months ago

With @maartenarnst, we think there is a bigger picture question we should answer before we go on.

What should Kokkos Tools be able to do?

It seems that backend-specific information like launch grid, scratch size and so on can always be extracted using the backend-vendor tools (e.g. ncu for CUDA or rocprof for HIP).

So one question we have is:

What should Kokkos Tools be able to provide? Should it also be able to provide information that Kokkos has (e.g. grid size) but that can be extracted using vendor tools?

In other words:

What is the scope of Kokkos Tools? Should it collect backend-specific information that backend tools can already provide?

In other words:

Is Kokkos Tools a drop-in replacement (e.g. for easy and direct access to kernel info in preliminary benchmark studies), or just a substitute for "missing" features of vendor tools?

For instance, it seems the functor size is not easy to retrieve with ncu (because ncu only "sees" the driver), so it would make sense to provide it with Kokkos Tools. But the launch grid is easy to retrieve with vendor tools, so is it in the scope of Kokkos Tools to provide such details?

@crtrott @dalg24 @maartenarnst @masterleinad @cwpearson @vlkale

vlkale commented 7 months ago

@romintomasetti @all

tl;dr to answer @romintomasetti's question: I also think launch grid configuration is in scope, assuming a Kokkos user can gain some insight from an apples-to-apples comparison of launch configurations across different vendor tools.

I elaborate below, though we may want to move this elaboration to another Kokkos Tools GitHub issue:


You can take a look at the Kokkos Tools documentation README.md and the wiki for the scope and purpose of Kokkos Tools, but let me summarize and target it in the context of your question:

Consider the problem of Kokkos function name demangling that one would have without Kokkos Tools. The problem is not (just) that reading the function name is hard for a Kokkos user running on one particular backend. I think the more fundamental problem is portable tooling: how does one compare timings of a particular Kokkos::parallel_for run on an AMD GPU (with the HIP backend) with those on an NVIDIA GPU (with the CUDA backend)? Kokkos Tools provides a portable, apples-to-apples comparison of a labeled Kokkos kernel across the two different vendor GPUs. Otherwise, the Kokkos programmer has to spend time doing such a comparison on their own (note how this directly corresponds to the effort of programming and maintaining separate CUDA and HIP code that they would face without Kokkos).
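To make the label-based comparison concrete, here is a minimal sketch of a timing connector built on the existing `kokkosp_begin_parallel_for`, `kokkosp_end_parallel_for` and `kokkosp_finalize_library` callbacks. The bookkeeping types (`InFlight`, `Totals`) are illustrative names, not taken from an existing connector. The sketch measures host-side wall-clock time between the two callbacks, which only reflects kernel time if the launch is synchronous or fenced.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <map>
#include <string>

using Clock = std::chrono::steady_clock;

// A kernel that has begun but not yet ended.
struct InFlight {
  std::string label;
  Clock::time_point start;
};

// Accumulated statistics for one kernel label.
struct Totals {
  uint64_t calls = 0;
  double seconds = 0.0;
};

static uint64_t next_kernel_id = 0;
static std::map<uint64_t, InFlight> in_flight;
static std::map<std::string, Totals> totals_by_label;

extern "C" void kokkosp_begin_parallel_for(const char* name,
                                           const uint32_t /*devID*/,
                                           uint64_t* kID) {
  assert(kID != nullptr);
  *kID = next_kernel_id++;
  in_flight[*kID] = {name ? name : "(unnamed)", Clock::now()};
}

extern "C" void kokkosp_end_parallel_for(const uint64_t kID) {
  const Clock::time_point stop = Clock::now();
  auto it = in_flight.find(kID);
  if (it == in_flight.end()) return;  // end without a matching begin
  Totals& t = totals_by_label[it->second.label];
  t.calls += 1;
  t.seconds += std::chrono::duration<double>(stop - it->second.start).count();
  in_flight.erase(it);
}

// Report per-label totals; the label is the same on every backend, which is
// what makes the output comparable between, e.g., a CUDA and a HIP run.
extern "C" void kokkosp_finalize_library() {
  for (const auto& entry : totals_by_label) {
    std::printf("%s: %llu calls, %.6f s\n", entry.first.c_str(),
                static_cast<unsigned long long>(entry.second.calls),
                entry.second.seconds);
  }
}
```

Because the report is keyed on the user-provided label rather than on a mangled, backend-specific kernel name, the output of two runs on different backends can be compared line by line.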

So, to answer the question: I think launch grid configuration is in scope, assuming a Kokkos user can gain some insight from an apples-to-apples comparison of launch configurations across different vendor tools. More generally, any tooling for Kokkos programs is in scope for Kokkos Tools if it has meaning across different Kokkos backends.