silvasean opened 1 year ago
Maybe we want to roll this up into https://github.com/openxla/iree/issues/13145 @ScottTodd -- but this has immediate applicability to my workflow, so it would be nice to prioritize.
cc @pjannaty if your team has any availability to work on this.
@ScottTodd can you take a look and add some notes on steps to do this? As Sean mentioned, @pjannaty might be able to find someone to take it on.
The PJRT plugin owns the top level execution timeline and might be the right place to add annotations. They will overlap.
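A minimal sketch of that idea, assuming the plugin can see the module name at execute time (`pjrt_execute_module` is a placeholder, not the real PJRT plugin API; only the NVTX v3 calls are real):

```c
// Hypothetical sketch: wrap each top-level PJRT execution in an NVTX range
// named after the module, so steps show up as separable bars in nsys.
#include <nvtx3/nvToolsExt.h>

extern void pjrt_execute_module(void);  // placeholder for the real call

void traced_execute(const char* module_name) {
  nvtxRangePushA(module_name);  // e.g. "pjit__wrapped_step_fn"
  pjrt_execute_module();
  nvtxRangePop();
}
```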
I'm often reproducing something with iree-benchmark-module, so having annotations there is definitely a requirement (and would be my immediate feature request) even if we have PJRT-level annotations.
If it's coming from the module name that seems reasonable.
Also, in XLA:GPU these annotations can be used for host/device-side correlation. Here you can see the gray boxes on host (bottom, 2x) and device (top -- device work is much larger and overflows the screen). This is a level of detail that would need to be handled in IREE proper (though I don't have an immediate need for that; it may be relevant in some scenarios).
I understand the desire for information like this but it's not trivial to do in IREE (or anything like IREE) and asking for "what XLA does" is not really going to help us make progress and is why this stuff doesn't exist yet. Instead of proposing solutions based on XLA it'd be useful to identify the signal and square that with what we can implement. It will not look like XLA and we won't be able to make progress so long as "like XLA" is the request.
> I'm often reproducing something with iree-benchmark-module, so having annotations there is definitely a requirement
This doesn't make sense to me - iree-benchmark-module runs single invocations and you should know what step you're running as you provided it to the tool. What's your actual workflow here?
As a first pass, you could see where we call `IREE_CUDA_TRACE_*` functions in places like https://github.com/openxla/iree/blob/dd977b1eb046cfb0946f58a7923946d74d5aaa2d/runtime/src/iree/hal/drivers/cuda/stream_command_buffer.c#L468-L513 and then also insert NVTX instrumentation calls alongside the `IREE_TRACE` (Tracy) instrumentation calls.
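For illustration, a standalone sketch of the NVTX v3 calls that could sit alongside those zones on the thread recording the command buffer (the surrounding function is illustrative, not IREE's actual command buffer code; only `nvtxRangePushA`/`nvtxRangePop` are real API):

```c
// Sketch: a named NVTX range around a dispatch, mirroring the existing
// IREE_TRACE / IREE_CUDA_TRACE zones.
#include <nvtx3/nvToolsExt.h>

void record_dispatch(const char* kernel_name) {
  nvtxRangePushA(kernel_name);  // opens a named range on this thread
  /* ... cuLaunchKernel / command buffer recording happens here ... */
  nvtxRangePop();               // closes the innermost open range
}
```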
If you've already run Tracy and are looking for information at a different level, you'll need to find some way to annotate that before going through the compiler (i.e. before fusions and other compiler optimizations) and then plumb that data through in some way.
Yeah, if what's visible in tracy is sufficient then it's on the list to allow for non-tracy `IREE_TRACE*` backends and we could have one that used NVTX's dynamic loading shim (https://github.com/NVIDIA/NVTX/tree/release-v3/c) - you'd set a cmake flag to switch to using NVTX instead of tracy and then get all the same information that we plumb through to tracy.
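To make that concrete, a hedged sketch of what such a backend could look like, assuming zone macros in the style of IREE's Tracy backend (the macro names and mapping are illustrative, not the actual IREE tracing API):

```c
// Hypothetical NVTX backend for zone-style trace macros. A real
// implementation would live behind the CMake flag mentioned above.
#include <nvtx3/nvToolsExt.h>

#define IREE_TRACE_ZONE_BEGIN_NAMED(zone_var, name_literal) \
  int zone_var = nvtxRangePushA(name_literal)
#define IREE_TRACE_ZONE_END(zone_var) \
  ((void)(zone_var), (void)nvtxRangePop())
```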
> I understand the desire for information like this but it's not trivial to do in IREE (or anything like IREE) and asking for "what XLA does" is not really going to help us make progress and is why this stuff doesn't exist yet. Instead of proposing solutions based on XLA it'd be useful to identify the signal and square that with what we can implement. It will not look like XLA and we won't be able to make progress so long as "like XLA" is the request.
I think the definite ask is some level of correlation with the nesting in the user-program invoking iree. The most trivial such nesting is a single module invocation at a time. Ideally we would have full call stack/flame graphs like some other tools provide (this would make my workflow integer factors more efficient). In full generality (out-of-order execution, etc.) this is hard, and some sort of tradeoff will be needed between interpretability/usability and generality.
> > I'm often reproducing something with iree-benchmark-module, so having annotations there is definitely a requirement
>
> This doesn't make sense to me - iree-benchmark-module runs single invocations and you should know what step you're running as you provided it to the tool. What's your actual workflow here?
iree-benchmark-module will run the module N times, and the runs can blur together. Here is an example:
By looking at the repetitive patterns in the per-kernel breakdown it is easy to eyeball the period time and notice that the first iteration is affected by startup overhead. However, pinpointing the exact start/end is very difficult. For example, zooming into the last 2-3 iterations, it's clear that the actual step start/end lies somewhere in the middle of a bunch of small dispatches that are hard to pick apart.
Knowing the exact start/end is important for piecing together the overall timeline and correlating back to the source code. If https://github.com/openxla/iree/issues/13145 gives us the user-level callstack/flame graph view then this is easy (and solves a lot of the other problems as well).
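A hedged sketch of the requested behavior: if the benchmark loop wrapped each repetition in an NVTX range, the step boundaries would appear as top-level bars in the nsys timeline (`run_one_invocation` is a stand-in for the actual module invocation, not a real tool API):

```c
// Illustrative only: per-repetition NVTX ranges around a module invocation,
// so each benchmark step is delimited in the trace.
#include <stdio.h>
#include <nvtx3/nvToolsExt.h>

extern void run_one_invocation(void);  // placeholder for the real invocation

void benchmark_loop(int repetitions) {
  for (int i = 0; i < repetitions; ++i) {
    char label[64];
    snprintf(label, sizeof(label), "step %d", i);
    nvtxRangePushA(label);  // one labeled box per step in the timeline
    run_one_invocation();
    nvtxRangePop();
  }
}
```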
> I think the definite ask is some level of correlation with the nesting in the user-program invoking iree.
Have you tried what was added in https://github.com/openxla/iree/pull/13500? The information is diluted (only the first source location in a fused source location is plumbed through), but that lets you jump from dispatches (`cuLaunchKernel`) to Python source code as annotated by your ML framework / importer.
> Yeah, if what's visible in tracy is sufficient then it's on the list to allow for non-tracy `IREE_TRACE*` backends and we could have one that used NVTX's dynamic loading shim (https://github.com/NVIDIA/NVTX/tree/release-v3/c) - you'd set a cmake flag to switch to using NVTX instead of tracy and then get all the same information that we plumb through to tracy.
^-- From a requirements perspective, it would be good to enunciate additional needs beyond what is available in tracy. Things on that list need more tracing. It sounds like we need to prioritize the trace backend work.
> > I think the definite ask is some level of correlation with the nesting in the user-program invoking iree.
>
> Have you tried what was added in #13500? The information is diluted (only the first source location in a fused source location is plumbed through), but that lets you jump from dispatches (`cuLaunchKernel`) to Python source code as annotated by your ML framework / importer.
I haven't tried it yet. In this case the nesting is what is most important. I can already jump to source locations relatively easily with my current flow by manually looking up the dispatch name in the IR and clicking the source location in the .mlir file in VS Code (this actually gives me pretty good granularity, since I can click on different ops in the dispatch region to see where they came from individually).
I know that @kushanam had looked into adding NVTX before. All, do let us know when we have alignment on what instrumentation to add where and we'd be happy to help. Standing by.
When I wrap up the collective work (~days) I'll get the tracing in an initially pluggable state so we can wire that up to nvtx (and pjrt can do whatever it wants) - as Stella mentions, concurrently we can evaluate what is missing from the tracing with tracy and get that plumbed through as needed.
Request description
When I profile XLA:GPU with nsys, I see NVTX (TSL) markers indicating the step (taken from the module name). In the screenshot below you can see them as the gray boxes labeled `XlaModule:#hlo_module=pjit__wrapped_step_fn,program_id=42`.
It would be nice if IREE could emit such annotations as well, otherwise it becomes very difficult to separate the different training steps when running a model or looking at a benchmark trace that ran the workload multiple times.
What component(s) does this issue relate to?
Runtime
Additional context
No response