
End-to-end support for reusable indirect command buffers. #17875

Open benvanik opened 1 month ago

benvanik commented 1 month ago

Tracking remaining work on the runtime and compiler side for reusable command buffers (using indirect command buffers). #10144 has some background on the approach but is out of date at this point. This is the design doc for now and will link to associated work items remaining.

Terminology

A command buffer is reusable if it can be submitted to the device for execution multiple times from a single recording and is otherwise one-shot. A command buffer is indirect if it uses substitution to allow updates to the command buffer after it has been recorded. The concepts are orthogonal: a command buffer can be one-shot and indirect or reusable and direct. For most practical usage reusable command buffers are indirect.

A direct buffer reference is a concrete iree_hal_buffer_t* that exists at the time that the command buffer is recorded and that is retained for the lifetime of the command buffer. Every submission of the command buffer will use the exact same buffer pointer. An indirect buffer reference is symbolic at the time of recording and the concrete buffers are provided via a lookup table on each submission. Every submission can use a different set of buffers.
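For example, in terms of the HAL API helpers introduced under HAL API Changes below, the two kinds of references look roughly like this (my_buffer is a placeholder):

// Direct: a concrete buffer captured (and retained) at recording time.
iree_hal_buffer_ref_t direct_ref =
    iree_hal_make_buffer_ref(my_buffer, /*offset=*/0, /*length=*/256);
// Indirect: a symbolic binding table slot resolved on each submission.
iree_hal_buffer_ref_t indirect_ref =
    iree_hal_make_indirect_buffer_ref(/*slot=*/0, /*offset=*/0, /*length=*/256);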


A binding table is used to pass the references to the symbolic substitutions used for an indirect command buffer. The substitutions are treated as if the command buffer had been recorded with them at the time of submission and then discarded.


Motivation

Today when we produce command buffer recording sequences in the compiler we do so with the IREE_HAL_COMMAND_BUFFER_MODE_ONE_SHOT flag. The one-shot command buffers are created, recorded, submitted, and then discarded immediately after submission. This means that each time a particular block of code issuing commands is executed we spend the time making the calls through the VM->HAL module and HAL API implementation, validating all of the commands for correctness, and calling into underlying device APIs that do non-trivial work (allocations (hopefully pooled), their own validation, etc). This comprises nearly all of the CPU time required to issue a batch of commands. Submission of the recorded commands usually has a fixed cost dwarfed by the actual recording of the commands (10-50us vs 1-2ms). For models of reasonable complexity we generally have ~1000 commands with ~1500-2000 recording calls (though we should be able to get much closer to the command count with compiler optimizations). Even if each recording call takes 1us that's 1ms to record everything, and many calls take a handful of microseconds or more depending on the target.

It's possible to split the CPU time here into two parts: recording the structure of the command buffer and replaying that recording. So long as the structure is not changing (which commands are executed and in what order) we should be able to replay the same recording multiple times. The heavy-weight validation, allocation for growth, and chatty API calls used to record are then separated from the replay step that just issues those commands.


For device targets that natively support reusable command buffer-like representations at the driver level (VkCommandBuffer in Vulkan, CUgraph/hipGraph in CUDA/HIP, MTLIndirectCommandBuffer in Metal, and ID3D12CommandList in D3D12) we can record directly into those and push the chatty driver API calls onto the recording side of the time balance. Replaying then becomes (roughly) a single API call regardless of the number of commands originally issued. Whereas we may have needed 1000-2000 driver API calls per invocation, scaling with the complexity of the input program, we now only need (roughly) 1. As discussed later there are costs to getting into this mode, but going from O(n) to O(1) is worth adding potential inefficiencies in other parts of the system (such as extra indirections in kernels) even if a microbenchmark of a single kernel shows that it harms that one kernel.

For device targets that don't support reusable command buffers (or don't yet) we can still reap some benefits of splitting the CPU time by eliminating the IREE VM, HAL module, and HAL API calls from the hot path by recording into our in-memory iree_hal_deferred_command_buffer_t and replaying that against the native target APIs. It will always be strictly faster to do so given that nearly half the work of a particular command (gathering args, making VM<->native calls, unboxing references, performing validation, etc) is eliminated in the hot path.

Static Buffers

In order to reuse a direct command buffer all buffers must have fixed addresses. This is not the case in most programs as the user is often providing input and output storage that changes call-to-call. Though it is sometimes possible to excise static portions of the command sequence, doing so is too unreliable to provide consistent reuse and deterministic overheads.

Transient allocations made in device queue order are critical to keeping peak memory consumption fixed in pipelined asynchronous systems. These allocations are made at submission time and may produce new buffers each submission as buffers are pulled from pools or suballocated from larger data structures with offsets that vary. Even with a fixed number of outstanding submissions and bounded pools it's not feasible to require one copy of each command buffer per unique set of buffers when factoring in user-provided buffers or buffers that may change over time in a stateful program.

Here we assume that all buffers may change at any time and static buffers are an optimization that can be used to reduce the amount of information that must be provided per-submission at runtime. For our purposes we consider the underlying physical allocation and the subrange within any particular physical allocation as being potentially dynamic, meaning that commands referencing the same fixed ringbuffer may still be referencing dynamic offsets each submission.

Dynamic Shapes

Dynamically shaped tensors provided as inputs or computed as part of the program's execution are a baseline expectation in the system. It's nearly impossible to retrofit dynamic shapes into an architecture that was designed for static ones, as evidenced by the trail of abandoned efforts over the years in other frameworks. We assume all shapes can be dynamic and treat static shapes as an optimization that allows improved aliasing and fewer calculations, but otherwise design the system with the expectation that dynamic shapes are present (even if only partially).

Some of the more sophisticated execution systems today (mostly eager) only support dynamic shapes known at the program boundaries. We are looking ahead to programs that can vary their shapes based on calculated results ("a tensor with all elements <= 0.5"), external state (file/network/user IO), or internal state (invocation-to-invocation globals).

Control Flow

Though support is nascent we seek to allow simple forms of on-device control flow. There are many ways to accomplish this, and in the future we are likely to grow rich support for device-side dispatches that process condition regions and loops, but to start, basic features like predication can be implemented using many of the same primitives as indirect command buffers. For example, a conditional selecting between two compiled variants of a dispatch depending on the workload of that dispatch can be accomplished by producing a valid workgroup count for the selected variant and a 0x0x0 count for the other. Re-recording command buffers on changes to these conditions does not scale: the condition itself may be calculated on device, may change per invocation, or may occur 1000 times with unique conditions within a single command buffer. This combinatorial explosion makes it impossible to deliver deterministic performance in the face of arbitrary user programs and inputs without proper support, and much can be accomplished with predication when available.

Approach

Note: some of this work (particularly on the HAL API side) has landed already.

Note: the HAL does not require support for reusable/indirect command buffers and a compiler flag will be able to control whether the compiler generates usage of them - for legacy targets we can still emit the same one-shot usage we have today.

HAL API Changes

iree_hal_command_buffer_create gained a binding_capacity that defines the size of an optional symbol table that may be referenced during command buffer recording. All calls that can reference buffers (iree_hal_command_buffer_copy_buffer, iree_hal_command_buffer_push_descriptor_set, iree_hal_command_buffer_dispatch_indirect, etc) were changed to take an iree_hal_buffer_ref_t instead of an iree_hal_buffer_t*. Any of those buffer references can point to a direct iree_hal_buffer_t* (as before) or an indirect symbolic slot in the binding table as defined by the binding_capacity. A command buffer can have any mix of direct and indirect references. Not all of the requested binding table capacity must be used.

Recorded command buffers are submitted for execution with iree_hal_device_queue_execute, which now takes an optional iree_hal_buffer_binding_table_t per command buffer. This binding table is a flat list of iree_hal_buffer_t* references and subranges that correspond to the symbolic slots declared when the command buffer was recorded (up to binding_capacity). This allows the same command buffer to be submitted multiple times with different bindings or different subranges of those bindings. Any binding table slots not used by the command buffer may be left empty and only slots up to the maximum used need to be provided.

Example using direct buffer references:

iree_hal_command_buffer_create(..., IREE_HAL_COMMAND_BUFFER_MODE_ONE_SHOT, ...
                               /*binding_capacity=*/0, &command_buffer);
iree_hal_command_buffer_begin(command_buffer);
iree_hal_command_buffer_fill_buffer(command_buffer,
    iree_hal_make_buffer_ref(target_buffer, /*offset=*/0, /*length=*/8),
    &zero, /*pattern_length=*/1);
iree_hal_command_buffer_end(command_buffer);
iree_hal_device_queue_execute(..., &command_buffer, /*binding_tables=*/NULL);
// Must re-record if target_buffer changes to issue again!

Example now using indirect buffer references for reuse:

// Done once with a symbolic reference (no buffer pointer needed to record).
iree_hal_command_buffer_create(..., /*flags=*/0, ...
                               /*binding_capacity=*/1, &command_buffer);
iree_hal_command_buffer_begin(command_buffer);
const int kTargetBufferSlot = 0;
iree_hal_command_buffer_fill_buffer(command_buffer,
    iree_hal_make_indirect_buffer_ref(kTargetBufferSlot, /*offset=*/0, /*length=*/8),
    &zero, /*pattern_length=*/1);
iree_hal_command_buffer_end(command_buffer);

// << some time in the future >>
const iree_hal_buffer_binding_t bindings_a[] = {
    /*kTargetBufferSlot=*/{target_buffer_a, 0, IREE_WHOLE_BUFFER},
};
iree_hal_buffer_binding_table_t binding_table_a = {
  .count = IREE_ARRAYSIZE(bindings_a),
  .bindings = bindings_a,
};
iree_hal_device_queue_execute(..., &command_buffer, &binding_table_a);

// << further in the future >>
const iree_hal_buffer_binding_t bindings_b[] = {
    /*kTargetBufferSlot=*/{target_buffer_b, 0, IREE_WHOLE_BUFFER},
};
iree_hal_buffer_binding_table_t binding_table_b = {
  .count = IREE_ARRAYSIZE(bindings_b),
  .bindings = bindings_b,
};
iree_hal_device_queue_execute(..., &command_buffer, &binding_table_b);

That's the extent of the HAL API change (binding_capacity, iree_hal_buffer_ref_t, and iree_hal_buffer_binding_table_t). From that each HAL device implementation can decide how best to perform the requested replay (discussed below).

Validation

Most of the command buffer validation we perform can still happen with indirect command buffers. During recording the requirements for each binding slot are accumulated based on the commands that reference the respective slot. For example, if slot 3 is used for both a 32-bit fill and a dispatch it will be marked as needing both the IREE_HAL_BUFFER_USAGE_TRANSFER_TARGET and IREE_HAL_BUFFER_USAGE_DISPATCH_STORAGE usage bits and at least a 4-byte alignment. The validation of binding requirements is performed per submission and scales with the total number of unique bindings instead of the total number of recorded commands. Some specific validation (namely buffer overlap detection for copies) is not currently performed as doing so would be too costly.
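A minimal sketch of how that per-slot accumulation and per-submission check could look, with hypothetical names for the requirements bookkeeping (the actual implementation differs):

// Hypothetical per-slot requirements accumulated while recording commands.
typedef struct {
  iree_hal_buffer_usage_t required_usage;  // ORed usage bits from all referencing commands.
  iree_device_size_t required_alignment;   // max alignment any referencing command needs.
  iree_device_size_t max_length;           // furthest byte any referencing command may touch.
} binding_requirements_t;

// Per submission: validate each provided binding against the accumulated requirements,
// scaling with the number of unique bindings rather than the number of commands.
for (iree_host_size_t i = 0; i < binding_table.count; ++i) {
  const binding_requirements_t* req = &requirements[i];
  const iree_hal_buffer_binding_t* binding = &binding_table.bindings[i];
  if (!binding->buffer) continue;  // slot unused by this command buffer
  if (!iree_all_bits_set(iree_hal_buffer_allowed_usage(binding->buffer),
                         req->required_usage)) {
    return iree_make_status(IREE_STATUS_PERMISSION_DENIED,
                            "binding table slot %d missing required usage bits", (int)i);
  }
  // ...alignment and length checks follow the same pattern...
}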

Supporting Inline Execution

The IREE_HAL_COMMAND_BUFFER_MODE_ALLOW_INLINE_EXECUTION flag is used frequently as a workaround to allow streaming-based APIs (CUDA/HIP and synchronous CPU) to execute during recording and avoid the latency hit of a full record + replay. This is only needed because we are issuing one-shot command buffers: the only CPU time it saves is that used to record the commands. The flag itself is dangerous and hamstrings the compiler: in that mode it's not possible to issue concurrent command buffers or even record multiple command buffers simultaneously, and it introduces blocking behavior in a part of the execution that is expected to be non-blocking. As part of this work the flag will be removed from the HAL. CUDA/HIP will be better served with graphs, and synchronous single-threaded CPU should use the inline HAL instead, which avoids all the overhead associated with command buffers and drops the code size of the HAL by 90%.

Compiler Changes

stream.cmd.execute is our stream-level primitive for a batch of commands with a fixed set of resources that execute atomically on device. Today this is lowered inline into the command buffer record and submit sequence:

stream.cmd.execute ... with(%res as %captured_res: !stream.resource<transient>{%size}, ...) {
  stream.cmd.fill %c255_i32, %captured_res[%c0 for %c128] : i32 -> !stream.resource<transient>{%size}
}

->

%cmd = hal.command_buffer.create ...
hal.command_buffer.fill_buffer<%cmd : !hal.command_buffer> target(%captured_res : !hal.buffer)[%c0, %c128] pattern(%c255_i32 : i32)
hal.command_buffer.finalize<%cmd : !hal.command_buffer>
hal.device.queue.execute ... commands([%cmd])

Now conversion will (where profitable) nest the command buffer creation in a hal.device.memoize op:

%cmd = hal.device.memoize<%device : !hal.device> {
  %recorded_cmd = hal.command_buffer.create ... bindings(1)
  hal.command_buffer.fill_buffer<%recorded_cmd : !hal.command_buffer> target(@0)[%c0, %c128] pattern(%c255_i32 : i32)
  hal.command_buffer.finalize<%recorded_cmd : !hal.command_buffer>
  hal.return %recorded_cmd : !hal.command_buffer
}
hal.device.queue.execute.indirect ... commands([%cmd]) bindings([
  (%res : !hal.buffer)[%c0, %size]
])

The hal.device.memoize op is per-device and returns a reusable command buffer that can be submitted with a binding table. The op nests the recording and yields the command buffer produced. Any dynamic buffer references are turned into symbolic references (@0 for slot 0) while any constant references (immutable globals/etc) are captured as direct references. The op can be used to memoize other things (buffers, etc).

Following stream->hal conversion a pass will run that processes the hal.device.memoize ops using device analysis so that it can stamp out as many per-device variants as required. This allows the compiler to specialize the memoized code paths while knowing that only a single device category/class is possible. For example, if a particular hal.device.memoize is used on two homogeneous devices the code will be emitted once and invoked twice to create one resource per device, while if used on two heterogeneous devices the code will be emitted twice and specialized for each. After the pass the memoized results will be replaced with a lookup of the appropriate memoized resources. This pass can fold device queries that are nested inside of it, or a canonicalizer may be added to do so.

To start, memoization will assume that all inputs are available at initialization time. This is not required: we could memoize with dynamic inputs but that requires some more infrastructure (weak references for an LRU, etc). The pass that memoizes will create one global per device per result of the memoize op and one initializer per device that populates them:

util.global private @__memoized_foo_device_a : !hal.command_buffer
util.initialize {
  // hoisted constants/global loads/etc
  %recorded_cmd = hal.command_buffer.create ... bindings(1)
  hal.command_buffer.fill_buffer<%recorded_cmd : !hal.command_buffer> target(@0)[%c0, %c128] pattern(%c255_i32 : i32)
  hal.command_buffer.finalize<%recorded_cmd : !hal.command_buffer>
  util.global.store %recorded_cmd, @__memoized_foo_device_a
  util.return
}
util.global private @__memoized_foo_device_b : !hal.command_buffer
util.initialize {
  ...
  util.return
}

To start, the original memoize op will be replaced with a hideous table of comparisons of the device to the known device globals:

%is_a = ... is %device == @device_a
%cmd_ab = scf.if %is_a {
  %cmd_a = util.global.load immutable @__memoized_foo_device_a : !hal.command_buffer
  scf.yield %cmd_a
} else {
  %is_b = ... is %device == @device_b
  %ret = scf.if %is_b {
    %cmd_b = util.global.load immutable @__memoized_foo_device_b : !hal.command_buffer
    scf.yield %cmd_b
  } else {
    util.fail
    scf.yield %null
  }
  scf.yield %ret
}
hal.device.queue.execute.indirect ... commands([%cmd_ab]) bindings([
  (%res : !hal.buffer)[%c0, %size]
])

Future work would be to instead generate functions that rely on constexpr hoisting to end up as globals. This requires more sophisticated analysis than we have today to be as efficient. The lookup could also be simplified to keep IR simpler for longer (we used to have hal.device.switch) but for now the expanded form should work. Hopefully.

Dynamic Resource Ranges

Any dynamic range on a resource e.g. %resource[%offset for %length] can be hoisted out and placed on the binding table. This allows programs with dynamic tensor shapes that result in dynamic offsets to still reuse command buffers even if the exact ranges change each invocation. The conversion step is responsible for doing the split and only putting constant values in the command buffer. It's possible to optimize this a bit by using both: for example, if (as is often the case) a starting offset is calculated by an arith.addi %dynamic_value, %c100 (usually from a packing table) we could record the command with %c100 and use %dynamic_value as the offset in the binding table. Doing so allows earlier validation during recording and may help symbol CSE fold memoized regions better. Or it may hurt them 🤷
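As a sketch of that split using the HAL helpers above (resource_buffer and dynamic_offset are placeholders), the recorded command keeps only the constant portion while the dynamic base offset rides in the binding table subrange:

// Recording: only the constant %c100 portion of the offset is baked into the command.
iree_hal_command_buffer_fill_buffer(command_buffer,
    iree_hal_make_indirect_buffer_ref(/*slot=*/0, /*offset=*/100, /*length=*/128),
    &zero, /*pattern_length=*/1);

// Submission: the %dynamic_value portion becomes the binding's offset.
const iree_hal_buffer_binding_t bindings[] = {
    /*slot 0=*/{resource_buffer, dynamic_offset, IREE_WHOLE_BUFFER},
};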

Dynamic Parameters

Buffers are easy to capture and turn into symbolic binding table slots. Primitive values that end up as push constants are trickier. Instead of having a symbolic table of primitives we will use parameter buffers as is common in GPU programming. Push constants will still be used when a value is constant at command buffer recording time but all others will be allocated and stored into a transient buffer that is dereferenced by dispatches. This process will likely happen in the stream dialect after dispatch parameter optimization such that we have the smallest set of non-constant/uniform values after executable deduplication.

If any dispatch within a stream.cmd.execute takes a dynamic value (such as a dynamic shape dimension) the execution region will be given a transient buffer allocated with capacity for all unique dynamic values. This is usually a very small number as dimensions and other dynamic expressions are shared after extensive IPO/CSE. The population of the transient buffer (uploading the values into it) will happen prior to the region and the buffer will be passed as if it was any other resource and thus end up as an indirect reference during recording. The dispatches will have their ABI updated to take the transient buffer (if they need it) and load the appropriate value instead of relying on the function argument as before. Because we are still in the stream dialect inserting the asynchronous alloca/upload/dealloca is easy:

%some_value = dynamic
%transient, %transient_alloca_timepoint = stream.resource.alloca uninitialized : !stream.resource<transient>{%c4} => !stream.timepoint
%execute_timepoint = stream.cmd.execute await(%transient_alloca_timepoint) => with(%transient as %captured_transient: !stream.resource<transient>{%c4}) {
  stream.cmd.dispatch @ex::@dispatch(%some_value : i32) {
    wo %captured_transient[%c0 for %c4] : !stream.resource<transient>{%c4}
  }
} => !stream.timepoint
%transient_dealloca_timepoint = stream.resource.dealloca await(%execute_timepoint) => %transient : !stream.resource<transient>{%c4} => !stream.timepoint

->

%some_value = dynamic
%transient, %transient_alloca_timepoint = stream.resource.alloca ...
%parameters, %parameters_alloca_timepoint = stream.resource.alloca ...
// TBD: nice way to bulk upload %some_value to %parameters, maybe a hal.buffer.store that took variadic args, etc
%alloca_timepoint = stream.timepoint.join max(%transient_alloca_timepoint, %parameters_alloca_timepoint)
%execute_timepoint = stream.cmd.execute await(%alloca_timepoint) =>
    with(%transient as %captured_transient: !stream.resource<transient>{%c4},
         %parameters as %captured_parameters: !stream.resource<transient>{%c4}) {
  stream.cmd.dispatch @ex::@dispatch {
    ro %captured_parameters[%c0 for %c4] : !stream.resource<transient>{%c4}
    wo %captured_transient[%c0 for %c4] : !stream.resource<transient>{%c4}
  }
} => !stream.timepoint
%transient_dealloca_timepoint = stream.resource.dealloca await(%execute_timepoint) ...
%parameters_dealloca_timepoint = stream.resource.dealloca await(%execute_timepoint) ...
%dealloca_timepoint = stream.timepoint.join max(%transient_dealloca_timepoint, %parameters_dealloca_timepoint)

After conversion where the stream.cmd.execute is memoized the parameters buffer will be on the execution side while only the indirect reference will be on the recording side.

There are other ways of doing this - for example, we could have a shared ringbuffer resource of parameters that we write into and then use a subset of during execution. This would avoid the additional alloca/dealloca overhead at the cost of possibly limiting concurrency or maximum parameter count. We could also not use stream-ordered allocations as the parameters are allocated/written early and can be cached. This is also a place where we could use memoization: a hal.device.memoize returning a per-device per-execution-region ringbuffer would only limit the number of outstanding submissions of a particular execution region to some fixed count. TBD and probably terrible to start.

Dynamic Transfer Commands

Most targets don't allow updating the buffers or ranges of recorded copy/fill commands (called out below). This means that we would need to generate emulated variants of these commands by way of dispatches in any case where an indirect buffer is referenced. This is something we could do universally to start or conditionally based on device capabilities. Whether we want to ship runtime builtins (as we do for Metal and Vulkan) that handle these cases or do them in the compiler is TBD.

Dynamic Workgroup Counts (🚧🚧 WIP 🚧🚧)

Workgroup counts are special for a variety of reasons, but the primary reason they'll be treated differently from parameters is that at the time we are representing programs in the stream dialect we haven't yet collapsed the workload passed to each dispatch (0 or more values) into the 3D grid XYZ. This means we can't easily allocate space for them when they are dynamic, produce the parameter buffers containing them, etc.

My current plan is to turn any dispatch with a dynamic workload into an indirect dispatch (bottoming out on iree_hal_command_buffer_dispatch_indirect). At the same time we are extracting dynamic parameters above we'll extract dynamic workgroup counts and emit a stream.cmd.dispatch.indirect that takes a resource range for the workgroup count. Today the stream.executable.export workgroups region converts workloads into workgroup counts and it needs to remain there to allow codegen to vary the logic per target. To allow the result of the export workgroup count calculation to be used before target codegen specialization a new stream.executable.export.workgroups op will be added that, given an affinity, executable export reference, and workload, produces the workgroup count XYZ. During conversion into HAL the op will lower into the same inlining logic as performed today.

stream.cmd.execute on(#device) ... {
  stream.cmd.dispatch @ex::@dispatch[%workload_a, %workload_b] ...
}

->

%workgroups:3 = stream.executable.export.workgroups on(#device) @ex::@dispatch[%workload_a, %workload_b] : i32, i32, i32
%parameters = ... upload as above ...
stream.cmd.execute on(#device) ... {
  stream.cmd.dispatch.indirect @ex::@dispatch[%parameters[%c0 to %c12]]
}
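At the HAL layer this bottoms out on iree_hal_command_buffer_dispatch_indirect with the workgroup count XYZ read from a buffer, and when that buffer is dynamic it can itself be an indirect reference (sketch only; remaining arguments elided):

// The 3 x uint32 workgroup count lives in the uploaded parameters buffer and is
// referenced symbolically so each submission can provide a different buffer.
iree_hal_buffer_ref_t workgroups_ref = iree_hal_make_indirect_buffer_ref(
    /*parameters_slot=*/0, /*offset=*/0, /*length=*/3 * sizeof(uint32_t));
// iree_hal_command_buffer_dispatch_indirect(command_buffer, executable,
//                                           entry_point, workgroups_ref, ...);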

As future work we can use this same mechanism to move workgroup count calculations to the device for adaptive workgroup counts (data-dependent workgroup counts that are useful in things like dynamically shaped tree reductions or dynamic tiling). Any sequence we have today of a dispatch followed by a host load (flow.tensor.load/etc) could instead be turned into a dispatch of the logic following the host load and its arithmetic. Because this is convergent with the above design we can start doing basic predication and loop logic even without full command processor control.

The caveat at this point is that even with dynamic dimensions the workgroup count may end up static and we don't have a way of knowing that now, meaning we may use indirect dispatches where one is not required. Thankfully that's exceptionally unlikely (if a workload has tensors with dynamic shapes then the workgroup count is often dynamic as well), and detecting it would take a lot more analysis; hopefully we could optimize such cases earlier in some other way, but for now we assume the worst-case current behavior. Ideally we'd be a bit more clever about which values we pass as the workload when forming dispatch regions.

The larger issue discussed below is that indirect dispatch is extra work in CUDA and (AFAICT) not possible in HIP.

Codegen Changes

As each target gains native support for reusable command buffers it's likely that there will need to be codegen changes in how descriptors are referenced. Today hal.interface.binding.subspan references a descriptor set ordinal and binding ordinal that correspond to the concrete set/bindings provided at runtime via the iree_hal_command_buffer_push_descriptor_set command. The MaterializeInterfacesPass will gain the ability to emit variants of the subspan op that reference a binding table slot instead (maybe hal.interface.binding.subspan slot(0) instead of hal.interface.binding.subspan set(0) binding(0) etc). How that is lowered to each target is part of the ABI agreement between the runtime and compiler.

As all implementations are initially emulated there will be no changes. Until a TargetBackend opts into native support the pass will still emit the existing form, though the stream dialect analysis will start tracking which bindings are from the binding table sooner.

Target Implementation

To start, all targets use iree_hal_deferred_command_buffer_t when any reusable command buffer or command buffer with a binding capacity > 0 is requested. The indirect command buffer is recorded as a deferred command buffer and then upon submission the target runtime HAL implementations replay it against their own native direct command buffers as if it had been originally recorded that way. The advantage of the recording is that the replay can happen asynchronously: a thread pumping pending action queues is what actually makes the driver API calls instead of, as previously, the main CPU thread running the VM. This enables pipelining even on synchronous devices.
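For the emulation path that replay is roughly a sketch like the following inside the pending action queue, assuming the deferred command buffer utility's apply entry point is used (target-specific pieces elided):

// Once the submission's waits are satisfied: acquire/record a native one-shot
// command buffer for the target and replay the deferred commands into it,
// resolving indirect references against this submission's binding table.
iree_hal_command_buffer_t* native_command_buffer = NULL;
// ... target-specific creation of native_command_buffer ...
IREE_RETURN_IF_ERROR(iree_hal_deferred_command_buffer_apply(
    deferred_command_buffer, native_command_buffer, binding_table));
// ... target-specific submission of native_command_buffer ...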

For the most part this leads to an identical execution as the one-shot case with the exception of indirect dispatches: today the compiler never generates them and the CUDA and HIP implementations bail with an UNSUPPORTED error. That will need to change :)

This approach allows for the compiler to be upgraded to produce reusable command buffers without needing all targets to simultaneously support them natively. Some targets may never support them and may choose to rely on the shared emulation via iree_hal_deferred_command_buffer_t or their own more optimized version and that's ok: they'll still get the benefits of amortizing the IREE VM overhead and a bulk of the validation costs as well as getting to execute asynchronously.

Local CPU Synchronous (local-sync)

There are vanishingly few uses for this outside of debugging and no effort will be spent on it beyond the default emulation. Users who can assert single-threaded synchronous CPU execution should use the inline HAL if concerned with performance.

Local CPU Task System (local-task)

Today the task system's iree_task_t is mutable and not designed to be reused (#5250). It may not make sense to make it reusable given the performance benefits of mutability during execution. It would make sense, though, not to use iree_hal_deferred_command_buffer_t and instead use a data structure more optimized for quickly creating the task graph for execution - one that precaches things like task node templates, the memory required for the total task graph, and the relationships between task graph nodes (storing an edge table).
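Purely as an illustration of the shape such a structure could take (all names hypothetical, nothing here exists today):

// Hypothetical record-time representation optimized for quickly stamping out
// the task graph on each submission.
typedef struct task_template_t task_template_t;  // per-task immutable state (hypothetical)
typedef struct {
  iree_host_size_t task_count;      // total tasks in the graph
  iree_host_size_t arena_capacity;  // bytes required to instantiate all iree_task_t's
  task_template_t* task_templates;  // one per task: entry points, workgroup sizes, etc
  // Flattened edge table: successors of task i are edges[edge_offsets[i]..edge_offsets[i+1]).
  uint32_t* edge_offsets;           // task_count + 1 entries
  uint32_t* edges;                  // dependency edges between task graph nodes
} task_command_recording_t;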

CUDA/HIP (Graphs)

(this general approach applies to both CUDA and HIP, target-specific details below)

CUDA graphs can be used in two ways to achieve indirect command buffers: by updating the graph prior to launching it with the substituted values or by using an indirection in the kernels through a buffer stored out-of-band. The advantage of the latter approach is that one or two updates (calls through to CUDA driver APIs) could handle any number of commands or bindings with a fixed cost, while needing to update all commands would scale with the number of commands using indirect buffers. In both cases the graph is constructed (mostly) the same way initially and stored for replay as a CUgraphExec handle.

The canonical graph update approach would require storing the CUgraphNode for each command that referenced a symbolic binding table slot as well as a map of binding table slot to the list of nodes (and which fields) it was used for. In the worst case this is O(num_commands) updates (if each command used at least one indirect binding).

// On command buffer: memcpy/memset/kernel update params for each node.

// Update the graph prior to launch, for each command type:
for (iree_host_size_t i = 0; i < command_buffer->memcpy_update_count; ++i) {
  iree_hal_cuda_graph_memcpy_update_t* update = &command_buffer->memcpy_updates[i];
  CUDA_MEMCPY3D new_params = update->params_template;
  if (update->fields & IREE_HAL_CUDA_GRAPH_MEMCPY_UPDATE_SOURCE_BUFFER) {
    new_params.srcDevice = iree_hal_cuda_buffer_device_address(binding_table.bindings[update->source_slot].buffer);
  }
  if (update->fields & IREE_HAL_CUDA_GRAPH_MEMCPY_UPDATE_TARGET_BUFFER) {
    new_params.dstDevice = iree_hal_cuda_buffer_device_address(binding_table.bindings[update->target_slot].buffer);
  }
  cuGraphExecMemcpyNodeSetParams(command_buffer->exec_handle, update->node, &new_params, ...);
}
// Same loop for memset nodes...
// Same loop for kernel launch nodes...

// Launch now using a snapshot of the current updates:
cuGraphLaunch(command_buffer->handle, target_stream);

Even if each cuGraphExec*SetParams call was 100ns that's 100us for 1000 commands, and 100ns is an extremely aggressive estimate given the likelihood of cache misses while walking the large data structures both on our side and the driver side.

Instead, we could have a per-command-buffer block of mapped device memory containing the device addresses in the binding table per launch, update that in device memory, and launch the graph. There are several approaches to this depending on what is fastest. One example is a ringbuffer of host-side memory populated on launch and asynchronously copied into a block of persistent device memory used by the graph:

// On command_buffer: CUdeviceptr host_binding_ptrs[max_outstanding][binding_capacity];
// On command_buffer: CUdeviceptr device_binding_ptrs;

// Acquire a free copy of the host-side pointers memory from the command buffer cache.
CUdeviceptr* host_binding_ptrs = iree_hal_cuda_graph_command_buffer_acquire_host_binding_ptrs(command_buffer);
for (iree_host_size_t i = 0; i < binding_table.count; ++i) {
  host_binding_ptrs[i] = iree_hal_cuda_buffer_device_address(binding_table.bindings[i].buffer) + binding_table.bindings[i].offset;
}

// Copy from the host memory into device memory (or memcpy in mapped memory instead, etc).
cuMemcpyHtoDAsync(command_buffer->device_binding_ptrs, host_binding_ptrs, binding_table.count * sizeof(CUdeviceptr), target_stream);

// Launch with the latest binding table values.
cuGraphLaunch(command_buffer->handle, target_stream);

Lots of ways to do it - ringbuffers on the device instead of per-command-buffer, transient allocations in the graph, etc. The key is that the amount of CPU work should scale only with the number of bindings, the CPU->GPU traffic should be minimal (ideally just binding_count * sizeof(CUdeviceptr)), the driver should not be involved (no need to worry about graph update performance), etc.

Once the binding table (<10 entries usually, but could be ~4000 entries and ~32KB) is uploaded to the device the next step would be to scatter the binding table entries into the parameter buffers for the dispatches that use them. This should happen on device with a scatter kernel: as part of recording the command buffer a scatter map is constructed and persisted on device that maps binding table indices to addresses in a parameter buffer such that no additional host/device transfers are needed. The scatter can execute in parallel, only needs to happen once per command buffer launch, and best of all adds no additional API overhead. Kernels will need to take the parameter buffer pointer and indirect through it, but since it will be uniform for all threads the overhead should be minimal. Importantly, with this approach (instead of one where kernels reference the binding table directly) we get kernel reuse across multiple command buffers as the kernels themselves just reference their relative parameter buffer offsets instead of the global binding table ordinals.
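As a sketch of the record-time data (hypothetical names), the scatter map is just a list of patch entries the scatter kernel walks once per launch:

// Hypothetical scatter map entry persisted on device when the command buffer is
// recorded: the scatter kernel reads the resolved device address for
// binding_table_slot from the uploaded table and writes it into the dispatch
// parameter buffer at param_buffer_offset. One thread (or lane) per entry.
typedef struct {
  uint32_t binding_table_slot;   // index into the uploaded binding table
  uint32_t param_buffer_offset;  // byte offset of the pointer field to patch
} binding_scatter_entry_t;
// Built during recording: each command using an indirect binding appends one
// entry per pointer field that needs patching.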


CUDA

Indirect dispatch (aka adaptive grid sizes) requires the use of CUDA Dynamic Parallelism (CDP) as unfortunately there's no way to do it natively as there is with Vulkan/Metal/D3D12. The supported approach is to have a tiny hand-authored kernel launched with 1x1x1 that takes a device function pointer and a buffer containing the grid size as parameters and then uses device-side launch to call the function with the dereferenced grid size. The key is to use cudaStreamTailLaunch introduced in CDP2 - this enqueues the launch, returns from the kernel, and then immediately begins executing the launched kernel. The overhead is one additional kernel launch on the device side (so no host overhead). See https://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf 6.2.8.7.7.9 Tail Launch.

I believe the rough implementation looks like cudaMemcpyFromSymbol (IIRC just cuModuleGetGlobal + a cuMemcpyDtoH on the host) to get the target function pointer and having the trampoline kernel copy over the parameter buffer into the result of a cudaGetParameterBuffer and then cudaLaunchDevice (which lowers to some PTX magic). nvcc dumps/PTX disassembly could be useful to get the exact sequence between user-level C++ and what we care about (the PTX).
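The host-side lookup would be roughly the following (symbol name illustrative, error handling elided; the exact mechanics are TBD as noted above):

// Read the device-side function pointer out of a __device__ symbol exported by
// the module (the kernel is assumed to have stored its own address there).
CUdeviceptr symbol_ptr = 0;
size_t symbol_size = 0;
cuModuleGetGlobal(&symbol_ptr, &symbol_size, module, "iree_hal_cuda_dispatch_fn_ptr");
void* device_fn_ptr = NULL;
cuMemcpyDtoH(&device_fn_ptr, symbol_ptr, sizeof(device_fn_ptr));
// device_fn_ptr is then passed as an argument to the 1x1x1 trampoline kernel
// that performs the device-side tail launch.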

There are some example discussions about the performance of this on the CUDA forums (e.g. https://forums.developer.nvidia.com/t/performance-drops-with-dynamic-parallelism/294804/10) - nvidia claims tail-launch latencies are effectively noise. It's useful to keep in mind that here we're using this trampoline launch to avoid several microseconds up to several milliseconds of work (or much longer, if a host round-trip is required to read back the grid size from the device and then launch again in the stream!). A fixed 1-5us of command processing time per indirect dispatch (irrespective of the total workgroup count of that dispatch) is inconsequential.

HIP

I don't know what to do. HIP doesn't support dynamic parallelism, and given that it replicates the CUDA API (which has no way of doing indirect dispatch without dynamic parallelism) HIP is at a disadvantage. Bah.

I'm still thinking of workarounds and suggestions are very welcome. This is the biggest blocker for reusable command buffers and HIP may require using the emulation path until it is figured out.

The nasty hack I'm thinking about is a command buffer mode flag indicating that the indirect dispatch workgroup count is always available in the referenced binding table buffer at the time the submission is ready to execute. We could then have the pending action queue go read the workgroup counts back to the host and update the graph with cuGraphExecKernelNodeSetParams on gridDimX/Y/Z with the values prior to launching. This would require the command buffer to track the indirect dispatch commands and their corresponding graph nodes so that they could perform the update. Since it's a submission-level operation it'd be batched once per command buffer launch instead of once per dispatch within, but it's still additional latency that is unfortunate.
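A rough sketch of that update in the pending action queue, following the params_template pattern from the graph section above (the bookkeeping struct and field names are hypothetical; HIP API usage assumed to mirror the CUDA counterparts):

// For each indirect dispatch tracked at record time: read back the workgroup
// count the program produced into the referenced buffer, then patch the node.
for (iree_host_size_t i = 0; i < command_buffer->indirect_dispatch_count; ++i) {
  indirect_dispatch_update_t* update = &command_buffer->indirect_dispatches[i];
  uint32_t xyz[3] = {0, 0, 0};
  hipMemcpy(xyz,
            (const uint8_t*)update->workgroups_device_ptr + update->workgroups_offset,
            sizeof(xyz), hipMemcpyDeviceToHost);
  hipKernelNodeParams new_params = update->params_template;
  new_params.gridDim.x = xyz[0];
  new_params.gridDim.y = xyz[1];
  new_params.gridDim.z = xyz[2];
  hipGraphExecKernelNodeSetParams(command_buffer->exec_handle, update->node, &new_params);
}
hipGraphLaunch(command_buffer->exec_handle, target_stream);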

This workaround should be fine to start with all current programs we have but won't work with actual indirect dispatch where the workgroup counts are produced within the command buffer. That will happen when we start hoisting sequences of flow.tensor.load + arithmetic + workgroup count calculation into dispatches to avoid the host round-trips. When that does inevitably happen we'd probably end up needing to partition the graph into subgraphs, insert host callbacks prior to indirect dispatches that read back the results, and then inline launch/sequence the target kernel immediately after. That's effectively the HIP driver doing what the program would otherwise express via a more heavy-weight process in the compiler (full submission / back to the VM to do some work / full submission), just with the ability to do host callbacks, so it's still an improvement - it's just not nearly competitive with the other implementations.

Vulkan

Though Vulkan has reusable command buffers it does not have a way to update the parameters within them. It does support indirect dispatch, though, so the only support required is for the binding table references. The benefit of using native Vulkan command buffers is that we will have a single Vulkan API call (vkQueueSubmit) for each invocation in place of the expensive per-command calls through the Vulkan ICD and into the drivers, as well as avoiding the command pool allocation management.

If we could rely on a large enough descriptor set capacity such that we could fit the entire binding table in its contents we could emit SPIR-V shaders that had a descriptor set with one binding per table slot. Unfortunately most devices are limited to a very small number and don't support update-after-bind (or at least not in queue order). Since there's no asynchronous descriptor set update API and descriptor set references are baked into the command buffer it also means that only one outstanding submission would be allowed at a time. This is non-viable.

The alternative (which we want anyway for 64-bit support and such) is to use device addresses (#13945) to upload the binding table and then scatter the table to per-dispatch parameter buffers as with the CUDA/HIP graph approach described above. Each command buffer will allocate a device-local/host-visible binding table buffer of device addresses that will be populated prior to submitting the command buffer for execution, and all SPIR-V shaders using indirect bindings will be changed to use that parameter buffer bound to a well-known descriptor set for getting device addresses. By restricting reusable command buffers to a single concurrent invocation per device queue (sufficient for our current needs) a submission sequence would look like:

VkDeviceAddress device_ptrs[binding_table.count];
for (iree_host_size_t i = 0; i < binding_table.count; ++i) {
  device_ptrs[i] = iree_hal_vulkan_buffer_device_address(binding_table.bindings[i].buffer) + binding_table.bindings[i].offset;
}
VkCommandBuffer update_cmd = ... acquire from pool ...;
vkCmdUpdateBuffer(update_cmd, reusable_command_buffer->binding_table_buffer, 0, binding_table.count * sizeof(VkDeviceAddress), device_ptrs);
vkCmdPipelineBarrier(update_cmd, ...);  // transfer -> compute barrier on the binding table buffer
// bind persistent descriptor set pointing at parameter buffer
// bind scatter shader
vkCmdDispatch(update_cmd, ...);
vkQueueSubmit(... {update_cmd, reusable_command_buffer});

This allows us to have multiple invocations of the same command buffer in-flight on the same queue as the binding table will be updated immediately prior to execution in the queue timeline.

Copies and fills cannot have their target buffers changed and will need to be emulated with dispatches.

Metal

There are a few approaches we could use here. The simplest would be to record a MTLIndirectCommandBuffer referencing a shared MTLBuffer that is updated in the queue timeline with [MTLBuffer gpuAddress] values and limit it to one concurrent submission. A scatter command as described for CUDA/HIP graphs and Vulkan could then run to update the argument buffers for each dispatch. We could use the features of MTLIndirectCommandBuffer to remove the concurrent execution limitation by having each submission use its own binding table buffer (or subrange of a shared ringbuffer) and then use [MTLIndirectComputeCommand setKernelBuffer] to adjust the offsets per command using it, but this would likely introduce too much overhead to be worth it given that we tend not to issue the same command buffers concurrently.

Copies and fills cannot have their target buffers changed and will need to be emulated with dispatches.

Direct3D12

The table-based nature of D3D12 works well here; upon submission a root signature will be constructed referencing both the static descriptors of the pre-recorded command buffer as well as the dynamic descriptors of the binding table. Shaders will be emitted to use the appropriate subtables.

Copies and fills cannot have their target buffers changed and will need to be emulated with dispatches.

WebGPU

There are no reusable command buffers in WebGPU and emulation (either via iree_hal_deferred_command_buffer_t or a more optimal cached data structure) will be required.

ScottTodd commented 1 month ago

Lots of great stuff here!

I'm still reading through but was curious about the construction and analysis of hal.device.memoize. At the point where command buffers are created, we've already done substantial specialization, so would it be possible for different dispatches to share command buffers? I suspect that reusable indirect command buffers would get stacking benefits with kernel deduplication (https://github.com/iree-org/iree/issues/12230, e.g. converting static shapes to dynamic shapes so dispatches that have the same structure but different shapes can be shared).

benvanik commented 1 month ago

Good q: multiple homogeneous devices can share the code to record command buffers by definition (as all device-dependent queries resolve to the same values) - this would be useful in sharding scenarios where the same commands are being run on different data on different devices. When hal.device.memoize is outlined device analysis (which is in the multi-device branch - need to land that!) will indicate which devices may be used to run that code and ensure both paths go down that. In other cases where command emission is not as device-dependent but was baked out earlier (like if sharding was done early so there was @run_shard_0() and @run_shard_1() by the time we were converting into HAL) we will outline all the regions to functions and then could use a function deduplication pass to share the code. What's fun about that is then multiple immutable globals will be initialized with the result of the same function and we could then fold all those globals (per device) - it should all just work (when we have function folding, natch :).

As for the runtime instances (vs the recording code) that's trickier - most devices at runtime want their own copy of the command buffer even if they are of the same type (you want your CUgraph uploaded and resident on each device instead of it trying to fetch it across the bus each time you launch it, Vulkan doesn't let you share VkCommandBuffer across VkDevice instances, etc). Because of this we'll always create one copy per device that uses it at runtime. Something I'm going to plan for is that hal.device.memoize will also have a queue affinity so that we could create one copy per device per queue used on that device (sharing the same recording code as above) - then when we start modeling NUMA nodes with queues instead of devices we can ensure each node has a copy of the command buffer local (instead of just at the device level). That will probably happen in the runtime driver which takes a potential queue affinity set in iree_hal_command_buffer_create - then drivers can decide if they want to replicate the recorded commands or not.