benvanik commented 1 year ago

NOTE: some of this is still in flux and when completed we can document this in more detail; this serves as a WIP design overview and tracking issue but not a design doc. This works backwards from what we have at runtime through to the frontend so we can see what gaps still exist.

On the path to multiple heterogeneous HAL devices is multiple queues on the same logical HAL device. This writeup presupposes that we start with that as a milestone and then incrementally build to heterogeneous/discrete devices.

HAL devices and queues

A logical device as exposed to IREE via the HAL may be composed of several (likely homogeneous) physical devices. Though not a strict requirement the assumption is that there's a unified address space between the physical devices such that there's only a performance penalty for misplacing memory: this means a buffer allocated via the logical device is accessible by any physical devices (e.g. peered GPUs). By using this approach things get much simpler in the compiler and runtime as compiled/loaded executables, associated data structures like pipeline layouts, and buffers can be handled once and uniformly regardless of the number of physical devices available at runtime. Extending support to heterogeneous devices then becomes focused on that work alone and the additional costs are paid for only where required while any number of homogeneous devices can be scheduled more efficiently. That is to say: supporting multiple hardware queues on a single physical device is functionally equivalent from the compiler's perspective to multiple physical devices exposed as a single logical device.

An example of the various configurations in mspaint:

This works because each device queue operation specifies a queue affinity in addition to fences that order the operations explicitly:

%buffer = hal.device.queue.alloca<%device : !hal.device>
    affinity(%affinity)
    wait(%wait_fence) signal(%signal_fence)
    pool(%c100_i64)
    type(DeviceLocal) usage(Transfer)
    : !hal.buffer{%size}
hal.device.queue.execute<%device : !hal.device>
    affinity(%affinity)
    wait(%wait_fence) signal(%signal_fence)
    commands([%cmd0, %cmd1])
hal.device.queue.dealloca<%device : !hal.device>
    affinity(%affinity)
    wait(%wait_fence) signal(%signal_fence)
    buffer(%buffer : !hal.buffer)
hal.device.queue.flush<%device : !hal.device>
    affinity(%affinity)

At runtime the device implementation gets that information and can use the affinity provided in order to map to whatever internal representation of queues it wants (CUDA streams, Vulkan VkQueues, Metal MTLCommandQueues, distinct thread pools attached to NUMA nodes, etc). The role of the compiler is to take operations in the stream dialect and assign affinities to them such that lowering into the hal dialect is a simple mapping to a !hal.device and queue affinity.

Where possible we intentionally want to keep which physical queue an operation is running on as an implementation detail. In most cases we can express the full asynchronous operation sequence with high enough fidelity that the runtime is able to dynamically partition the work as well as running all operations synchronously (referred to as the "forward progress guarantee"). There are some cases where this may not be the case but it's largely a representational issue we'll eventually be able to work around for programs we generate ourselves vs ones that come from a user with specific requirements.

The largest TBD design item is around expressing device/queue affinities in the stream dialect (discussed below). Once that's done we already have queue affinities in the HAL and can perform the mapping to underlying device queues. The HAL queue affinity is specified as a bitmask indicating which logical queues an operation is allowed to run on (similar to thread affinity masks on the CPU): this provides a hint to the runtime that certain operations may benefit from running in the same location even if they don't require it.

Example of the compiler letting operations run on any available queue (all bits set):

%buffer0 = hal.device.queue.alloca affinity(-1)
hal.device.queue.execute affinity(-1) commands([%cmd0]) // w/ %buffer0
hal.device.queue.dealloca affinity(-1) buffer(%buffer0)

Or on a particular queue:

%buffer0 = hal.device.queue.alloca affinity(0b01)
hal.device.queue.execute affinity(0b01) commands([%cmd0]) // w/ %buffer0
hal.device.queue.dealloca affinity(0b01) buffer(%buffer0)

Or overlap execution by running on two queues (which for most operations are allowed to be the same physical queue):

%buffer0 = hal.device.queue.alloca affinity(0b01)
hal.device.queue.execute affinity(0b01) commands([%cmd0]) // w/ %buffer0
hal.device.queue.dealloca affinity(0b01) buffer(%buffer0)
%buffer1 = hal.device.queue.alloca affinity(0b10)
hal.device.queue.execute affinity(0b10) commands([%cmd1]) // w/ %buffer1
hal.device.queue.dealloca affinity(0b10) buffer(%buffer1)

There's enough information to allow for some dynamic runtime fun: for example fences could track the physical queue they're to be signaled on and if another operation is enqueued waiting on that fence and it's allowed to run on at least the same queue the runtime can pin it there:

// runtime maps -1 affinity ("any") to CUDA device 0 stream 4 and stashes info on %alloca_fence
// that stream was chosen because the runtime knew it was idle (or rng/roundrobin/etc)
%buffer0 = hal.device.queue.alloca affinity(-1) signal(%alloca_fence)
// runtime device sees device 0 stream 4 on %alloca_fence, bitwise ANDs with the specified affinity, and picks that
// execution is able to use the memory local to the device even though the exact device was not specified
hal.device.queue.execute affinity(-1) wait(%alloca_fence) commands([%cmd0])

This allows for sequences of operations to be co-located based on load, quota, etc. Though there's still some mechanics required to improve it this also works with joins:

hal.device.queue.execute affinity(0b01) signal(%exec_fence_0)
hal.device.queue.execute affinity(0b10) signal(%exec_fence_1)
%join_fence = hal.fence.join %exec_fence_0, %exec_fence_1
// runtime can choose any queue, but should prefer to mask with 0b11
hal.device.queue.execute affinity(-1) wait(%join_fence)

This ability to chain is important as the fences may persist across control flow, function boundaries, or invocation boundaries all the way back up to user code and a sequence of invocations can use the masking to statelessly maintain locality while retaining the ability to dynamically load balance and provision execution resources.

There are some open issues around dynamic queue availability and collectives but there are approaches we can explore in the compiler to work around that (structured collectives vs explicit send/recv/etc). For now we assume collective exploration work will happen with a fixed set of queues.

In summary: a HAL device contains one or more queues and neither devices or their queues need to map 1:1 to physical constructs. Device queues are roughly equivalent to CUDA streams in that they are a way to logically sequence work independent of the physical command queues on hardware devices. Though physical queues may execute in FIFO order the way work is scheduled in the HAL does not presume this is the case and as such work is made available for some set of queues to process and it's up to the runtime/driver/device to make that happen. It's always legal to execute in-order and simpler backends may always choose to do that.

Affinities in the `stream` dialect

Almost all operations in the stream dialect have an optional #stream.affinity attribute. When lowering into the stream dialect certain flow operations (dispatches, transfers, etc) which are annotated can have that annotation mapped in to an affinity on the stream ops. The partitioning pass that runs assumes no/partial affinities and uses any provided to explicitly specify affinities while filling in the rest as it runs. In extreme cases a user could specify an affinity for every op but the intent is that we eventually assign them ourselves as part of a SPMD pass.

The affinity attribute contents are currently undefined and it may be worth changing to an interface. There's a few things that we need to be able to get from such an interface in the stream dialect transformation pipeline:

query the resource constraints (#stream.resource_config)? (currently specified via --iree-stream-resource-* flags)
get information for partitioning (when we do it):
- capabilities (collectives/etc)?
- performance estimates/cost model info to rank dispatches?
- etc...
compare two affinities to see whether they are compatible:
- is memory shared (buffer reuse)?
- are explicit queue transfers required (even if memory is shared)? (currently assumed no and they get dropped)
- are explicit cache operations required (flush/invalidate/etc)? (currently assumed no and they get dropped)
map into one or more HAL DeviceTargetAttrs (device/queue/compilation info)? (currently ignored for --iree-hal-target-backends)

Regardless of what implements it we'll want a composite attr that lets us OR affinities together and wildcard - this would let us say things like "run these operations on a GPU, but not specifically vulkan/cuda/etc" or "run these operations on a CPU with multithreading" etc. For any parameters like resource constraints this could also perform the min/max required.

We could have the existing HAL DeviceTargetAttr implement the interface to service all these requests keeping device info centralized in their respective target backends. Somewhat of a weird layering but we already have that in other places (like the ABI emitting HAL import/export ops). It also makes things extensible for out-of-tree backends.

Since affinities have never been set there's likely to be some work to fully propagate them and handle them as well as cleanup to do on places currently working around their omission. Thankfully on our near-term targets most of the answers to the above queries would be the default we have today so we can work through those issues incrementally.

Defining affinities at the flow(ish) level

The critical bit of the process is that during flow->stream lowering we are able to put affinities on ops. How or even if the flow ops are annotated is kind of whatever-goes. We need to be able to capture a with torch.cuda.device(1): around a bunch of code in pytorch, or JAX device_put, or command line flags, or auto-tuning annotations, etc. We'll also support no/partial placement as we do today such that the stream dialect can perform that during partitioning.

Input programs (mhlo/linalg/etc) can put attributes on their high-level ops. This has some reliability issues (most patterns don't tend to preserve attributes) so alternatives may be useful: for example attributes on regions/functions that can be verified to preserve dialect attrs during conversion/transformation. A recursive walk up to ancestors would let us tag entire modules, functions, etc with placement information and write code like if (user says run on cuda at runtime) { call function with cuda placement } else { call function with cpu placement } or whatnot. Dispatch region formation would need to be updated to look for these attributes and not fuse ops with different placement as well as carry across the placement on the flow.dispatch op. We'd want to strip the placement attributes while creating the region such that we could still deduplicate executables: two executables produced from two dispatches for different compatible devices should only be compiled once.

Propagation and specification of affinities on high-level ops are left to the user but there are some representations we'd really like to work towards (structured collectives/etc). In the first iteration focusing on simple rules will let us at least get some programs running: for example, ensuring tensor insertion is done in a way that doesn't add global barriers (what we really need the structured form to solve). One way to do this may be a linalg_ext op that does a concat without a long serialized string of insert_slice ops but instead more like a gather.

What's an "affinity" in the layers above `stream`?

I think this is the biggest design question. We may be able to side-step it with the interface and implementing it with the existing HAL DeviceTargetAttr. A user could use whatever they want in their layers and then before lowering into flow they map them to DeviceTargetAttr or implement the interface themselves when embedding the compiler. There's still value in a more coarsely defined set of things than a DeviceTargetAttr: device categories ("cpu" vs "gpu"), capabilities ("unified host/device memory" vs "discrete host/device memory"), and clustering ("8 gpu devices"), etc.

I think we can unblock work by using the DeviceTargetAttr as most of the initial usage is from frontends that specify exact devices (pytorch/jax). When we start building out SPMD partitioning we can look into refining that more.

Tasks

MVP

The initial version should have compiler support for input-specified affinities (a way to specify them, have them get propagated, etc), correct but non-optimal partitioning and execution ordering, and mapping into the HAL affinities. We can test this with the existing runtime on all backends as queue affinities are currently ignored and the rule is that we should always be able to do that. The local-task runtime backend should be fairly easy to extend into multi-worker pool mode to allow us to address up to 64*64 individual cores (64 queues with 64 workers each) and then add support for NUMA topology specification. CUDA is going to be trickier due to some tech debt but should be possible to bring online even if running serially.

Compiler

Stream dialect (@benvanik):

[x] Make AffinityAttr an interface with minimal set of methods to start (assume defaults for others)
[x] Implement the AffinityAttrInterface on DeviceTargetAttr
[ ] Fix RedundantTransferElision to check the interface for transfer compatibility

Flow dialect (@benvanik):

[x] Ensure DeduplicateExecutables deduplicates executables and preserves targets
[x] Preserve affinities in flow->stream lowering

HAL dialect (@benvanik):

[x] Derive queue affinities in stream->hal lowering (not just -1)
[ ] ~~Make AssignTargetDevicesPass use existing affinities (may just work)~~
[ ] ~~Ensure MaterializeInterfaces produces the correct set of variants based on usage (may need a new pass)~~
[ ] ~~Ensure LinkExecutables handles variants correctly~~

Runtime

CPU (@benvanik):

[x] Make the local-task device be able to support multiple task system executors
[x] Route queue affinities to executors

CPU NUMA support (?):

[ ] NUMA node pinning APIs and topology queries (per-process is easy, per-thread is tricky, needs testing)
[ ] Parse device params to set up topology (--device=local-task://?nodes=0,1,2 or something)

CUDA (?):

[ ] Make cuda driver async (this will be tricky, rest of the work can proceed)
[ ] Parse device params to set up topology (--device=cuda://GPU-A/?peers=GPU-B,GPU-C or something, --device=cuda://all, etc)
[ ] Make the cuda device be able to create multiple CUdevices with >=1 stream per device
[ ] Route queue affinities to devices
[ ] Ensure API calls use the proper device

Out of scope

Multiple heterogeneous devices
Automatic SPMD partitioning
Collectives support
Efficient transformation of already partitioned input programs
Dynamic queue availability with support for collectives
Advanced target specifications on command lines (iree-compile --run-matmuls-on-device-2 etc)

/cc @powderluv @harsh-nod I don't estimate the compiler tasks will take very long as I've got a good idea of where the skeletons are hiding. At least getting the CPU side ready for NUMA is also going to be pretty quick but I may need help from someone with a NUMA system and linux API experience to make that actually work. CUDA is 🤷 but could probably be hacked to make something work (removing some synchronizes and such).

aviator19941 commented 1 year ago

@benvanik Thanks for the overview, it's very helpful. I will start with NUMA node pinning APIs and topology queries task.

mattwalsh commented 1 year ago

So if I'm reading this right...

for now, we rely on the higher-level ops to have affinity settings that will propagate down to stream via some of the new work described above
that way, enterprising folks can marionette concurrency explicitly (but target-independently) while also creating a foundation for doing more sophisticated things later
one thing I'm not at all clear on is how the wait / signal controls manifest in the HAL. As in, if I develop a strategy that says "define 2 affinity values(presumably with DeviceTargetAttrs to indicate GPU), and dispatch 2 units of work via queues with affinity = 'don't care / -1'" would one simply need to assign the affinities to the ops and the rest would work? Or would one have to at the high level express barrier-like things to induce the 'fence' and 'signal' elements to show up in the HAL?

harsh-nod commented 1 year ago

Thanks Ben for the awesome writeup! This really helps clarify things and presents a good plan of execution. I had a few questions and comments.

If I understand this right, the new abstraction we are adding to iree is that of the hal.device.queue which is like a CUDA stream. First, how would decide how many logical queues to instantiate per logical device? Or would we determine that from the parallelism inherent in the graph? In a manner similar to stream.cmd.concurrent? Would each stream.cmd.execute lower to a hal.device.queue?
I do like the idea of operating on logical queues and devices. This gives the compiler more freedom to optimize the graph and allows for more dynamic underlying hardware topologies. So if you request an op to run in parallel on 4 GPU, then it will if you have 4 GPUs available, but it will degrade gracefully to the case where there are fewer than 4 GPUs (so you get mostly what you want with 3, less so with 2 and serial execution with 1 GPU)
In this implementation, are we assuming no data transfers between queues? And is this because of the unified address space assumption? Would this be override-able based on the information provided by the affinity interface?
Seems like an affinity of -1 passes the responsibility of device assignment from the compiler to the runtime. This would mean that the runtime would have to keep track of the status of all the queues to make a decision of what device to dispatch on. Do you think this information would come from the compiler or would the runtime have to monitor the devices or both?

Totally agree on starting with multiple queues on the same HAL device as the first milestone and then moving on to heterogeneous devices and looking forward to working on this!

powderluv commented 1 year ago

Bumping this up. I think this would be the first feature to implement for multi-device IREE. Peeling off items from https://docs.google.com/document/d/11UJmjPFSFWvFITwo-J1Z7OZ7Rqpz7JVk_kBl3JMQWQw/edit?usp=sharing

allieculp commented 1 year ago

@powderluv Is anyone from the Nod team working on this? @benvanik

rednoah91 commented 10 months ago

Hi @benvanik , do you know any follow-ups or plan of the multiple devices support feature?

iree-org / iree

Implement initial affinity support for multiple queues. #10765

HAL devices and queues

Affinities in the `stream` dialect

Defining affinities at the flow(ish) level

What's an "affinity" in the layers above `stream`?

Tasks

MVP

Compiler

Runtime

Out of scope

iree-org / iree

Implement initial affinity support for multiple queues. #10765

HAL devices and queues

Affinities in the stream dialect

Defining affinities at the flow(ish) level

What's an "affinity" in the layers above stream?

Tasks

MVP

Compiler

Runtime

Out of scope

Affinities in the `stream` dialect

What's an "affinity" in the layers above `stream`?