Open benvanik opened 1 year ago
@benvanik Thanks for the overview, it's very helpful. I will start with the NUMA node pinning APIs and topology queries task.
So if I'm reading this right...
Thanks Ben for the awesome writeup! This really helps clarify things and presents a good plan of execution. I had a few questions and comments.
If I understand this right, the new abstraction we are adding to IREE is the hal.device.queue, which is like a CUDA stream. First, how would we decide how many logical queues to instantiate per logical device? Or would we determine that from the parallelism inherent in the graph, in a manner similar to stream.cmd.concurrent? Would each stream.cmd.execute lower to a hal.device.queue?
I do like the idea of operating on logical queues and devices. This gives the compiler more freedom to optimize the graph and allows for more dynamic underlying hardware topologies. So if you request an op to run in parallel on 4 GPUs, then it will if you have 4 GPUs available, but it will degrade gracefully to the case where there are fewer than 4 GPUs (so you get mostly what you want with 3, less so with 2, and serial execution with 1 GPU).
In this implementation, are we assuming no data transfers between queues? And is this because of the unified address space assumption? Would this be override-able based on the information provided by the affinity interface?
Seems like an affinity of -1 passes the responsibility of device assignment from the compiler to the runtime. This would mean that the runtime would have to keep track of the status of all the queues to make a decision of what device to dispatch on. Do you think this information would come from the compiler or would the runtime have to monitor the devices or both?
Totally agree on starting with multiple queues on the same HAL device as the first milestone and then moving on to heterogeneous devices and looking forward to working on this!
Bumping this up. I think this would be the first feature to implement for multi-device IREE. Peeling off items from https://docs.google.com/document/d/11UJmjPFSFWvFITwo-J1Z7OZ7Rqpz7JVk_kBl3JMQWQw/edit?usp=sharing
@powderluv Is anyone from the Nod team working on this? @benvanik
Hi @benvanik, do you know of any follow-ups or plans for the multi-device support feature?
NOTE: some of this is still in flux and when completed we can document this in more detail; this serves as a WIP design overview and tracking issue but not a design doc. This works backwards from what we have at runtime through to the frontend so we can see what gaps still exist.
On the path to multiple heterogeneous HAL devices is multiple queues on the same logical HAL device. This writeup presupposes that we start with that as a milestone and then incrementally build to heterogeneous/discrete devices.
HAL devices and queues
A logical device as exposed to IREE via the HAL may be composed of several (likely homogeneous) physical devices. Though not a strict requirement, the assumption is that there's a unified address space between the physical devices such that there's only a performance penalty for misplacing memory: this means a buffer allocated via the logical device is accessible by any physical device (e.g. peered GPUs). By using this approach things get much simpler in the compiler and runtime as compiled/loaded executables, associated data structures like pipeline layouts, and buffers can be handled once and uniformly regardless of the number of physical devices available at runtime. Extending support to heterogeneous devices then becomes focused on that work alone and the additional costs are paid only where required, while any number of homogeneous devices can be scheduled more efficiently. That is to say: supporting multiple hardware queues on a single physical device is functionally equivalent from the compiler's perspective to multiple physical devices exposed as a single logical device.
An example of the various configurations in mspaint:
This works because each device queue operation specifies a queue affinity in addition to fences that order the operations explicitly:
At runtime the device implementation gets that information and can use the affinity provided in order to map to whatever internal representation of queues it wants (CUDA streams, Vulkan VkQueues, Metal MTLCommandQueues, distinct thread pools attached to NUMA nodes, etc). The role of the compiler is to take operations in the `stream` dialect and assign affinities to them such that lowering into the `hal` dialect is a simple mapping to a `!hal.device` and queue affinity.

Where possible we intentionally want to keep which physical queue an operation is running on as an implementation detail. In most cases we can express the full asynchronous operation sequence with high enough fidelity that the runtime is able to dynamically partition the work as well as running all operations synchronously (referred to as the "forward progress guarantee"). There are some cases where this may not be the case, but it's largely a representational issue we'll eventually be able to work around for programs we generate ourselves vs ones that come from a user with specific requirements.
The largest TBD design item is around expressing device/queue affinities in the `stream` dialect (discussed below). Once that's done we already have queue affinities in the HAL and can perform the mapping to underlying device queues. The HAL queue affinity is specified as a bitmask indicating which logical queues an operation is allowed to run on (similar to thread affinity masks on the CPU): this provides a hint to the runtime that certain operations may benefit from running in the same location even if they don't require it.

Example of the compiler letting operations run on any available queue (all bits set):
Or on a particular queue:
Or overlap execution by running on two queues (which for most operations are allowed to be the same physical queue):
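As a rough sketch of the bitmask semantics (hypothetical names, not the real IREE C API): all bits set means the operation may run on any queue, a single bit pins it to one queue, and the runtime intersects the mask with the queues actually present on the device:

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical affinity mask type: bit i set => the op may run on queue i.
typedef uint64_t queue_affinity_t;

#define QUEUE_AFFINITY_ANY ((queue_affinity_t)~0ull)      // any available queue
#define QUEUE_AFFINITY_QUEUE(i) ((queue_affinity_t)1ull << (i))  // one queue

// Picks the lowest-indexed queue that is both allowed by the mask and
// actually present on the device; returns -1 if no queue matches.
static int select_queue(queue_affinity_t affinity, uint64_t available_queues) {
  uint64_t candidates = affinity & available_queues;
  if (!candidates) return -1;
  int queue = 0;
  while (!(candidates & 1ull)) {
    candidates >>= 1;
    ++queue;
  }
  return queue;
}
```

A mask of `QUEUE_AFFINITY_QUEUE(0) | QUEUE_AFFINITY_QUEUE(1)` would express the "overlap on two queues" case: the runtime may use either queue, including mapping both bits to the same physical queue.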
There's enough information to allow for some dynamic runtime fun: for example fences could track the physical queue they're to be signaled on and if another operation is enqueued waiting on that fence and it's allowed to run on at least the same queue the runtime can pin it there:
This allows for sequences of operations to be co-located based on load, quota, etc. Though there's still some mechanics required to improve it, this also works with joins:
This ability to chain is important as the fences may persist across control flow, function boundaries, or invocation boundaries all the way back up to user code and a sequence of invocations can use the masking to statelessly maintain locality while retaining the ability to dynamically load balance and provision execution resources.
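The pinning rule described above can be sketched as: if an operation waits on a fence and its affinity mask includes the queue that fence will signal on, pin the operation to that queue; otherwise leave it to the load balancer. A minimal C sketch with hypothetical types (not the real runtime structures):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical fence: remembers which physical queue will signal it.
typedef struct {
  int signaling_queue;
} fence_t;

// Returns the queue to pin to, or -1 to let the runtime load-balance
// across the queues allowed by the affinity mask.
static int pin_queue(uint64_t affinity, const fence_t* wait_fence) {
  if (wait_fence && (affinity & (1ull << wait_fence->signaling_queue))) {
    return wait_fence->signaling_queue;  // co-locate with the producer
  }
  return -1;  // no pin: any allowed queue may be chosen
}
```

Because the decision only consults the fence and the mask, a chain of operations each waiting on the previous one's fence naturally stays on one queue while remaining free to migrate whenever a mask excludes that queue.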
There are some open issues around dynamic queue availability and collectives but there are approaches we can explore in the compiler to work around that (structured collectives vs explicit send/recv/etc). For now we assume collective exploration work will happen with a fixed set of queues.
In summary: a HAL device contains one or more queues and neither devices nor their queues need to map 1:1 to physical constructs. Device queues are roughly equivalent to CUDA streams in that they are a way to logically sequence work independent of the physical command queues on hardware devices. Though physical queues may execute in FIFO order, the way work is scheduled in the HAL does not presume this is the case; work is made available for some set of queues to process and it's up to the runtime/driver/device to make that happen. It's always legal to execute in-order and simpler backends may always choose to do that.
Affinities in the `stream` dialect

Almost all operations in the stream dialect have an optional `#stream.affinity` attribute. When lowering into the stream dialect certain `flow` operations (dispatches, transfers, etc) which are annotated can have that annotation mapped in to an affinity on the stream ops. The partitioning pass that runs assumes no/partial affinities and uses any provided to explicitly specify affinities while filling in the rest as it runs. In extreme cases a user could specify an affinity for every op but the intent is that we eventually assign them ourselves as part of an SPMD pass.

The affinity attribute contents are currently undefined and it may be worth changing to an interface. There's a few things that we need to be able to get from such an interface in the stream dialect transformation pipeline:
- What resource configuration applies (`#stream.resource_config`)? (currently specified via `--iree-stream-resource-*` flags)
- Which target devices/backends are in use (currently specified via `--iree-hal-target-backends`)

Regardless of what implements it we'll want a composite attr that lets us OR affinities together and wildcard - this would let us say things like "run these operations on a GPU, but not specifically vulkan/cuda/etc" or "run these operations on a CPU with multithreading", etc. For any parameters like resource constraints this could also perform the min/max required.
We could have the existing HAL DeviceTargetAttr implement the interface to service all these requests keeping device info centralized in their respective target backends. Somewhat of a weird layering but we already have that in other places (like the ABI emitting HAL import/export ops). It also makes things extensible for out-of-tree backends.
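One way to picture the composite OR/wildcard affinity described above: an affinity is a set of terms, each of which can name an exact backend or wildcard it, and a device matches if any term matches. A minimal C sketch (all types and names are hypothetical, not the DeviceTargetAttr interface itself):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

// Hypothetical coarse device categories.
typedef enum { DEVICE_ANY, DEVICE_CPU, DEVICE_GPU } device_category_t;

// One term of a composite affinity: a category plus an optional exact
// backend name (NULL = wildcard, "any backend of this category").
typedef struct {
  device_category_t category;
  const char* backend;
} affinity_term_t;

// A composite affinity is an OR of terms: a device matches if any term
// accepts both its category and its backend.
static bool affinity_matches(const affinity_term_t* terms, int term_count,
                             device_category_t device_category,
                             const char* device_backend) {
  for (int i = 0; i < term_count; ++i) {
    bool category_ok = terms[i].category == DEVICE_ANY ||
                       terms[i].category == device_category;
    bool backend_ok = terms[i].backend == NULL ||
                      strcmp(terms[i].backend, device_backend) == 0;
    if (category_ok && backend_ok) return true;
  }
  return false;
}
```

With this shape, "run on a GPU, but not specifically vulkan/cuda" is the single term `{DEVICE_GPU, NULL}`, and ORing terms for multiple categories degrades gracefully to whatever devices exist at runtime.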
Since affinities have never been set, there's likely to be some work to fully propagate and handle them, as well as cleanup to do in places currently working around their omission. Thankfully on our near-term targets most of the answers to the above queries would be the default we have today, so we can work through those issues incrementally.
Defining affinities at the flow(ish) level
The critical bit of the process is that during flow->stream lowering we are able to put affinities on ops. How or even if the flow ops are annotated is kind of whatever-goes. We need to be able to capture a `with torch.cuda.device(1):` around a bunch of code in pytorch, or JAX `device_put`, or command line flags, or auto-tuning annotations, etc. We'll also support no/partial placement as we do today such that the stream dialect can perform that during partitioning.

Input programs (mhlo/linalg/etc) can put attributes on their high-level ops. This has some reliability issues (most patterns don't tend to preserve attributes) so alternatives may be useful: for example attributes on regions/functions that can be verified to preserve dialect attrs during conversion/transformation. A recursive walk up to ancestors would let us tag entire modules, functions, etc with placement information and write code like `if (user says run on cuda at runtime) { call function with cuda placement } else { call function with cpu placement }` or whatnot. Dispatch region formation would need to be updated to look for these attributes and not fuse ops with different placement, as well as carry across the placement on the `flow.dispatch` op. We'd want to strip the placement attributes while creating the region such that we could still deduplicate executables: two executables produced from two dispatches for different compatible devices should only be compiled once.

Propagation and specification of affinities on high-level ops are left to the user but there are some representations we'd really like to work towards (structured collectives/etc). In the first iteration focusing on simple rules will let us at least get some programs running: for example, ensuring tensor insertion is done in a way that doesn't add global barriers (what we really need the structured form to solve). One way to do this may be a linalg_ext op that does a concat without a long serialized string of insert_slice ops but instead more like a gather.
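The gather-style concat idea can be pictured as each source shard writing into a disjoint slice of the destination: because no slice overlaps, the copies carry no cross-shard dependency and could be scheduled on different queues rather than as a serialized insert_slice chain. A minimal C sketch (names hypothetical):

```c
#include <assert.h>
#include <string.h>

// Hypothetical gather-style concat: shard i writes only
// dst[offset_i .. offset_i + size_i), so the loop iterations are
// independent and could each be enqueued on a separate queue.
static void concat_shards(float* dst, const float* const* shards,
                          const int* shard_sizes, int shard_count) {
  int offset = 0;
  for (int i = 0; i < shard_count; ++i) {
    memcpy(dst + offset, shards[i], (size_t)shard_sizes[i] * sizeof(float));
    offset += shard_sizes[i];
  }
}
```

A serialized insert_slice chain computes the same result but threads the whole destination tensor through every step, forcing an order; the gather form exposes the parallelism directly.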
What's an "affinity" in the layers above `stream`?

I think this is the biggest design question. We may be able to side-step it with the interface and implementing it with the existing HAL DeviceTargetAttr. A user could use whatever they want in their layers and then before lowering into flow they map them to DeviceTargetAttr, or implement the interface themselves when embedding the compiler. There's still value in a more coarsely defined set of things than a DeviceTargetAttr: device categories ("cpu" vs "gpu"), capabilities ("unified host/device memory" vs "discrete host/device memory"), clustering ("8 gpu devices"), etc.
I think we can unblock work by using the DeviceTargetAttr as most of the initial usage is from frontends that specify exact devices (pytorch/jax). When we start building out SPMD partitioning we can look into refining that more.
Tasks
MVP
The initial version should have compiler support for input-specified affinities (a way to specify them, have them get propagated, etc), correct but non-optimal partitioning and execution ordering, and mapping into the HAL affinities. We can test this with the existing runtime on all backends as queue affinities are currently ignored and the rule is that we should always be able to do that. The `local-task` runtime backend should be fairly easy to extend into multi-worker pool mode to allow us to address up to 64*64 individual cores (64 queues with 64 workers each) and then add support for NUMA topology specification. CUDA is going to be trickier due to some tech debt but should be possible to bring online even if running serially.

Compiler
Stream dialect (@benvanik):
Flow dialect (@benvanik):
HAL dialect (@benvanik):
- [ ] Make AssignTargetDevicesPass use existing affinities (may just work)
- [ ] Ensure MaterializeInterfaces produces the correct set of variants based on usage (may need a new pass)
- [ ] Ensure LinkExecutables handles variants correctly

Runtime
CPU (@benvanik):
CPU NUMA support (?):
- (`--device=local-task://?nodes=0,1,2` or something)

CUDA (?):
- (`--device=cuda://GPU-A/?peers=GPU-B,GPU-C` or something, `--device=cuda://all`, etc)

Out of scope
- (`iree-compile --run-matmuls-on-device-2` etc)

/cc @powderluv @harsh-nod I don't estimate the compiler tasks will take very long as I've got a good idea of where the skeletons are hiding. At least getting the CPU side ready for NUMA is also going to be pretty quick, but I may need help from someone with a NUMA system and Linux API experience to make that actually work. CUDA is 🤷 but could probably be hacked to make something work (removing some synchronizes and such).
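For the Linux CPU side, the per-queue worker pinning could plausibly be built on the standard `sched_setaffinity` API. A minimal sketch assuming, purely for illustration, that queue i owns core i (a real implementation would consult the NUMA topology to pick a core set per node):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>

// Hypothetical sketch: restrict the calling thread to the core owned by
// the given queue. Returns 0 on success, -1 on failure (see errno).
static int pin_current_thread_to_queue(int queue) {
  cpu_set_t set;
  CPU_ZERO(&set);
  CPU_SET(queue, &set);  // assumption: queue i <-> core i
  return sched_setaffinity(0, sizeof(set), &set);  // 0 = calling thread
}
```

A NUMA-aware version would set all bits for the cores of one node (queried via sysfs or libnuma) rather than a single core, giving each of the 64 worker pools a node-local core mask.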