reble / llvm

Intel staging area for llvm.org contribution. Home for Intel LLVM-based projects.

Consider multi-device support in graphs API #7

Open EwanC opened 1 year ago

EwanC commented 1 year ago

There is a feature request for SYCL_EXT_ONEAPI_GRAPH to support creating an executable graph which contains commands targeting different devices. The complete graph targeting the different devices can then be submitted for execution at once.

Creating/submitting such a graph is not possible in the current API; instead, the user must compose submissions of separate executable graphs, each targeting an individual device. We should discuss whether multi-device support is a requirement and, if so, iterate on the spec to make it possible.
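For illustration, a minimal sketch of the current per-device composition workaround, assuming the record-and-replay API from the extension specification (the namespace and exact signatures may differ between spec revisions):

```cpp
#include <sycl/sycl.hpp>

namespace sycl_ext = sycl::ext::oneapi::experimental;

// Sketch: today the user records and finalizes one graph per device, then
// stitches the executable graphs together with event dependencies.
void run_on_two_devices(sycl::queue &qA, sycl::queue &qB) {
  sycl_ext::command_graph graphA{qA.get_context(), qA.get_device()};
  sycl_ext::command_graph graphB{qB.get_context(), qB.get_device()};

  graphA.begin_recording(qA);
  qA.submit([&](sycl::handler &cgh) {
    cgh.single_task([] { /* work recorded for device A */ });
  });
  graphA.end_recording();

  graphB.begin_recording(qB);
  qB.submit([&](sycl::handler &cgh) {
    cgh.single_task([] { /* work recorded for device B */ });
  });
  graphB.end_recording();

  auto execA = graphA.finalize();
  auto execB = graphB.finalize();

  // Composition is left to the user: submit each executable graph to a queue
  // on its own device and link them with an event dependency.
  sycl::event eA = qA.submit(
      [&](sycl::handler &cgh) { cgh.ext_oneapi_graph(execA); });
  qB.submit([&](sycl::handler &cgh) {
    cgh.depends_on(eA);
    cgh.ext_oneapi_graph(execB);
  });
}
```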

bjoernknafla commented 1 year ago

We are seeing highly heterogeneous customer chips that contain multiple different programmable accelerator units, which can be considered independent devices. To target such next-generation chips we need support for multi-device graphs.

Spawned from PR https://github.com/intel/llvm/pull/5626#discussion_r1027117452

EwanC commented 1 year ago

Multi-device Proposal

This proposal changes a modifiable command_graph from being a device-agnostic graph that is targeted to a single device at finalize time, to a graph in which each node targets a specific device.

As well as allowing a modifiable graph to contain nodes targeting different devices, this change also allows the implementation to process nodes which require device information when the command group function (CGF) is evaluated. For example, SYCL reductions require a work-group/sub-group size to be known, which the runtime normally obtains from the device associated with the queue.
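To illustrate the proposal with the record-and-replay API, the sketch below records from two queues on different devices (but the same context) into a single modifiable graph, so each node picks up the device of the queue it was recorded from. The names ctx, devA, devB, qA and qB are assumed, and the ability to record queues for different devices into one graph is part of this proposal, not the current specification:

```cpp
// Assumed under this proposal: one modifiable graph can hold nodes for
// several devices in the same context. The constructor shape is illustrative.
sycl_ext::command_graph graph{ctx, devA};

graph.begin_recording(qA);  // qA is associated with devA
qA.submit([&](sycl::handler &cgh) {
  cgh.single_task([] { /* node targeting devA */ });
});
graph.end_recording(qA);

graph.begin_recording(qB);  // qB is associated with devB, same context
qB.submit([&](sycl::handler &cgh) {
  cgh.single_task([] { /* node targeting devB */ });
});
graph.end_recording(qB);

// One executable graph containing nodes that target both devices.
auto exec = graph.finalize();
```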

This has design drawbacks for the use case where the user would like to run the same command_graph on N devices: rather than finalizing one executable graph N times for N devices, the user now has to record N modifiable graphs and finalize each of them. If this use case becomes a usability issue, we could provide a specific API to support it, for example:

a) Provide an overloaded finalize() method which takes a single device and overwrites the devices used in the modifiable graph with the passed device, so a single modifiable graph can still be reused for N devices.

b) Update the devices in a modifiable command_graph individually, similar to parameters, so that the same modifiable graph can be retargeted to new devices and then refinalized.
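A hypothetical sketch of option a), continuing the previous sketch; the finalize(device) overload below does not exist in the specification and is shown only to illustrate the idea:

```cpp
// Hypothetical: finalize(device) overrides the devices recorded in the
// modifiable graph, so one modifiable graph serves N devices.
std::vector<sycl_ext::command_graph<sycl_ext::graph_state::executable>> execs;
for (const sycl::device &dev : ctx.get_devices())
  execs.push_back(graph.finalize(dev));  // this overload is an assumption
```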

CUDA Graphs Analysis

Record & Replay API modifications

Explicit API modifications

Finalize modification

Modifications to executable command_graph submission

The queue an executable graph is submitted to must have the same context as the executable graph; however, execution of the multi-device graph is not limited to the device associated with the queue. Instead, the queue is used as the mechanism to submit the graph and to express execution dependencies.
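A short sketch of the intended submission model; handler::ext_oneapi_graph() is in the specification, while the multi-device semantics described above are what this proposal adds. Here exec is the multi-device executable graph from the earlier sketch, ctx its context, and dev and previousEvent are assumed names:

```cpp
// The submitting queue only needs to share the executable graph's context; it
// carries the submission and its dependencies, not the target of each node.
sycl::queue q{ctx, dev};
sycl::event e = q.submit([&](sycl::handler &cgh) {
  cgh.depends_on(previousEvent);  // the queue expresses execution dependencies
  cgh.ext_oneapi_graph(exec);     // submits the whole multi-device graph
});
```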

PI/UR Implementation

We anticipate that PI/UR command-buffers will be single-device, but that a fine-grained synchronization primitive will be designed to trigger commands in the command-buffers for the other devices in the same graph. This has been flagged to the UR team in our draft PI command-buffer design, and will be tackled once the basic command-buffer interface is working.

In the interim, we could disable the multi-device feature in the implementation as a limitation of the proof of concept.
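A conceptual sketch only of how a runtime might lower one multi-device graph onto single-device command-buffers linked by such a primitive. Every name below is hypothetical; the real PI/UR command-buffer interface is still being designed:

```cpp
#include <sycl/sycl.hpp>
#include <unordered_map>
#include <vector>

struct DeviceCommandBuffer {};  // stand-in for a single-device command-buffer
struct CrossDeviceSignal {};    // stand-in for the fine-grained sync primitive

struct LoweredGraph {
  // One command-buffer per device referenced by the graph's nodes.
  std::unordered_map<sycl::device, DeviceCommandBuffer> perDeviceBuffers;
  // A signal wherever a graph edge crosses a device boundary, so a command in
  // one buffer can trigger dependent commands in another device's buffer.
  std::vector<CrossDeviceSignal> crossDeviceEdges;
};
```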

EwanC commented 1 year ago

We've got feedback that the use case of finalizing the same graph for N devices should have first-class support.

There are two ways I can see to enable this while still supporting the multi-device concept:

1) Keep the existing device-agnostic modifiable graph design, but enable the use of placeholder devices for a node during graph construction that are populated at finalization.

2) Make a modifiable graph device-specific as proposed above, but provide an update mechanism.

I have a preference for 2) because being device-specific gives implementations the flexibility to use device information when processing nodes as they are added to the graph. This is the current situation for implementations, as the device associated with a queue can be used, and this extension should expose the existing graph representation in SYCL rather than try to rearrange the runtime implementation to provide something more powerful. Language features such as device USM pointers or built-in kernels used by a user also effectively tie a node to a device.
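A hypothetical sketch of what option 2) could look like, reusing graph, devA and devB from the earlier sketch; update_device() and its semantics are assumptions for illustration, not part of the specification:

```cpp
// Record once against devA, finalize, then retarget the modifiable graph to
// devB and refinalize, similar to how parameter update is expected to work.
auto execA = graph.finalize();    // nodes target devA as recorded

graph.update_device(devA, devB);  // hypothetical: retarget devA nodes to devB
auto execB = graph.finalize();    // second executable graph, now for devB
```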