RFC: Multi-device Resources

Summary

This proposal outlines how resources (buffers, images, etc.) and related classes and structs are prepared for multi-device support in Atom. This is the first necessary step towards multi-device support and already comprises of a significant amount of changes. The main idea here is to introduce a new layer of data structures in the RHI, which support multiple devices and then forward calls to the RHI backends for each device. A non-goal of this proposal is to already support multi-device execution, although changes to the frame graph already have to be made for this to work as well as the transition from multi-device objects to single-device execution, which currently happens when advancing from frame graph compilation to execution. A side goal is to keep the backend and RPI code as unaffected as possible to (1) avoid having to implement multi-device features multiple times within every backend and (2) avoid all higher-level code (RPI and above, as well as any third-party Gems) having to change much. However, this is not completely possible to avoid.

What is the relevance of this feature?

The ultimate goal is to support multi-device execution within Atom. This could for example be used to render stereoscopic images for virtualy reality headsets on two devices simultaneously, i.e., one device per eye. The first step towards this goal is removing the globally accessible device. When looking into this, one quickly finds that in most places, one does not know on which device a resource should then be allocated on, since there is not enough context for the execution. For example, when mesh data is loaded and uploaded to the device, one does not know which device to use, since one doesn't know which render passes will use that mesh and where they will be executed. It is actually likely, that the data is required on multiple devices. Therefore, support for having resources on multiple devices is necessary and this needs to be established before any execution on multiple devices can be accomplished.

Feature/Technical Design Description

The basic idea of the technical solution is to introduce new classes and structs in the RHI which support multi-device resources. This includes most of the classes derived from DeviceObject with few exceptions. The base class for these new classes will be a MultiDeviceObject, which stores a device mask, denoting on which devices this resource will be present. This device mask is a 32-bit field, where each bit represents one device. The RHISystem will support an array of multiple devices where each device can be accessed through an index (corresponding to the bit position in the device mask, e.g., the least significant bit corresponds to the device with index 0). Internally, each multi-device object will then contain a map, which maps from the device index to the corresponding DeviceObject-derived object. An example is given in the following UML diagram, where we show this for the buffer resources.

UML Diagram showing DeviceObject <|- DeviceResource <|- DeviceBuffer ----<> MultiDeviceBuffer -|> MultiDeviceResource -|> DeviceObject

Any API calls on a multi-device object will be forwarded to all managed device objects, unless otherwise specified, e.g., through providing a mask, selecting which devices should apply the call. The masks may also contain bits for devices that are not actually available, which will be ignored. This allows us to have an AllDevicesMask mask, where all bits are set independently of how many are actually available. The default device has device index 0 and will be also available through a DefaultDeviceMask, where only the least significant bit is set. The following is a list of all the classes, where both a single-device and a multi-device version will be available. All "final" classes, i.e., those that are not further derived from, contain a map in the respective multi-device class, which maps to the corresponding single-device class objects.

Object
    Fence
    IndirectBufferSignature
    PipelineLibrary
    PipelineState
    RayTracingBlas
    RayTracingBufferPools
    RayTracingPipelineState
    RayTracingShaderTable
    RayTracingTlas
    Resource
        Buffer
        Image
        Query
        ShaderResourceGroup
    ResourcePool
        BufferPoolBase
            BufferPool
        ImagePoolBase
            ImagePool
            StreamingImagePool
            SwapChain
        QueryPool
        ShaderResourceGroupPool
    TransientAttachmentPool

DeviceObject classes without multi-device version

There are a few classes in the DeviceObject hierarchy, which do not get a multi-device class version. The first few classes of these are ResourceView-derived classes, i.e., BufferView and ImageView. Outside their RHI backend derivatives, these two only have two members each, the corresponding resource (buffer or image) and a descriptor, which is device independent. Since the memory management regarding views is quite complicated as is, as the resources store references to their views and a layer of multi-device objects would complicate this even more, we opt to change the higher-level interface at this point. Instead of a multi-device view, we use the combination of the resource and descriptor instead of binding them together.

The classes AliasedHeap, AliasedAttachmentAllocator and CommandQueue seem to not need a multi-device version, since they are never used outside of a single-device context. However, the class CommandQueue will actually handle multi-device classes and translate them to single-device objects, without needing a separate multi-device version, as we do not currently plan to be able to submit work to multiple devices at the same time. More details will be discussed further below.

Other classes and structs with both a single-device and multi-device version

Other classes and structs, which do not themselves derive from DeviceObject, need multi-device support as well. Most of them are descriptors, requests and items (for copy, dispatch and draw). They need a multi-device version, because one or more of their members reference some DeviceObject derivative and are used both in the higher-level API, where we now want to use multi-device objects, as well as in the RHI backends, where we need their single-device version. In most cases, those classes and structs are passed as parameters to some methods of the resource classes. Therefore, the conversion happens mostly inside the multi-device classes, when they loop over their single-device object map and forward the call.

Some of the multi-device classes need a map as well in order to be able to be used properly or more efficiently. These classes are ImageSubresourceLayoutPlaced, IndirectBufferWriter, PipelineLibraryDescriptor, ShaderResourceGroupData and the *Item classes. The derivation of ImageSubresourceLayoutPlaced from ImageSubresourceLayout and their polymorphic usage actually makes it difficult to have a multi-device version of them without using unnecessarily performance-impacting dynamic casts to figure out which one is actually behind a pointer. Therefore, we propose to merge the two and, if necessary at all, add a boolean, whether the offset is used or not.

Here is a list of the affected data structures:

Classes:

IndexBufferView
IndirectBufferView
IndirectBufferWriter
RayTracingBlasDescriptor
RayTracingTlasDescriptor
RayTracingPipelineStateDescriptor
RayTracingShaderTableDescriptor
ResourceEventInterface
ResourceInvalidateBus
ShaderResourceGroupData
StreamBufferView

Structs:

BufferInitRequest
BufferMapRequest
BufferMapResponse
BufferStreamRequest
CopyBufferDescriptor
CopyImageDescriptor
CopyBufferToImageDescriptor
CopyImageToBufferDescriptor
CopyQueryToBufferDescriptor
CopyItem
DispatchArguments
DispatchItem
DispatchRaysArguments
DispatchRaysItem
DrawArguments
DrawItem
DrawItemProperties
ImageInitRequest
ImageSubresourceLayoutPlaced
ImageUpdateRequest
IndirectArguments
IndirectBufferSignatureDescriptor
PipelineLibraryDescriptor
RayTracingGeometry
RayTracingTlasInstance
RayTracingShaderTableRecord
StreamingImageInitRequest
StreamingImageExpandRequest
TransientAttachmentPoolDescriptor

Switching from multi-device to single-device for execution

At some point we have to switch from multi-device resources to single-device resources. The last chance for that is when submitting work to a specific device. Therefore, we propose (at least for now) to use a single frame graph for multiple devices. Doing this should help to synchronize the work between multiple devices, by being able to use already present synchronization mechanisms in the frame graph, just extended for multiple devices. This is not fleshed out in detail yet and not the goal of this proposal, but in order to get this code to work at all, we need to have some point of transition. For now, we basically transition between frame graph compilation and execution. Since we are not actually submitting work to multiple devices yet, our plan is to always transition to the default device at the moment.

As mentioned before, the multi-device versions of *Item classes contain a map to their single device version. This is not strictly necessary, since a conversion could simply be done when necessary, but since they are potentially kept for multiple submissions and even multiple frames, it is more efficient to cache the single-device structs as well. The conversion from the multi-device to single-device version happens in the RHI::CommandList::Submit methods, which forward the calls to the corresponding backend.

Planned git history

We intend to divide the pull request for this proposal into four commits, which should simplify the review. Also, looking at the changes as a whole, git will not detect renaming of files properly, which it still does, when looking at the first commit only for example.

The first commit is "simply" a rename of all relevant structs and classes to Device*. Since this can mostly be done by search and replace with little manual intervention, there should not be much that could fail here.
In a smaller, second commit we actually initialize multiple devices in the RHISystem. We introduce a new command line switch (--rhi-multiple-devices) to initialize all available devices, otherwise only one device is initialized as before. Note again that we will allocate resources on these devices, but only submit commands to the default device.
The next commit introduces the new multi-device classes with the original naming of the classes that were renamed in the first commit.
Finally, in the last commit, we will rewrite everything to use the new multi-device RHI classes rather than the Device* classes. This concerns mainly the RPI classes of those resources, but also direct uses of the RHI classes, which constitute a considerable amount of the changeset. This will be the critical commit to check, if everything is working.

What are the advantages of the feature?

The RHI::Device is not a global anymore and it is possible to use multiple devices.
It is not necessary to know in advance, which devices a resource will be used on, since it can be easily allocated on multiple devices.
The API of how to use the RHI resources hardly changes and only few changes in the RHI backends are necessary.

What are the disadvantages of the feature?

Adding a layer in the RHI will have some minimal performance impact.
Some complexity is added to the RHI layer that future developments have to consider.
Some changes at higher levels (RPI, FeatureProcessors, Gems) are necessary.

How will this be implemented or integrated into the O3DE environment?

The feature will be developed in a branch parallel to the development branch, until properly tested and ready for merging. When merged, third-party projects will have to make some smaller adjustments to the new API changes. During the testing and development phase, we will need help for MacOS/Metal and mobile plattforms, since we do not have the necessary hardware and software available to test this.

Are there any alternatives to this feature?

We tried an approach to introduce multi-device support in the RPI resources, which turned out not to be possible due to some objects needing multi-device support in the RHI. Another approach would be to support multi-device resources directly within the RHI resources without an additional layer at the cost of having to implement everything for each backend which, would probably add unnecessary code duplication at hardly any performance benefit. It would also be possible not to have multi-device resources at all, but that would complicate any further multi-device developments, since resources then have to be held on multiple devices at a much higher level in the code base, leading to the inconvenient case that every feature that should support multiple devices has to keep track of resources on each device manually. This would likely also lead to a lot of code duplication.

How will users learn this feature?

Most of these changes should not affect users at all. Any changes necessary to higher level code should be straightforward and easy to implement. The required changes will be documented properly and changes in o3de itself and the atom-sampleviewer will provide plenty examples.

Are there any open questions?

The naming of the classes and structs is still something that is open for debate. The least amount of changes in the higher level code base would happen if the multi-device versions would take over the names of the single-device versions and single-device versions get a new name. I.e., the current Buffer class will be renamed to DeviceBuffer and the multi-device version of it will simply be named Buffer. An alternative would be to call the classes DeviceBuffer and MultiDeviceBuffer respectively.

Based on the discussions here https://github.com/o3de/sig-graphics-audio/discussions/32 I think the RFC looks good for the most part. The design around DeviceBuffer and MultiDeviceBuffer sounds good. A few points below ->

AliasedHeap, AliasedAttachmentAllocator - They both can be used within the context of one device but they do cache transient resources so we will need to modify this to eventually store the transient resource associated with the device the existing heap is attached to. Hopefully extending TransientAttachmentPool to device specific pools will help resolve this issue.
CommandList/CommandQueue device specific management is missing at the moment.
When converting a struct/class that does not inherit from a DeviceObject we need to ask ourselves if the object contains properties that will differ across multiple devices. As an example lets look at IndexBufferView. It contains RHI::Buffer + some other properties related to the buffer. RHI::Buffer which will already incapsulate RHI::DeviceBuffer underneath it. Now do we think that m_byteOffset/m_byteCount/m_format (other properties) will change for different devices? If they don't then do we need to modify RHI::IndexBufferView at all? Same logic can be applied to other structs/objects.
Basically for all the affected Descriptor/Request type structs we should try to follow the same pattern where RPI will populate the data for all the devices and RHI will hopefully deal with device specific data in order to keep RPI api clean. As an example lets visit a struct like RHI::BufferInitRequest vs RHI::DeviceBufferInitRequest. It would be nice to have it setup such that RHI::BufferInitRequest is probably only going to be used by RPI to create an RHI::Buffer where as RHI::DeviceBufferInitRequest is setup to only build out one RHI::DeviceBuffer and should only be used by RHI. It would make sense to add and remove variables within RHI::BufferInitRequest/RHI::DeviceBufferInitRequest to that effect. For example BufferInitRequest could contain a device mask telling RHI how many DeviceBuffers to create. BufferDescriptor probably only makes sense to live within BufferInitRequest and not within DeviceBufferInitRequest. At Buffer::Init RHI will then use BufferInitRequest to build out multiple DeviceBufferInitRequest and populate RHI::Buffer as needed.
Other than adding samples to ASV for multi gpu support also consider extending all our in-game tooling around gpu for thie work. So gpu memory profiler, gpu pass tree visualizer, gpu profiler, etc.

Thank you for the response! Here are our thoughts to the points you raised:

From what we tried so far, it seemed enough to have a single-device and a multi-device version of TransientAttachmentPool as you wrote and no multi-device versions of AliasedHeap and AliasedAttachmentAllocator should be necessary. We could be wrong of course, but that should be easily resolvable if it becomes an issue after all.
We are not sure what exactly you miss here?

However, the class CommandQueue will actually handle multi-device classes and translate them to single-device objects, without needing a separate multi-device version, as we do not currently plan to be able to submit work to multiple devices at the same time. More details will be discussed further below.

Is that the part you were looking for, or are we misunderstanding something here?

The issue here is one of dependencies. Given that we established to have multi-device and single-device versions of DrawItem, which references an IndexBufferView, it is necessary to have both versions as well for IndexBufferView, otherwise the backends would again need a device index to get the proper buffer out of the device-specific DrawItem.
Sounds good. There are certainly some optimizations that can be done with the structs. If we understand it correctly in your example here, you would basically get rid of the DeviceBufferDescriptor and put all necessary information into the DeviceInitRequest directly?
In our sample pull request, we partially changed parts there already, but we didn't want to change too much. Our strategy was not to change anything in the UI, so we used total statistics over all device where applicable. But nicer tools here that allow you to investigate multiple devices individually will certainly be required.

We think our main goal here should be to get these changes in as quickly as possible so that any optimizations and additional features/improvements can be built on later. This concerns points 4 and 5 mainly. The reason behind this is again, that this whole thing is a considerable amount of changes and the development costs compound the longer it takes, since every rebase takes additional time and more changes/fixes.

Ok sure. That is fine then. So DeviceIndexBuffer will essentially be a cached version of device specific data residing within Device*Items.
Yes. Basically avoid two versions of descriptors/requests structs if possible. Of course if there is no way around it we can have a device version but we want to make sure that the device version is for RHI use only and is hopefully not exposed to RPI.

Regarding ResourceView-derived classes (BufferView and ImageView), the current design intends not to provide "multi-device" variants of these classes, which necessitates changes to the API. In particular, ShaderResourceGroupData currently holds arrays of ImageViews and BufferViews and has functionality to get/set individual and also many views at the same time.

In particular, the following methods and members would need to be updated:

Methods
- Get/Set*
- *ImageView
- *ImageViewArray
- *ImageViewUnboundedArray
- *BufferView
- *BufferViewArray
- *BufferViewUnboundedArray
- *BindlessViews
- *ImageGroup
- *BufferGroup
- BindlessResourceViews
- Validate*View
- Validate*ViewAccess
Members
- m_imageViews
- m_bufferViews
- m_imageViewsUnboundedArray
- m_bufferViewsUnboundedArray
- m_bindlessResourceViews

The question is, how should this be realized, since these methods now need to be able to, instead of managing a *View, receive/supply a *ViewDescriptor and a MultiDevice* resource.

Proposed solutions

`AZStd::pair<MultiDeviceResource, ResourceViewDescriptor>`

Potentially the most effective and simplest way would be to use a pair, consisting of the corresponding multi-device resource and its view descriptor. That way, the API would have minimal changes and would remain identical semantically, except at the caller site, where one would have to construct/deconstruct the pairs for further use.

Pass in resources separately

It would also be possible to pass in the resources as separate parameters, also storing them in separate arrays internally, but that would bring up the question of how to return said resource + descriptor. By returning pointers to separate parameters, the API would need to change the most (i.e. changing getters to void and returning both resource + descriptor as reference). A middle way would also be possible (passing in separate, store as pair and return as pair), but that solution seems most inconsistent.

We would prefer the option using pairs, but feedback would be appreciated :blush:

Just so I understand correctly let me go over how I am imagining the initial setup of resource binding will work for multi-gpu support at RPI level. Currently RPI has RPI::ShaderResourceGroup (RPI::Srg for short) which has a bunch of functionality but we only need to focus on two variables here -> RHI::ShaderResourceGroup (RHI::Srg for short) and RHI::ShaderResourceGroupData (RHI::SRGdata for short). RPI currently manages RHI::SRGdata and allows passes to populate it with resource views. This object is then fed into RHI::Srg which uses it to compile and bind all the resource to the native RHI backend object (i.e Descriptor set/descriptor table/argument table).

With multi gpu support I am assuming that each pass at the RPI level still is only meant to be run on one device. Since one pass can only run on one device then a pass can contain RPI::SRG that is only meant to hold views for one device only. This actually makes code at the RPI level simpler. RPI::SRG can behave the same way where one RPI::SRG can have still only have one RHI::SRGdata and one RHI::Srg. RHI::SRG does not need to be split to RHI::DeviceSRG as it will only ever be used on one device. RPI::SRG api related to Get/Set views can stay the same and it's contents can stay the same. The only thing that will change is the higher level code that is invoking this api. The higher level pass code knows which device it will need to run on (from the render pipeline). Lets say this device index is called m_deviceIndex and lives within Pass class. Now any code that is invoking the api of RPI::SRG it will need to get the resource view via the device index. Based on our last conversation since the view hierarchy did not go through Device split I am assuming it is still DX12::BufferView->RHI::BufferView->RHI::ResourceView->RHI::DeviceObject->RHI::Object and RHI::DeviceResource has cache of all the created RH::ResourceViews. So GetResourceView will need to be expanded to take a device index. It will use the device index to cache the RHI::ResourceView within the correct RHI::DeviceResource. So RPI (i.e the higher level code) will provide the device index when creating/getting resource view. This RHI::ResourceView can then be used by the RPI::SRG and no api changes will be needed.

Let me know if this is confusing and if anything needs expanding or if I missed something myself and am completely on the wrong track.

It's not confusing, but unfortunately, you are on the wrong track. It's true that if passes only run on a single device you would not need multi-device SRGs. But passes are not the only place where SRGs are used. Think of the scene SRG which is created independently from a pass and is very likely needed on multiple devices. Another way you can think of it: if there are no multi-device SRGs, you cannot have multi-device Draw-/DispatchItems and we already had a lengthy discussion about why we need those. So, unfortunately, we need a multi-device SRG. Given this, could you please again consider the API changes proposed (i.e. using a pair or passing in as separate parameters).

Ahh I see your point about multi-device SRG. Ok Let me backtrack a bit. In order to support multi-device RHI::SRGs this is how I would think the new design may look like. Remember that RPI::Srg is the one that populates RHI::SRGdata which is just a container object and passes it to RHI::Srg. Lets say that now RHI::Srg goes through the split to RHI::DeviceSRG and now RHI::SRG can contain multiple DX12::SRGs. Same thing applies to RHI::ShaderResourceGroupPool.

RPI::Srg will now need to manage multiple RHI::SRGdata, one per device. So RHI::SRGdata api will not change but RPI::SRG api will need to change. I think your suggestion can work here. So RPI::SRG's Get/Set methods can be modified to take in azstd::pair<MultiDeviceResource, ResourceViewDescriptor> but it needs to know the device index which will tell it which RHI::SRGdata to update. Maybe it can get the device index from the ResourceViewDescriptor? You would also want to keep the current api where Get/Set methods also take in a RHI::ResourceView as the higher level code may want to keep a cached view around and just want to pass that in instead of a pair.

So now we have RPI::SRG that is now managing multiple RHI::SRGdata and one multidevice RHI::SRG. Next we will need to change RHI::SRG's Compile api. Currently RHI::SRG takes a RHI::SRGdata and essentially updates the native RHI object from that. We will need change the api to now take multiple RHI::SRGdata and internally it will call Compile on DX12::SRG with the correct RHI::SRGdata. So just to conclude we will need to modify RPI:SRG api and RHI::SRG's compile api. Does this work?

Thanks for your suggestions! What you are describing is basically what we have in mind for the multi-device SRGD class, except that you plan to implement it in the RPI::SRG instead. This would work, but we think it would still be nice to have a MDSRGD class to separate this functionality from the RPI, have it testable separately and consistent with the other multi-device classes which are implemented on the RHI rather than the RPI level. If you don't have any strong objections to having a separate multi-device SRGD class, we will implement it like this. You are of course right that the API of the RPI::SRG has to be adapted later then as well to reflect these changes in the MDSRGD class, when we port the RPI to use the MD classes.

yeah we could push the multidevice functionality to RHI::SRGdata. It would keep RPI clean. I am not opposed to pushing the functionality down to RHI and basically have a RHI::MultiDeviceSRGData contian multiple RHI::DeviceSRGdata and RPI is only aware of RHI::MDSRGData. That can work too.

Since this RFC is accepted please open a PR and move this RFC to this folder - https://github.com/o3de/sig-graphics-audio/tree/main/rfcs where we will track all the new RFCs. Thanks.

Since this RFC is accepted please open a PR and move this RFC to this folder - https://github.com/o3de/sig-graphics-audio/tree/main/rfcs where we will track all the new RFCs. Thanks.

Here you go: https://github.com/o3de/sig-graphics-audio/pull/137

@moudgils I've been working on using the MD resource classes in the RPI. Doing that I ran into the issue that I needed to use MD classes in the scope attachments. We also do so in our prototype. However, as I was thinking about it more, I realized that we might not want that.

The basic idea that we are following is that a scope (and thus a pass) is attached to one device. In the prototype for example, the MD SRGs are submitted for compilation by the passes. But that means depending on the device mask of that MD SRG, multiple SD SRGs are unnecessarily compiled, even though only one of them is required for the specific task. The better solution would be to transition from MD to SD already when submitting for compilation in the passes rather than between compilation and execution.

This would mean, that the passes need to be more aware of the device they are running on, i.e., call device specific methods of the RPI classes (passing their device index) that store the MD objects. An example here is the following (bringing us back to the scope attachments that triggered my thinking around this): The CompileResources method in the passes which is called during frame graph compilation may get an image view from the compile context via a scope attachment (which then would already be SD, not MD) and then binds this in an RPI SRG. That method would need to get the device index from the SD image view and only update the corresponding SD SRG.

What are your thoughts about this?

o3de / sig-graphics-audio

Proposed RFC Feature: Multi-device Resources #120

RFC: Multi-device Resources

Summary

What is the relevance of this feature?

Feature/Technical Design Description

DeviceObject classes without multi-device version

Other classes and structs with both a single-device and multi-device version

Switching from multi-device to single-device for execution

Planned git history

What are the advantages of the feature?

What are the disadvantages of the feature?

How will this be implemented or integrated into the O3DE environment?

Are there any alternatives to this feature?

How will users learn this feature?

Are there any open questions?

Proposed solutions

`AZStd::pair<MultiDeviceResource, ResourceViewDescriptor>`

Pass in resources separately