microsoft / DirectXShaderCompiler

This repo hosts the source for the DirectX Shader Compiler which is based on LLVM/Clang.
Other
3.05k stars 677 forks source link

Buffer Address Proposal: Create buffers from other buffers, similar to loading from ResourceDescriptorHeap #4732

Closed Tobski closed 1 year ago

Tobski commented 1 year ago

ResourceDescriptorHeaps are an incredibly flexible way of accessing descriptors, but they imply that all descriptors are the same size, which is untrue more or less universally across vendors. As a simple example, an unformatted buffer is generally going to look like an address and a size, whereas an image has significantly more data associated with it.

Vulkan Working Group members (including AMD) have looked at preliminary data suggesting that having all descriptors in a homogenous array results in notable slowdown (high single to low double digits) for some apps compared to using separate parallel arrays, verified by experiments with VK_EXT_mutable_descriptor_type.

Some mitigation strategies exist for drivers, but generally speaking we'd like to see developers given tools to manage this better themselves as we move to increasingly "bindless" ways of managing descriptors (and thus less scope for driver intervention). A key idea we've identified is that if we could separate buffers (and acceleration structures) out from the main resource heap, and provide these in a separate array, it would allow developers the option to pull these from a more tightly packed array with limited shader changes, and we believe this can recover most or possibly all of the performance compared to just using a single flat array.

Ideally, we'd like to see this addressed in a way that works for both Vulkan and DirectX, without requiring any API changes. One potential path we see is to use buffer addresses for this purpose. In both Vulkan and DX12 it's possible to obtain a GPU address for a buffer resource, and in Vulkan it's possible to further use this in the shader in lieu of a buffer descriptor, though it's exposed in HLSL in the vk namespace as a simple load from address, and doesn't fit very neatly with the rest of HLSL. We also want to avoid doing something where we add pointers to HLSL, as this is both an enormous task and may be undesirable as HLSL is a largely "robust" language which adding significant pointer support could compromise.

An option we'd like to entertain is providing a way to load a buffer from another buffer in the same way as one would load a buffer from the ResourceDescriptorHeap, roughly as follows:

// Root buffer acting as a heap
StructuredBuffer<uint64_t> BufferHeap;

[numthreads(32,1,1)]
void CSMain()
{
    // Resource Heap syntax
    RWStructuredBuffer<float4> myBuffer = ResourceDescriptorHeap[0];

    // Rough proposed new syntax
    RWStructuredBuffer<float4> myBuffer2 = BufferHeap[0];
}

There are questions to ask about what limitations might be reasonable on this (e.g. can that uint64_t value be anywhere?), and/or whether there should be a more explicit cast than is provided from ResourceDescriptorHeap or some other syntax. It might also be reasonable to have uint64_t pairs (address and size) rather than just an address, to provide more robustness guarantees, possibly optionally.

This would give developers access to buffer addresses, in a way that is hopefully in keeping with existing HLSL syntax. Additionally, it avoids introducing pointers to the language, but still gives developers a way to describe fairly complex recursive data structures if they so wish, or pull data in ways that were previously not possible in HLSL. There's some danger of providing developers with too much rope here, but we believe this should be close to the minimum viable proposal for adding this functionality.

So some questions:

  1. Is this of interest to ISVs?
  2. Is this of interest to Microsoft?
  3. Are there any constraints which should be considered with this?

We could work to add this to the vk namespace, but if this is interesting to those targeting DX12 it might be a feature that makes sense in the next Shader Model so it can work for any target API.

jeremyong-az commented 1 year ago

I'm excited about the prospect of an analogous feature for ResourceDescriptorHeap in Vulkan!

Regarding the bindless buffer access, my feeling is that many ISVs already suballocate within buffer resources to do this indirection because of the alignment requirement correct? I'm curious how the single to double digit slowdown figure was estimated because if there are workarounds that already use existing code paths, keeping the API simpler (and similar to what exists already in DX12) may still be preferable.

Tobski commented 1 year ago

I'm excited about the prospect of an analogous feature for ResourceDescriptorHeap in Vulkan!

Yes that's sort of implied here I suppose - didn't need to raise an issue for that because #4255 exists :)

Regarding the bindless buffer access, my feeling is that many ISVs already suballocate within buffer resources to do this indirection because of the alignment requirement correct?

Hard to say, though I do know for sure there are apps that do not do this.

I'm curious how the single to double digit slowdown figure was estimated because if there are workarounds that already use existing code paths, keeping the API simpler (and similar to what exists already in DX12) may still be preferable.

Primarily this is experiments via games running on vkd3d-proton, which supports both parallel arrays and a homogenous one; allowing us to do a fairly quick switch by hiding the mutable descriptor type extension and comparing results. We'd still be aiming to support ResourceDescriptorHeap for any and all descriptors, but we would like to see this option for developers as well.

yuriy-odonnell-epic commented 1 year ago

From Unreal Engine perspective, avoiding buffer descriptors entirely in favor of uint64 pointer-like mechanism is extremely interesting. In our increasingly-GPU-based rendering architecture, we have to maintain a various GPU-side data structures that represent our scene. We store most of the persistent GPU-side scene data in large contiguous buffers, which pose pretty major challenges when parts of the scene are loaded / unloaded. We see significant hitches when we have to grow/shrink these buffers (realloc + copy costs). Additionally, the we can quite easily hit the 2GB buffer size limit in some of the larger scenes. Having the true bindless buffers (i.e. just the address) would allow us to implement much more efficient schemes for organizing our persistent data, such as chunked arrays. This also solves the big functionality gap that we see when looking at transitioning our DXR code from local root signature based "bindful" scheme to explicit bindless resources. We currently use root descriptors to bind some data per SBT record without creating buffer descriptors or touching the descriptor heap, however there is no direct equivalent to this in the pure SM 6.6 world. We have to either create some buffer descriptors per SBT record (which can be prohibitive), or somehow try to pool all such cases into one buffer, or invent some scheme to encode a buffer descriptor index and offset in one uint32/64.

While all problems are possible to work around with some amount of effort, it's sad to have to do it considering that Vulkan had a solution for this problem for years now. I very much wish for the RWStructuredBuffer<float4> myBuffer2 = SomeU64Address; or equivalent feature to be a native thing that HLSL / D3D does.

StarsX commented 1 year ago

I just imagined this before. If a raw buffer (byte address buffer) index could be uint64, and root uav virtual address setting could be 0 as the start address on the API side, this would be just a memory address pointer.

jeremyong-az commented 1 year ago

Before we go here (and I agree that binding addresses to shader resources would be great), I wanted to mention again that I think having virtualized bindings similar to the register and space convention in DX12 supported in SPIR-V/Vulkan would be excellent as a precursor or incremental step. As it stands, all ISVs need to abstract bindings at build time as opposed to runtime where the actual mapping of resources to root signature entries ought to occur.

Tobski commented 1 year ago

Before we go here (and I agree that binding addresses to shader resources would be great), I wanted to mention again that I think having virtualized bindings similar to the register and space convention in DX12 supported in SPIR-V/Vulkan would be excellent as a precursor or incremental step. As it stands, all ISVs need to abstract bindings at build time as opposed to runtime where the actual mapping of resources to root signature entries ought to occur.

I'd rather we not bog this issue down with orthogonal requests, but suffice to say that is something we are thinking about in Vulkan.

jeremyong-az commented 1 year ago

I'd rather we not bog this issue down with orthogonal requests, but suffice to say that is something we are thinking about in Vulkan.

I'm glad to hear it, I mentioned it mainly because from an ISV perspective, convergence of any sort is appreciated (which includes the buffer address proposal above)

Tobski commented 1 year ago

I had a bit of feedback internally that it might be desirable to hide the address and/or pair it with a size in some way at the HLSL level, to make sure that HLSL remains a robust language. The important part here IMO is that it lets you define and access recursive data structures, so I suspect it's fine to do that. The simplest option might be to simply allow ByteAddressBuffer or StructuredBuffer types to exist in buffers, with semantics of "it's a uint64_t address and then size in memory", so that the above example becomes:

// Root buffer acting as a heap
StructuredBuffer<RWStructuredBuffer<float4> > BufferHeap;

[numthreads(32,1,1)]
void CSMain()
{
    // Resource Heap syntax
    RWStructuredBuffer<float4> myBuffer = ResourceDescriptorHeap[0];

    // Rough proposed new syntax
    RWStructuredBuffer<float4> myBuffer2 = BufferHeap[0];
}

However I suspect it might be worth at least making the array type generic in some sense, which could either be achieved by allowing casts between buffer types, or introducing a new opaque "BufferHandle" class which can be cast to those buffer types, maybe with a way to obtain subranges.

This is of course a bit more opaque but still enables the desired functionality.

zhaijialong commented 1 year ago

having all descriptors in a homogenous array results in notable slowdown

@Tobski just out of curious, does that mean the SM6.6 bindless in D3D12 also suffer from the same problem (on some hardwares) ?

Tobski commented 1 year ago

@zhaijialong I understand that it's possible to work around some of these issues in D3D12.

Tobski commented 1 year ago

FYI I've raised this again on the new hlsl specs repo (https://github.com/microsoft/hlsl-specs/issues/17) since that seems like a more appropriate place for this, though atm I'm not clear if it needs runtime changes.

Closing this issue in favour of that new issue.