gpuweb / gpuweb

Where the GPU for the Web work happens!
http://webgpu.io

Efficient Per-Frame/Transient Bind Groups #915

Open tklajnscek opened 4 years ago

tklajnscek commented 4 years ago

Quick TL;DR:

Is there a way to efficiently create transient bind groups in WebGPU? If so, what is it? If not, is the group willing to entertain the idea of a simple hint/flag to help with this?

The problem

Obviously as many bind groups as possible should be created up-front and then used over and over again, but there will always be some things unknown until or close to draw time.

For some resources like buffers, WebGPU has dynamic offsets which let us change the offset without re-creating bind groups. The limitation being that we're always within the same buffer, but that's mostly workable.

Unfortunately there's no such thing for textures. There are a lot of cases where textures are not known up-front such as render targets that are used as inputs to other draws. Currently there's no good option other than to create single-use bind groups, use them and throw them away at every draw, which seems wasteful since the underlying implementation is most likely not designed for this kind of usage. There's also probably a case to be made for dynamic buffers that are not offsets within just one buffer.
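The single-use pattern described above looks roughly like this (a minimal sketch; `drawWithTransientGroup` and the binding layout are illustrative, not from any real engine):

```javascript
// Sketch of the wasteful per-draw pattern: a fresh, single-use bind group is
// created for a render-target texture view that only became known at draw
// time. All names here are illustrative placeholders.
function drawWithTransientGroup(device, pass, pipeline, layout, textureView) {
  // This bind group is used exactly once and then becomes garbage.
  const bindGroup = device.createBindGroup({
    layout,
    entries: [{ binding: 0, resource: textureView }],
  });
  pass.setPipeline(pipeline);
  pass.setBindGroup(0, bindGroup);
  pass.draw(3); // e.g. a fullscreen triangle sampling the render target
  return bindGroup; // returned only so callers can observe the churn
}
```

Every call allocates a bind group that is dead after one draw, which is exactly the churn being discussed.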

Now I do realize you could say "but you should know all the render targets up front and should be able to bake these as static bind groups". However, it's not that straightforward as your rendering pipeline grows in complexity: some of it is even runtime-generated (render graphs), and then you start throwing in pooled/transient render targets.

(Just one possible) Solution

To address this in our engine, we have the notion of dynamic bind groups, which are optimized to allocate linearly and fill descriptors efficiently on all supported platforms.

These bind groups work exactly the same as regular ones, except that their lifetime is a single frame and they never have to be cleaned up; it's a simple fire-and-forget system.

Question/Proposal

Is the current spec of WebGPU enough to avoid performance issues with these? If so, developers should just create the bind groups with each draw that needs them, use them, and forget them immediately.

Or do you feel it's worth investigating this further, potentially adding a flag/usage hint to bind groups that lets the implementation handle these better/faster/lighter?

Kangz commented 4 years ago

It's pretty clear that bindgroup creation will be one of the biggest hotspots of the API.

In the original NXT API there was a MUTABLE flag on bindgroups that was supposed to allow changing the content of the bindgroup on the device timeline. (The idea was that it would orphan the previous content and create a new set of descriptors, depending on the underlying API.) It's kind of similar to the hint you suggested.

IMHO we'll need a mechanism like that eventually. The problem is that we have too little experience with applications / engines using WebGPU to know the exact problem to address. That's why I think the MVP should not have such a mechanism, to avoid designing ourselves into a corner. Then, based on aggregate feedback, we'll be able to figure out the best solution.

tklajnscek commented 4 years ago

Thanks for the quick response!

That makes sense. If there's a regular release cycle planned after MVP that should work out fine. I might just be a bit traumatized with the whole WebGL 1 to WebGL 2 transition that never happened so I thought I'd bring it up sooner than later :)

For what it's worth, the primary use cases for dynamic/transient bind groups in our engine are roughly:

magcius commented 4 years ago

Unfortunately there's no such thing for textures. There are a lot of cases where textures are not known up-front such as render targets that are used as inputs to other draws. Currently there's no good option other than to create single-use bind groups, use them and throw them away at every draw

There's another option here: cache them at the user layer. That's what I do. For my implementation, you need a way to compute a "hash code" for a BindGroup (which means all resources need a globally unique ID), and a way to compare two BindGroups for equality. This lets me cache BindGroups, and after a few frames the app isn't creating anything new anymore.

I wish JavaScript had some form of "pointer address" that would let me get a HashCode or unique ID per object (would make so many things much easier!), or a HashMap that would let me override the hash / equality test, so I don't have to write them myself, as it is a bit of infrastructure to write.

One other option here is to mandate that bind groups will be cached by browsers so that createBindGroup() with the same arguments will return the same object (bonus points if it returns the same JS object to cut on GC costs, though WebIDL might hate that). Perhaps it makes sense to bake a cache into the API, with a createIfMissing parameter where createBindGroup() will return null if the object is uncached, so the application can determine if it wants to take the creation hit.
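The user-layer cache described here can be sketched as follows (assuming app-assigned unique resource IDs; `BindGroupCache` and its key format are hypothetical, not magcius's actual code):

```javascript
// User-level bind group cache: a structural key is built from app-assigned
// unique IDs of the bound resources, so identical descriptors hit the cache
// instead of calling device.createBindGroup again. All names illustrative.
class BindGroupCache {
  constructor(device) {
    this.device = device;
    this.cache = new Map(); // structural key string -> cached bind group
  }
  // entries: [{ binding, resourceId, resource }], where resourceId is the
  // app-assigned globally unique ID of the bound resource.
  get(layout, layoutId, entries) {
    const key = layoutId + "|" +
      entries.map((e) => `${e.binding}:${e.resourceId}`).join(",");
    let group = this.cache.get(key);
    if (group === undefined) {
      group = this.device.createBindGroup({
        layout,
        entries: entries.map((e) => ({ binding: e.binding, resource: e.resource })),
      });
      this.cache.set(key, group);
    }
    return group;
  }
}
```

After a few frames the set of keys stabilizes and no new bind groups are created, matching the "app isn't caching anything anymore" observation above.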

Kangz commented 4 years ago

@tklajnscek thanks for the detailed description! It will help a lot in making sure these use cases are addressed.

I wish JavaScript had some form of "pointer address" that would let me get a HashCode or unique ID per object

Maybe it's possible to do that using Javascript WeakMaps?
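For the identity half of the problem, a WeakMap does work as a per-object unique-ID table (a small sketch; it gives identity-based IDs only, not the structural lookup a cache key needs):

```javascript
// Identity-based unique IDs via WeakMap: each object gets a stable integer
// the first time it is seen, without preventing garbage collection of the
// object (the WeakMap entry disappears along with the key).
const objectIds = new WeakMap();
let nextObjectId = 1;
function idOf(obj) {
  let id = objectIds.get(obj);
  if (id === undefined) {
    id = nextObjectId++;
    objectIds.set(obj, id);
  }
  return id;
}
```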

One other option here is to mandate that bind groups will be cached by browsers so that createBindGroup() with the same arguments will return the same object (bonus points if it returns the same JS object to cut on GC costs, though WebIDL might hate that). Perhaps it makes sense to bake a cache into the API, with a createIfMissing parameter where createBindGroup() will return null if the object is uncached, so the application can determine if it wants to take the creation hit.

That's something we'd very much like to avoid, because bindgroups should be cheap objects to create, and caching them adds a contention point and a lot of additional computation for bindgroups that aren't reused. Caching on both sides of the IPC barrier is even more costly. Maybe it would become palatable if there were a "cache this" hint on bindgroups, not sure.

magcius commented 4 years ago

Maybe it's possible to do that using Javascript WeakMaps?

We need to do structural hashing of JS objects. i.e. look up { bindings: [{ texture: textureResource }] } by comparing just the values in each field. As far as I know, that's not possible, even with WeakMaps. But if you have any other solutions, please let me know. You can see my crummy solution here.

https://github.com/magcius/noclip.website/blob/master/src/HashMap.ts https://github.com/magcius/noclip.website/blob/master/src/gfx/render/GfxRenderCache.ts

That's something we'd very much like to avoid, because bindgroups should be cheap objects to create

Isn't there still an IPC overhead to this? Even if they're cheap, we'd be trading off retained memory cost against creation cost (IPC overhead) and GC cost.

In my view, if transient bind groups are immutable, they would still carry the IPC and GC costs, though the implementation could recycle them sooner -- not that helpful. If they are recyclable, then that would work, but tracking, validation, and async might make that tricky (and might negate the gains).

kainino0x commented 4 years ago

That makes sense. If there's a regular release cycle planned after MVP that should work out fine. I might just be a bit traumatized with the whole WebGL 1 to WebGL 2 transition that never happened so I thought I'd bring it up sooner than later :)

has taken a very long time to happen* :)

A change like this wouldn't be like the WebGL 1 to 2 transition, though; it would be more like the addition of a small WebGL extension (which generally moves at a quicker pace), but with higher priority for implementers because it wouldn't be considered an "optional" feature the way hardware features are.

There are a lot of cases where textures are not known up-front such as render targets that are used as inputs to other draws.

nit: this particular use case is usually served fine by static bind groups, since the render graph looks the same every frame, unless you're trying to read back from the swapchain texture (which may have other issues).

kainino0x commented 4 years ago

(oops, sent too early) Of course the transient bind groups would enable the architecture you mention of having reusable pools of resources, and others.

magcius commented 4 years ago

That's not 100% true. Post-process graphs can change depending on what's needed in the frame, and for those of us with "dynamic surface allocation" (aka pretty much any modern engine), skipping a post-process that requires its own surface means that the rest of the chain will be off by one. You can think of a pool of surfaces with linear allocation, where each entry in the post-process chain takes however many surfaces it needs. If you turn off Bloom but Depth of Field, Motion Blur, and Outline are still on, then those post-process effects will now use different surfaces for their temporaries.

kainino0x commented 4 years ago

Ah, good point; I thought about the fact that there would likely be more than one possible render graph (hopefully smallish finite number, still), but didn't think about how that disrupts the bindings for the pipeline.

kvark commented 4 years ago

These are all very interesting ideas! For me, there is still an elephant in the room: how much would that give us versus the current approach (where the user creates new bind groups)? Suppose you had the ability to "mutate" a bind group. On Vulkan, that would be vkUpdateDescriptorSets, so the only thing saved here (compared to creating a new bind group) is allocating/freeing descriptor sets. How much this will really show up in profiles is an open question to me.

Therefore, I'm on the same page as @Kangz - let's release MVP and see.

GPUBindGroupArena

There is an idea of a primitive that wasn't discussed here. Something like a "bind group arena", i.e. an object that holds the lifetimes of all the bind groups created with it. All of them are released together when the arena is no longer referenced by either the CPU or the GPU. So what the user could do is create an arena every frame and allocate the "transient" bind groups from it. This way, the temporary bind groups would not fragment the descriptor pool space, since their allocations wouldn't be mixed up with the more stable groups. Also, our implementation, at least on Vulkan, would be able to use a more efficient allocation strategy: specifically, the underlying pools would not have the VK_DESCRIPTOR_POOL_CREATE_FREE_DESCRIPTOR_SET_BIT flag on them, and descriptors wouldn't need to be individually released.

This is somewhat non-intrusive. It could be a GPUDevice.createBindGroupArena() plus a field in GPUBindGroupDescriptor to point to it. And I believe it can reap most of the benefit of the original "transient bind group" idea.
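The lifetime semantics of the proposed arena can be modeled in user space like this (a sketch against a stub `device`; `BindGroupArena` and its methods are hypothetical, and a real implementation would recycle a backing descriptor pool rather than a JS array):

```javascript
// User-space model of the proposed arena semantics: bind groups allocated
// through the arena share one lifetime and are dropped together once the
// frame that referenced them completes. The API names are hypothetical.
class BindGroupArena {
  constructor(device) {
    this.device = device;
    this.groups = []; // the arena keeps its groups alive as a unit
  }
  createBindGroup(descriptor) {
    const group = this.device.createBindGroup(descriptor);
    this.groups.push(group);
    return group;
  }
  // Called once the GPU has finished the frame that used this arena; a real
  // implementation would reset the underlying descriptor pool in one call
  // instead of freeing descriptor sets individually.
  reset() {
    this.groups.length = 0;
  }
}
```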

magcius commented 4 years ago

Creating bind groups should be fairly cheap (pretty much anything allocated from a pool should be cheap -- that's the main reason for pools to exist, to move the allocation cost of a lot of objects up-front), so most of the cost is going to be the JS/GC/IPC overhead cost. Mutable/transient bind groups don't have to be backed by vkUpdateDescriptorSets, they can be allocated in a per-frame pool or similar on the backend. The goal is mostly to avoid those overheads, I imagine.

Agreed on waiting until MVP is out before making any rash decisions.

bbernhar commented 3 years ago

I do. The cost of bindgroup creation comes from where they are allocated. Bindings need to co-exist in "online heaps", and this space is finite; when it is exhausted, the result is severe pipeline flushes.

Some APIs, like D3D12, expect the user to manually manage where best to place bindings (root space vs. heap space). Since this is not exposed to the WebGPU developer, the runtime will be on the hook to figure this out either explicitly (ex. hints) or implicitly (ex. promotion).

kvark commented 3 years ago

@bbernhar I'm curious if this is just a Dawn performance issue that is on the radar, versus a conceptual problem with bind groups in the API.

bbernhar commented 3 years ago

Conceptually it is sound, but whether or not BindGroups can be considered "lightweight" will depend on the WebGPU runtime's ability to mitigate (not eliminate) the overhead of this translation (this is not Dawn-specific).

If the BindGroup API could help make performance more portable (ex. hints), it's certainly worth looking into.

magcius commented 3 years ago

Bind groups map to descriptor tables in D3D12. My understanding is that root CBVs don't have any such mapping.

bbernhar commented 3 years ago

I wouldn't mix the concept of a BindGroup with an implementation of them. There is no rule that they must live in a heap, and doing so could be inefficient since heaps require management. If I'm rendering with a few / the same set of resident SRVs, putting those in a table via a BindGroup makes little sense to begin with, and repeatedly doing so makes a bad situation even worse.

benvanik commented 2 years ago

(don't mind me, just necroing this thread 🧟 :)

Systems performing dynamic planning such as FrameGraph will suffer from this API limitation. We're building middleware and it hurts especially hard there as we aren't dealing with a static single-source single-author pipeline as most samples are doing but something composed of a lot of dynamic behavior.

Here I'm seeing some unavoidable and pathologically bad behavior in the way implicit barriers and immutable, unpoolable bind groups interact. Since barriers are ostensibly defined based on the buffers and ranges bound during dispatch, anyone trying to ensure maximum available concurrency would want to tightly specify the buffer ranges per dispatch. This isn't currently possible in a reasonably sophisticated dynamic system without effectively allocating new bind groups per dispatch, as suballocation and multi-frame pipelining via ringbuffers ensure that the ranges involved are almost always unique (even if just practically so).

Dynamic offsets don't reliably help here in reducing the combinatorial explosion of unique binding groups as to use them with dynamic sizes the binding entry size needs to be undefined (WGPU_WHOLE_SIZE) - inducing a lot of false dependencies that implementations don't have sufficient information to avoid. The maximum dynamic offset count is also quite low (maxDynamicStorageBuffersPerPipelineLayout>=4) such that it's not always practical. This would be less of an issue if there were push constants or inline buffer updates but passing dynamic parameters efficiently in a uniform ringbuffer uses one of those dynamic bindings and limits it even further.

One mitigation for the false dependencies may be allowing dynamic sizes to be passed with dynamic offsets - even if not used by underlying systems. A mitigation for the churn would be the proposed GPUBindGroupArena that would prevent thousands of bind groups per frame from putting pressure on the system (GC thrashing, call overhead, etc). Push descriptor set-like commands (or Metal setBinding) would solve both issues where if unavailable in the system the implementation-side pooling of command buffer-local bind groups would be significantly more efficient than making full calls through the API stack and managing the groups in user code.

For now we will just new up bind groups for every use - it's not good (thousands/frame even in small scenarios) and reading the thread I'm not sure there's any additional data needed to know that this is a problem - so consider this a strong vote for a fast-follow on the MVP with some kind of help here for more complex applications :)

Kangz commented 2 years ago

Since barriers are ostensibly defined based on the buffers and ranges bound during dispatch if one was trying to ensure maximum available concurrency they would want to tightly specify the buffer ranges per dispatch.

While this might happen in the future (and something we want to do in Dawn), implementations aren't that smart yet, and I don't think they will be before v1 (it's an optimization, and correctness is more important than optimizations to ship a Web API).

Dynamic offsets don't reliably help here in reducing the combinatorial explosion of unique binding groups as to use them with dynamic sizes the binding entry size needs to be undefined (WGPU_WHOLE_SIZE)

The entry size must be defined for dynamic buffers otherwise there will be a validation error when a non-zero dynamic offset is used.

I'm not sure there's any additional data needed to know that this is a problem

The problem is not having data to know that this will be a problem, but more that the solution we'll need will depend on what complex applications do, which we have 0 data about at the moment. We can design something today, but it might end up being the wrong solution, and then we have to carry that forever.

kvark commented 2 years ago

@benvanik there is a lot to digest in your post. If you are interested in communicating your feedback more clearly, consider filing a discussion and providing a bit more information on each of the points. I feel like there is a bit of guesswork involved in answering it directly as written. However, it's definitely interesting, and we'd like to understand better!

Since barriers are ostensibly defined based on the buffers and ranges bound during dispatch if one was trying to ensure maximum available concurrency they would want to tightly specify the buffer ranges per dispatch.

First of all, we already know about all the buffer ranges your dispatches use. I don't see what extra information you'd need to provide. So it's up to implementations to ensure the barriers aren't needed in certain situations. However, it's important to note that barriers based on buffer ranges are only a thing in Vulkan. In contrast, D3D12 transitions whole resources at once (into different resource states), so tracking ranges doesn't help. So here we are only talking about a Vulkan-specific internal WebGPU optimization, which our implementation will consider at some point later.

Dynamic offsets don't reliably help here in reducing the combinatorial explosion of unique binding groups as to use them with dynamic sizes the binding entry size needs to be undefined (WGPU_WHOLE_SIZE)

This is incorrect. The binding size for a dynamic-offset buffer binding specifies the size of a moving window. It cannot be undefined, since that would mean the "window" covers the whole buffer and thus can't be moved even a byte forward. In wgpu, this is one of the most common newcomer issues.
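A toy model of that validation rule (the real rule is in the WebGPU spec and also involves offset alignment, ignored here; this just illustrates why a whole-buffer binding size pins the dynamic offset to zero):

```javascript
// Toy model of dynamic-offset validation: the binding describes a window of
// `bindingSize` bytes that the dynamic offset slides through the buffer.
// If the window already covers the whole buffer, no non-zero offset fits.
// (Real WebGPU additionally requires aligned offsets; omitted for clarity.)
function dynamicOffsetIsValid(bufferSize, bindingSize, dynamicOffset) {
  return dynamicOffset + bindingSize <= bufferSize;
}
```

With a 1024-byte buffer and a 256-byte window, offsets 0 through 768 fit; with a 1024-byte window, only offset 0 does, so the window cannot move at all.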

The maximum dynamic offset count is also quite low (maxDynamicStorageBuffersPerPipelineLayout>=4) such that it's not always practical.

Nothing in your post suggests that you need storage buffers specifically. You can use uniform buffers in addition - 8 of them. There is, of course, cost associated with using dynamic offsets. I can't imagine why you'd need that many.

Another solution you could do is just binding all the data in a storage buffer with an unsized array, and then indexing the data in the shader based on an index you obtain elsewhere.

A mitigation for the churn would be the proposed GPUBindGroupArena that would prevent thousands of bind groups per frame from putting pressure on the system (GC thrashing, call overhead, etc)

I still think this would be a good addition, and wgpu is now in a much better position to experiment with it. But by no means should this affect the WebGPU shipping schedule.

benvanik commented 2 years ago

Ouch, thanks for pointing out that WGPU_WHOLE_SIZE was not allowed for dynamic bindings - definitely hadn't caught that and was about to build a GPUBindGroupArena-alike assuming it would work. Is that because the validation is happening when the bind group is created (createBindGroup) vs. when it is bound/used and the dynamic offsets are available as in Vulkan?

Unfortunately that makes this scenario even worse: if trying to reuse bind groups you'd then only be able to share ones that had the same exact sizes for all bindings within the group (or overallocate by maxUniformBufferBindingSize/maxStorageBufferBindingSize such that any offset would still be valid) - leading to more churn.

Another solution you could do is just binding all the data in a storage buffer with an unsized array, and then indexing the data in the shader based on an index you obtain elsewhere.

That loops back to what I mentioned: the spec leaves implementations only two ways to extract concurrency and fill bubbles across multiple dispatches - tons of unique subrange bind groups plus complex tracking, or dedicated/duplicate buffers for all potentially cross-dispatch resources. Reducing the churn requires reusing bind groups (or having mutable bind groups), but if bind groups are impractical to reuse because they limit concurrency when specified as whole-buffer, then I wish there were a more direct path (push descriptors/setBinding/etc) that allowed that validation to happen at bind time. There are a lot of overlapping concerns when it comes to scheduling asynchronous work, and what works in some APIs within their entire execution model doesn't always work well piecemeal :(

When trying to get good utilization on a parallel system you don't want false dependencies introducing pipeline bubbles. If dispatch A and dispatch B work on mutually exclusive subranges of a larger buffer then you'd want them to be able to overlap (run concurrently and have B start before all of A completes, etc). Unfortunately here in WebGPU with implicit barriers we (currently) only have mechanisms that operate on entire buffers and introduce false dependencies for RAW/WAW/WAR of any use of those buffers. Being able to specify subregions is a potential escape hatch but as pointed out no one currently does anything with them and may never :(

But even if nothing today uses the subregion dependencies a GPUBindGroupArena would help with the churn and allow for user-mode to at least tell the implementation about those fine-grained dependencies. Of course it's all still just a workaround for not having explicit barriers or another kind of logical work grouping within passes - or an immediate-mode setBinding API. If there were alternative ways to say "these two dispatches may have read-write on the same buffer but will not interfere" (or cooperate with atomics) that'd be best - lower overhead as no bind group trickery and just bind whole buffer ranges, less implementation tracking, and better utilization - and maybe that would negate the need for a lot of this.

The unfortunate tradeoff is to have WebGPU be X% slower than native on the same hardware because no work can overlap and utilization is lower or have WebGPU use Y% more memory than native because each transient resource is put into its own dedicated allocation in order to make the coarse-grained buffer-based barrier insertion work. Would be really nice to explore how to avoid that tradeoff in future versions of the spec. We'll happily provide some data (once we can get something working :) - lots of prior experience with the tradeoff from the GL days but we should be able to get real apples/apples numbers here.

magcius commented 2 years ago

Bind groups should be inexpensive to create. If they're expensive, something has gone wrong.

When trying to get good utilization on a parallel system you don't want false dependencies introducing pipeline bubbles. If dispatch A and dispatch B work on mutually exclusive subranges of a larger buffer then you'd want them to be able to overlap (run concurrently and have B start before all of A completes, etc). Unfortunately here in WebGPU with implicit barriers we (currently) only have mechanisms that operate on entire buffers and introduce false dependencies for RAW/WAW/WAR of any use of those buffers.

In D3D12, resource states apply to the whole resource, no? So you'd need to split your resources up along the fault lines of resource states anyway. You could use placed resources and suballocate resources from a larger buffer, but we still need to create an object to track the states.

benvanik commented 2 years ago

The issue is when you have thousands of bind groups being created in a futile attempt to let the implementation elide barriers by letting it know there are no hazards - even something individually inexpensive can cost a lot at scale :)

(For compute, which is what I'm dealing with) in D3D12 transient buffers are almost always in D3D12_RESOURCE_STATE_UNORDERED_ACCESS and very occasionally transitioned for other uses (copies/indirect arguments/etc - or into/out of compute and graphics). Almost all ResourceBarriers are of type D3D12_RESOURCE_BARRIER_TYPE_UAV, and if there's a RAW on independent ranges of a resource you just don't insert a barrier there - a relatively mundane superpower of explicit barriers :)

The issue here is that WebGPU and its implementations are taking on the role of inserting or omitting those barriers, but not doing it with the fidelity an application written directly against D3D12 (or Vulkan, etc.) can --- unless they use the subrange information, which can (often) only be communicated with new unique bind groups per ~dispatch. The most robust/efficient/predictable solution is to allow explicit barrier management - but with what is currently in the API, providing binding ranges and hoping implementations are able to use them efficiently is all we have. GPUBindGroupArena or an immediate-mode setBinding API would make the user-mode half more efficient, but whether the implementations can effectively utilize that information remains to be seen. That's one of the classic OpenGL/pre-D3D12 things we all wanted to avoid with modern APIs, but I'm hopeful :)

kvark commented 2 years ago

If dispatch A and dispatch B work on mutually exclusive subranges of a larger buffer then you'd want them to be able to overlap (run concurrently and have B start before all of A completes, etc). Unfortunately here in WebGPU with implicit barriers we (currently) only have mechanisms that operate on entire buffers and introduce false dependencies for RAW/WAW/WAR of any use of those buffers

Just to be clear, what's written here is not right. WebGPU has implicit barriers, and an implementation is fully within its rights to omit a UAV barrier between dispatches if they use non-intersecting sub-ranges of the same buffer. After all, such a barrier would be unobservable by the user. The specification defines states for whole resources, but in cases like this it can be followed without actually inserting barriers between dispatches.