Add gpuCommandEncoder.fillBuffer8() and gpuCommandEncoder.fillBuffer32()

litherum commented 2 years ago

We've gone back and forth many times about fillBuffer() vs clearBuffer(), 8-bit values vs 32-bit values, and requiring the STORAGE usage vs not requiring it. There are various arguments in favor of various combinations of these things.

I don't necessarily think that polyfilling these inside the API is necessarily all that painful or difficult. This seems like the kind of thing where the value of providing more functionality is higher than the cost of providing it.

So, given that, here's a proposal:

// Does not require the STORAGE usage. Offset and size must be multiples of 4.
GPUCommandEncoder.clearBuffer(
    GPUBuffer destination,
    GPUSize64 destinationOffset = 0,
    GPUSize64 size = untilTheEnd);

// Require the STORAGE usage. Offset and size are in units of bytes and must be multiples of 4.
GPUCommandEncoder.fillBuffer8(
    GPUBuffer destination,
    octet value,
    GPUSize64 destinationOffset = 0,
    GPUSize64 size = untilTheEnd);
GPUCommandEncoder.fillBuffer32(
    GPUBuffer destination,
    unsigned long value,
    GPUSize64 destinationOffset = 0,
    GPUSize64 size = untilTheEnd);

	`clearBuffer()`	`fillBuffer8()`	`fillBuffer32()`
Metal	`MTLBlitCommandEncoder.fillBuffer()`	`MTLBlitCommandEncoder.fillBuffer()`	Compute shader polyfill
Direct3D	Whatever we do right now (presumably copying from a pre-zeroed buffer)	`ClearUnorderedAccessViewUint()`	`ClearUnorderedAccessViewUint()`
Vulkan	`vkCmdFillBuffer()`	`vkCmdFillBuffer()`	`vkCmdFillBuffer()`

One of the key observations is that, in every column above, at least 2 of the 3 APIs have a hardware-blit solution.

kainino0x commented 2 years ago

Looks like the ClearUnorderedAccessView* functions are not quite equivalent: they actually encode RGBA data into a particular format. However I'm pretty sure this is fine as we would make a UAV with the R32_UINT format? Just something worth noting.

kvark commented 2 years ago

IIRC, D3D12 implementation isn't that straightforward and involves a few annoying nuances: https://github.com/gfx-rs/gfx/blob/bc77309afdb0829605982370a3e17382c5968071/src/backend/dx12/src/command.rs#L2207-L2274

First, there is a transition into UAV state. It means some cache flushes. If multiple fillBuffer are called on the same buffer, we may need to detect this and issue an UAV barrier. If the user does copy_buffer_to_buffer followed by fill_buffer many times, i.e. zeroing out some chunks and copying others, then they may hit a snag on D3D12 specifically because of the barriers we have to put.

Then there is an issue about descriptors. ClearUnorderedAccessView requires both a descriptor in CPU heap and a descriptor in the active GPU heap. The active GPU heap on D3D12 is already a complicated matter, because it needs to have all resources from the used bind groups for each draw call. Adding another thing that needs to modify it is making the situation worse. Switching active heaps on D3D12 is documented to be heavy, too, if one does it naively like gfx-rs used to do.

So this is possible to implement on D3D12, but a good implementation isn't exactly straightforward. And it could still have unexpected performance issues due to barriers for UAV<->Copy states.

Finally, "Compute shader polyfill" is something gfx-rs was happy to add (because of Vulkan Portability), but I personally hoped that WebGPU will be designed as an intersection of the APIs, and that we'd not need that. I don't believe that filling buffers with arbitrary values is such an important issue to diverge from our general "intersect features" workflow. Users can already create a compute shader, and its performance will be predictable, at least.

kainino0x commented 2 years ago

Personally I'd like to see some real use-cases for these. Thinking about it further I'm still having a really hard time thinking of cases that would benefit. All I've thought of so far:

Filling with a sentinel {,U}INT{8,16,32}_MAX value (for marking uninitialized or unprocessed areas in an algorithm that can output 0 but can't output 0xFF/FFFF/FFFFFFFF), but not sure exactly when this could be useful
Filling an index buffer with primitive restart values, I guess?

So I'd really like to be convinced why this is actually useful and warrants significant effort implementation effort.

I could see value in something that sets the value of an entire array-of-structs (i.e. fillBuffer with a variable-sized argument), but I think that would always be implemented with compute and users can quite easily do that themselves (and initialize from WGSL without worrying about memory layout).

magcius commented 2 years ago

Do people need to fill values in buffers that aren't 0? I have never needed this and cannot imagine a use case.

kainino0x commented 2 years ago

Triage: marking v1.0 because Myles indicated it would be in the meeting just before this was filed. (TBD whether the group decides to postpone or close the issue.)

Kangz commented 2 years ago

@litherum any concerns moving this post-V1?

litherum commented 2 years ago

:( ok

gpuweb / gpuweb

Add gpuCommandEncoder.fillBuffer8() and gpuCommandEncoder.fillBuffer32() #2372