Open nathanielsimard opened 2 weeks ago
Briefly looked at implementing this; some thoughts:
Handles are long-lived and are meant to be held by a type (often a tensor). Based on the number of data holders referencing the value, you may or may not be able to safely perform in-place operations (via the `can_mut` method). Bindings are arguments to be sent to the server. They don't hold all of the reference counts, since they don't impact mutability. However, they hold another reference used to track whether their buffer has been correctly registered in the GPU's queue (flush) before deallocation or reassignment.
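To make the handle/mutability relationship concrete, here is a minimal sketch of the reference-counting idea behind `can_mut`. This is not the real cubecl `Handle` (which tracks more state and separates handle refs from binding refs); it just shows how "number of data holders" decides in-place safety, assuming a simplified handle backed by an `Arc`:

```rust
use std::sync::Arc;

// Hypothetical simplified handle; a stand-in for a GPU buffer reference.
#[derive(Clone)]
pub struct Handle {
    buffer: Arc<Vec<u8>>, // placeholder for the actual device buffer
}

impl Handle {
    pub fn new(bytes: usize) -> Self {
        Self {
            buffer: Arc::new(vec![0; bytes]),
        }
    }

    /// In-place mutation is safe only when this handle is the sole data holder.
    pub fn can_mut(&self) -> bool {
        Arc::strong_count(&self.buffer) == 1
    }
}
```

Cloning the handle (a second tensor referencing the same data) makes `can_mut` return `false` until the clone is dropped.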
I think you've identified the most challenging aspect: we need to allocate the ID without the server, in advance. The handle itself can't know whether it is mapped, since it isn't mutable, but the server should reserve the memory when registering a task in the GPU's queue.
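A rough sketch of that split, with assumed names (`alloc_id`, `bind` are illustrative, not the actual API): the client mints IDs eagerly with an atomic counter, and the server binds backing memory lazily, only when a task referencing the ID is registered in the queue:

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};

pub struct HandleId(pub u64);

static NEXT_ID: AtomicU64 = AtomicU64::new(0);

/// Client-side: allocate a fresh ID without involving the server.
pub fn alloc_id() -> HandleId {
    HandleId(NEXT_ID.fetch_add(1, Ordering::Relaxed))
}

#[derive(Default)]
pub struct Server {
    bound: HashMap<u64, Vec<u8>>, // id -> backing memory (placeholder)
}

impl Server {
    /// Server-side: reserve memory the first time a queued task uses the id.
    pub fn bind(&mut self, id: &HandleId, bytes: usize) {
        self.bound.entry(id.0).or_insert_with(|| vec![0; bytes]);
    }

    pub fn is_bound(&self, id: &HandleId) -> bool {
        self.bound.contains_key(&id.0)
    }
}
```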
I don't think this is a problem, is it? You can write to the same buffer in multiple kernels in the same encoder, I believe. If not, we may need to track buffer visibility in the encoder and flush it dynamically based on potential conflicts. It would probably be challenging to support, but not impossible.
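If dynamic flushing did turn out to be necessary, one way it could look is a small tracker over buffer IDs — purely a sketch of the idea, not anything in cubecl: remember which buffers the current encoder has written, and force a flush before any dispatch that reads one of them:

```rust
use std::collections::HashSet;

// Hypothetical conflict tracker: flush the encoder whenever a new dispatch
// would read a buffer already written since the last flush.
#[derive(Default)]
pub struct EncoderTracker {
    written: HashSet<u64>, // buffer ids written since the last flush
    pub flushes: usize,
}

impl EncoderTracker {
    /// Register a dispatch's reads and writes; returns true if it forced a flush.
    pub fn register(&mut self, reads: &[u64], writes: &[u64]) -> bool {
        let conflict = reads.iter().any(|b| self.written.contains(b));
        if conflict {
            // Flushing makes the earlier writes visible, so the set resets.
            self.written.clear();
            self.flushes += 1;
        }
        self.written.extend(writes.iter().copied());
        conflict
    }
}
```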
```rust
let output = client.empty();
// I imagine Backend::zeros() at some point resolves to a kernel,
// or imagine some other "init" kernel.
client.execute(Kernel::zeros(), [output]);
client.execute(Kernel::MyCustom(), [a, b, c, output]);
```
You would need to guarantee the output doesn't reuse the buffer from a, b, or c, but it gets allocated in the first execute call, where there seem to be no conflicts.
To be fully general, you'd pretty much need the full execution graph... that seems like a lot, but this "init" case definitely occurs for me.
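A toy replay of that scenario (illustrative IDs only, not the real API): allocation happens lazily inside the execute call that first uses the handle, and the exclusion list only covers that call's arguments, so `output` can land on a buffer a later queued kernel still reads:

```rust
// Hypothetical lazy allocator: hand out any free buffer id not explicitly
// excluded by the current execute call's arguments.
fn lazy_alloc(pool: &mut Vec<u64>, exclude: &[u64]) -> Option<u64> {
    let pos = pool.iter().position(|b| !exclude.contains(b))?;
    Some(pool.swap_remove(pos))
}
```

With only per-call exclusions, the init kernel's execute sees no conflict and can hand `output` a buffer that the already-queued second kernel reads; avoiding that requires knowing the later call, i.e. the execution graph.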
Currently, when calling `client.empty(bytes)`, the tensor handle is allocated immediately. This can lead to situations where input handles and output handles share the same memory chunk during kernel execution, which violates the WebGPU specification.

## Potential Solution
1. Introduce mapped and unmapped tensor handles.
2. Return an unmapped handle when calling `client.empty(bytes)`.
3. Add a `map` method in the memory management trait, which acts like `reserve` but with an exclusion list.
4. Call that method in cube runtimes, with an exclusion list for WebGPU.
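To illustrate step 3, here is a minimal sketch of a reserve-with-exclusions over a free pool. The name `reserve_excluding` and the `Pool` shape are assumptions for illustration, not the actual memory management trait:

```rust
use std::collections::HashSet;

// Hypothetical free pool of buffer ids.
#[derive(Default)]
pub struct Pool {
    free: Vec<u64>, // reusable buffer ids
    next_id: u64,
}

impl Pool {
    /// Like `reserve`, but never hands back a buffer present in `exclude`
    /// (e.g. the buffers bound as inputs of the same kernel on WebGPU).
    pub fn reserve_excluding(&mut self, exclude: &HashSet<u64>) -> u64 {
        if let Some(pos) = self.free.iter().position(|b| !exclude.contains(b)) {
            self.free.swap_remove(pos)
        } else {
            // No acceptable free buffer: allocate a fresh one.
            self.next_id += 1;
            self.next_id
        }
    }

    pub fn release(&mut self, id: u64) {
        self.free.push(id);
    }
}
```

A freed buffer stays in the pool and can still be reused by any later reservation that doesn't exclude it, so reuse is only restricted where aliasing would actually violate WebGPU's rules.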