Open nathanielsimard opened 2 weeks ago
Briefly looked at implementing this; some thoughts:
Handles are long-lived and are meant to be held by a type (often a tensor). Based on the number of data holders referencing the value, you may or may not be able to safely perform in-place operations (via the `can_mut` method). Bindings are arguments to be sent to the server. They don't hold all of the reference counts, since they don't impact mutability. However, they hold another reference used to track whether their buffer has been correctly registered in the GPU's queue (flush) before deallocation or reassignment.
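To make the handle/mutability relationship concrete, here is a minimal sketch of the reference-counting idea behind `can_mut`. This is not the real cubecl `Handle` (which tracks more state and separates handle refs from binding refs); it just shows how "number of data holders" decides in-place safety, assuming a simplified handle backed by an `Arc`:

```rust
use std::sync::Arc;

// Hypothetical simplified handle; a stand-in for a GPU buffer reference.
#[derive(Clone)]
pub struct Handle {
    buffer: Arc<Vec<u8>>, // placeholder for the actual device buffer
}

impl Handle {
    pub fn new(bytes: usize) -> Self {
        Self {
            buffer: Arc::new(vec![0; bytes]),
        }
    }

    /// In-place mutation is safe only when this handle is the sole data holder.
    pub fn can_mut(&self) -> bool {
        Arc::strong_count(&self.buffer) == 1
    }
}
```

Cloning the handle (a second tensor referencing the same data) makes `can_mut` return `false` until the clone is dropped.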
I think you've identified the most challenging aspect: we need to allocate the ID without the server, in advance. The handle itself can't know whether it is mapped, since it isn't mutable, but the server should reserve the memory when registering a task in the GPU's queue.
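A rough sketch of that split, with assumed names (`alloc_id`, `bind` are illustrative, not the actual API): the client mints IDs eagerly with an atomic counter, and the server binds backing memory lazily, only when a task referencing the ID is registered in the queue:

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};

pub struct HandleId(pub u64);

static NEXT_ID: AtomicU64 = AtomicU64::new(0);

/// Client-side: allocate a fresh ID without involving the server.
pub fn alloc_id() -> HandleId {
    HandleId(NEXT_ID.fetch_add(1, Ordering::Relaxed))
}

#[derive(Default)]
pub struct Server {
    bound: HashMap<u64, Vec<u8>>, // id -> backing memory (placeholder)
}

impl Server {
    /// Server-side: reserve memory the first time a queued task uses the id.
    pub fn bind(&mut self, id: &HandleId, bytes: usize) {
        self.bound.entry(id.0).or_insert_with(|| vec![0; bytes]);
    }

    pub fn is_bound(&self, id: &HandleId) -> bool {
        self.bound.contains_key(&id.0)
    }
}
```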
I don't think this is a problem, is it? You can write to the same buffer in multiple kernels in the same encoder, I believe. If not, we may need to track buffer visibility in the encoder and flush it dynamically based on potential conflicts. It would probably be challenging to support, but not impossible.
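If dynamic flushing did turn out to be necessary, one way it could look is a small tracker over buffer IDs — purely a sketch of the idea, not anything in cubecl: remember which buffers the current encoder has written, and force a flush before any dispatch that reads one of them:

```rust
use std::collections::HashSet;

// Hypothetical conflict tracker: flush the encoder whenever a new dispatch
// would read a buffer already written since the last flush.
#[derive(Default)]
pub struct EncoderTracker {
    written: HashSet<u64>, // buffer ids written since the last flush
    pub flushes: usize,
}

impl EncoderTracker {
    /// Register a dispatch's reads and writes; returns true if it forced a flush.
    pub fn register(&mut self, reads: &[u64], writes: &[u64]) -> bool {
        let conflict = reads.iter().any(|b| self.written.contains(b));
        if conflict {
            // Flushing makes the earlier writes visible, so the set resets.
            self.written.clear();
            self.flushes += 1;
        }
        self.written.extend(writes.iter().copied());
        conflict
    }
}
```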
```rust
let output = client.empty();
// I imagine Backend::zeros() at some point resolves to a kernel,
// or imagine some other "init" kernel.
client.execute(Kernel::zeros(), [output]);
client.execute(Kernel::MyCustom(), [a, b, c, output]);
```
You would need to guarantee the output doesn't reuse the buffer from a, b, or c, but it gets allocated in the first execute call, where there seem to be no conflicts.
To be fully general, you'd pretty much need the full execution graph... that seems like a lot, but this "init" case definitely occurs for me.
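A toy replay of that scenario (illustrative IDs only, not the real API): allocation happens lazily inside the execute call that first uses the handle, and the exclusion list only covers that call's arguments, so `output` can land on a buffer a later queued kernel still reads:

```rust
// Hypothetical lazy allocator: hand out any free buffer id not explicitly
// excluded by the current execute call's arguments.
fn lazy_alloc(pool: &mut Vec<u64>, exclude: &[u64]) -> Option<u64> {
    let pos = pool.iter().position(|b| !exclude.contains(b))?;
    Some(pool.swap_remove(pos))
}
```

With only per-call exclusions, the init kernel's execute sees no conflict and can hand `output` a buffer that the already-queued second kernel reads; avoiding that requires knowing the later call, i.e. the execution graph.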
Currently, when calling `client.empty(bytes)`, the tensor handle is allocated immediately. This can lead to situations where input handles and output handles share the same memory chunk during kernel execution, which violates the WebGPU specification.

## Potential Solution
1. Introduce mapped and unmapped tensor handles.
2. Return an unmapped handle when calling `client.empty(bytes)`.
3. Add a `map` method in the memory management trait, which acts like `reserve` but with an exclusion list.
4. Call that method in cube runtimes, with an exclusion list for WebGPU.
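To illustrate step 3, here is a minimal sketch of a reserve-with-exclusions over a free pool. The name `reserve_excluding` and the `Pool` shape are assumptions for illustration, not the actual memory management trait:

```rust
use std::collections::HashSet;

// Hypothetical free pool of buffer ids.
#[derive(Default)]
pub struct Pool {
    free: Vec<u64>, // reusable buffer ids
    next_id: u64,
}

impl Pool {
    /// Like `reserve`, but never hands back a buffer present in `exclude`
    /// (e.g. the buffers bound as inputs of the same kernel on WebGPU).
    pub fn reserve_excluding(&mut self, exclude: &HashSet<u64>) -> u64 {
        if let Some(pos) = self.free.iter().position(|b| !exclude.contains(b)) {
            self.free.swap_remove(pos)
        } else {
            // No acceptable free buffer: allocate a fresh one.
            self.next_id += 1;
            self.next_id
        }
    }

    pub fn release(&mut self, id: u64) {
        self.free.push(id);
    }
}
```

A freed buffer stays in the pool and can still be reused by any later reservation that doesn't exclude it, so reuse is only restricted where aliasing would actually violate WebGPU's rules.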