This PR fixes the performance problem this library has suffered from so far.
Turns out, binding newly created buffers to WebGPU commands is expensive. You really want to reuse buffers, not just to keep memory usage down but also to keep the GPU happy.
I tried a couple of tricks:

- Pooled buffers, but the browser's garbage collector wasn't aggressive enough to free them.
- Sub-buffers carved out of large (maximum-size) buffers with a custom heap allocator, but again the GC wasn't aggressive enough.

The heap allocator was the best of these, but the GC just wasn't freeing buffers fast enough for big computations.
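The reuse idea behind the pooling attempt can be sketched as a size-keyed free list. This is a hypothetical illustration, not the library's actual code; `BufferLike` stands in for `GPUBuffer` so the pattern can be shown without a real `GPUDevice`:

```typescript
// Hypothetical sketch of a size-keyed buffer pool. BufferLike stands in
// for GPUBuffer; in a real pool, the allocation branch would call
// device.createBuffer({ size, usage }).
type BufferLike = { byteLength: number; id: number };

class BufferPool {
  private free = new Map<number, BufferLike[]>(); // byteLength -> free buffers
  private nextId = 0;

  acquire(byteLength: number): BufferLike {
    const list = this.free.get(byteLength);
    if (list && list.length > 0) return list.pop()!; // reuse instead of allocating
    return { byteLength, id: this.nextId++ };        // fresh allocation
  }

  release(buf: BufferLike): void {
    let list = this.free.get(buf.byteLength);
    if (!list) this.free.set(buf.byteLength, (list = []));
    list.push(buf);
  }
}

const pool = new BufferPool();
const a = pool.acquire(1024);
pool.release(a);
const b = pool.acquire(1024); // the released buffer comes back instead of a new one
```

The weak spot, as described above, is the `release` call: if it only happens from a GC finalizer, freeing is at the collector's mercy, which is exactly why pooling alone wasn't enough.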
Finally, I decided that the only way to guarantee that buffers could be reused efficiently was to first build the data flow graph, translate it to SSA form, and then determine buffer liveness from that. It's not hard, but it's only worth doing when the compute graph has more than a few nodes.
To enable growing large graphs, I have made tensors lazy by default. They only compute their value on demand, when `toArrayAsync()` is called or the `storage` property is accessed. This is a big breaking change from how PyTorch works, but the performance gains are worth it.
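The lazy-by-default behavior can be sketched like this. The `LazyTensor` class and its `add` op are illustrative stand-ins, not the library's API; only the `toArrayAsync()` name comes from the description above:

```typescript
// Minimal sketch of lazy-by-default evaluation: building an op records a
// thunk in the graph, and nothing runs until a value is explicitly requested.
class LazyTensor {
  private cached: number[] | null = null;
  constructor(private compute: () => number[]) {}

  // Hypothetical op: grows the graph without doing any work.
  add(other: LazyTensor): LazyTensor {
    return new LazyTensor(() =>
      this.materialize().map((x, i) => x + other.materialize()[i])
    );
  }

  private materialize(): number[] {
    if (this.cached === null) this.cached = this.compute(); // compute on demand, once
    return this.cached;
  }

  // Values are only produced when the caller asks for them.
  async toArrayAsync(): Promise<number[]> {
    return this.materialize();
  }
}

const a = new LazyTensor(() => [1, 2, 3]);
const b = new LazyTensor(() => [10, 20, 30]);
const c = a.add(b); // graph grows; no computation has happened yet
```

Deferring computation like this is what makes the liveness analysis pay off: by the time anything executes, the whole graph is known, so buffer lifetimes can be planned up front.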
Eager evaluation benchmark: 32 GB of memory

Lazy evaluation benchmark: 4 GB of memory