This PR fixes the performance problem this library has suffered from so far.
Turns out, binding newly created buffers to WebGPU commands is expensive. You really want to reuse buffers, not just to keep memory usage down but also to keep the GPU happy.
I tried a couple of tricks:

- Pooled buffers, but the browser's garbage collector wasn't aggressive enough to free them.
- Sub-buffers carved out of large (maximum-size) buffers with a custom heap allocator, but again the GC wasn't aggressive enough.

The heap allocator was the best of these, but the GC just wasn't freeing buffers fast enough for big computations.
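The reuse idea behind the pooling attempt can be sketched as a size-keyed free list. This is a hypothetical illustration, not the library's actual code; `BufferLike` stands in for `GPUBuffer` so the pattern can be shown without a real `GPUDevice`:

```typescript
// Hypothetical sketch of a size-keyed buffer pool. BufferLike stands in
// for GPUBuffer; in a real pool, the allocation branch would call
// device.createBuffer({ size, usage }).
type BufferLike = { byteLength: number; id: number };

class BufferPool {
  private free = new Map<number, BufferLike[]>(); // byteLength -> free buffers
  private nextId = 0;

  acquire(byteLength: number): BufferLike {
    const list = this.free.get(byteLength);
    if (list && list.length > 0) return list.pop()!; // reuse instead of allocating
    return { byteLength, id: this.nextId++ };        // fresh allocation
  }

  release(buf: BufferLike): void {
    let list = this.free.get(buf.byteLength);
    if (!list) this.free.set(buf.byteLength, (list = []));
    list.push(buf);
  }
}

const pool = new BufferPool();
const a = pool.acquire(1024);
pool.release(a);
const b = pool.acquire(1024); // the released buffer comes back instead of a new one
```

The weak spot, as described above, is the `release` call: if it only happens from a GC finalizer, freeing is at the collector's mercy, which is exactly why pooling alone wasn't enough.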
Finally, I decided that the only way to guarantee that buffers could be reused efficiently was to first build the data flow graph, translate it to SSA form, and then determine buffer liveness from that. It's not hard, but it's only worth doing when the compute graph has more than a few nodes.
To enable growing large graphs, I have made tensors lazy by default. They only compute their value on demand, when `toArrayAsync()` is called or the `storage` property is accessed. This is a big breaking change from how PyTorch works, but the performance gains are worth it.
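The lazy-by-default behavior can be sketched like this. The `LazyTensor` class and its `add` op are illustrative stand-ins, not the library's API; only the `toArrayAsync()` name comes from the description above:

```typescript
// Minimal sketch of lazy-by-default evaluation: building an op records a
// thunk in the graph, and nothing runs until a value is explicitly requested.
class LazyTensor {
  private cached: number[] | null = null;
  constructor(private compute: () => number[]) {}

  // Hypothetical op: grows the graph without doing any work.
  add(other: LazyTensor): LazyTensor {
    return new LazyTensor(() =>
      this.materialize().map((x, i) => x + other.materialize()[i])
    );
  }

  private materialize(): number[] {
    if (this.cached === null) this.cached = this.compute(); // compute on demand, once
    return this.cached;
  }

  // Values are only produced when the caller asks for them.
  async toArrayAsync(): Promise<number[]> {
    return this.materialize();
  }
}

const a = new LazyTensor(() => [1, 2, 3]);
const b = new LazyTensor(() => [10, 20, 30]);
const c = a.add(b); // graph grows; no computation has happened yet
```

Deferring computation like this is what makes the liveness analysis pay off: by the time anything executes, the whole graph is known, so buffer lifetimes can be planned up front.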
Eager evaluation benchmark: 32 GB of memory

Lazy evaluation benchmark: 4 GB of memory