We now have "conflicting" execution models between CPU and GPU. The GPU backend is applied lazily (i.e. we construct a compute graph of all operations and then dispatch it to the GPU), whereas the CPU backend applies each operation as we build the graph.
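To make the difference concrete, here is a minimal sketch of the two models. `Op`, `EagerTensor`, `LazyTensor` and `resolve` are hypothetical names made up for illustration and are not Ratchet's actual types or API; the "GPU dispatch" is simulated by a plain loop.

```rust
/// Hypothetical op set, for illustration only (not Ratchet's real types).
#[derive(Debug, Clone)]
enum Op {
    Add(f32),
    Mul(f32),
}

/// Applies a single elementwise op in place.
fn apply(op: &Op, data: &mut [f32]) {
    for x in data.iter_mut() {
        match op {
            Op::Add(v) => *x += *v,
            Op::Mul(v) => *x *= *v,
        }
    }
}

/// Eager (CPU-style): every call does its work immediately.
struct EagerTensor {
    data: Vec<f32>,
}

impl EagerTensor {
    fn op(mut self, op: Op) -> Self {
        apply(&op, &mut self.data);
        self
    }
}

/// Lazy (GPU-style): calls only record ops; nothing runs until `resolve`.
struct LazyTensor {
    data: Vec<f32>,
    pending: Vec<Op>,
}

impl LazyTensor {
    fn new(data: Vec<f32>) -> Self {
        Self { data, pending: Vec::new() }
    }

    /// Records the op in the pending graph without touching the data.
    fn op(mut self, op: Op) -> Self {
        self.pending.push(op);
        self
    }

    /// Stands in for "dispatch the whole graph to the GPU".
    fn resolve(mut self) -> Vec<f32> {
        for op in &self.pending {
            apply(op, &mut self.data);
        }
        self.data
    }
}
```

With the same chain of calls, the eager tensor holds its final values after each `op`, while the lazy tensor's data is unchanged until `resolve` runs. Code written against one model can silently misbehave on the other, which is the conflict described above.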
Many operations are not yet implemented.
Matmul optimizations are required for performance to be acceptable. Candle did well here with some wasm128 (WebAssembly 128-bit SIMD) magic.
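As a rough illustration of the kind of matmul optimization meant here, the sketch below compares a naive row-major matmul with a loop-reordered variant whose inner loop walks contiguous memory. This is not Candle's or Ratchet's implementation; the function names are made up for the example, and explicit SIMD (e.g. the `simd128` target feature with `core::arch::wasm32` intrinsics) is deliberately left out. The assumption is that a contiguous inner loop is what a SIMD kernel would then vectorize.

```rust
/// Naive row-major matmul: C[m x n] = A[m x k] * B[k x n].
fn matmul_naive(a: &[f32], b: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
    let mut c = vec![0.0f32; m * n];
    for i in 0..m {
        for j in 0..n {
            let mut acc = 0.0f32;
            for p in 0..k {
                // B is read with stride n, which is unfriendly to the cache.
                acc += a[i * k + p] * b[p * n + j];
            }
            c[i * n + j] = acc;
        }
    }
    c
}

/// Same product with the loops reordered to i-p-j, so the inner loop
/// reads a row of B and updates a row of C contiguously.
fn matmul_ikj(a: &[f32], b: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
    let mut c = vec![0.0f32; m * n];
    for i in 0..m {
        for p in 0..k {
            let a_ip = a[i * k + p];
            let b_row = &b[p * n..(p + 1) * n];
            let c_row = &mut c[i * n..(i + 1) * n];
            for (c_ij, b_pj) in c_row.iter_mut().zip(b_row) {
                *c_ij += a_ip * *b_pj;
            }
        }
    }
    c
}
```

Both functions compute the same result; only the memory access pattern differs. Real kernels usually go further (tiling, packing, explicit SIMD), so treat this as a starting point rather than the end state.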
The Ratchet CPU backend is nearly here!
The items above are the remaining work to be done.