PoignardAzur opened this issue 2 months ago
Making a low-effort CPU runtime would probably be as hard as making a proper one. To speed things up, we might generalize our CUDA compiler into a C++ compiler and compile the emitted C++ with GCC or LLVM. The C++ compiler wouldn't be embedded, but this approach would be faster to develop.
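Roughly, that non-embedded flow might look like the sketch below; the `g++` invocation, flags, temp paths, and the `libloading` crate are illustrative assumptions rather than a concrete design.

```rust
use std::process::Command;

/// Hypothetical helper: write generated C++ to disk, compile it with the host
/// toolchain, and load the resulting shared library at runtime.
fn build_cpu_kernel(cpp_source: &str) -> Result<libloading::Library, Box<dyn std::error::Error>> {
    let src = std::env::temp_dir().join("kernel.cpp");
    let lib = std::env::temp_dir().join("libkernel.so");
    std::fs::write(&src, cpp_source)?;

    // Invoke the system C++ compiler; an LLVM setup would swap in `clang++`.
    let status = Command::new("g++")
        .args(["-O3", "-march=native", "-shared", "-fPIC"])
        .arg(&src)
        .arg("-o")
        .arg(&lib)
        .status()?;
    if !status.success() {
        return Err("host C++ compiler failed".into());
    }

    // SAFETY: we just produced this library ourselves and trust its contents.
    Ok(unsafe { libloading::Library::new(&lib)? })
}
```

Kernels would then be looked up by symbol name and called through function pointers; the obvious cost is a runtime dependency on an external C++ compiler.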
In your README, you mention wanting to build a Cranelift JIT backend for the CPU.
I can see the appeal of such a backend, but at the same time there are use cases where users really want a CPU runtime for their shaders and don't care that much about performance.
For instance, in Vello, we end up maintaining CPU pseudo-shaders in parallel with our actual WGSL shaders, mostly for testing and as a fallback. Personally, I'd like to push the fallback case even further so we can run Vello on machines without GPUs; in those cases, being able to run anything at all is a win, even with degraded performance. If we could achieve that and get rid of our duplicate CPU shaders, it would be a massive win for us.
Have you considered making a best-effort CPU runtime? One where annotated Rust functions are simply lowered to regular Rust functions, and auto-vectorization is left to the rustc backend? How much effort do you think it would take to implement that runtime?
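To make that concrete, the kind of lowering I have in mind is sketched below. The `#[kernel]` attribute and `global_id` intrinsic in the comment are placeholders rather than this project's actual API, and the CPU version is just a hand-written guess at what the macro could expand to.

```rust
// GPU-style form (placeholder syntax, shown as a comment): one invocation per element.
//
// #[kernel]
// fn saxpy(ctx: &Ctx, a: f32, x: &[f32], y: &mut [f32]) {
//     let i = ctx.global_id();
//     y[i] += a * x[i];
// }

// Best-effort CPU lowering: the per-invocation body becomes the body of an
// ordinary loop, and auto-vectorization is left to the rustc/LLVM backend.
pub fn saxpy_cpu(a: f32, x: &[f32], y: &mut [f32]) {
    for (yi, xi) in y.iter_mut().zip(x) {
        *yi += a * xi;
    }
}

fn main() {
    let x = vec![1.0_f32; 1024];
    let mut y = vec![2.0_f32; 1024];
    saxpy_cpu(0.5, &x, &mut y);
    assert!(y.iter().all(|&v| (v - 2.5).abs() < 1e-6));
}
```

Even a naive expansion like this would cover the "run anywhere" case; performance tuning could come later.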