Hand-tiling matrix multiply in Dex speeds it up from ~100ms to ~5.5ms on my laptop (for a 500x500x500 dense multiplication).
There are some caveats, though:
The tile sizes are just arbitrary numbers, and there is no tuning to different hardware.
The hand-tiled version relies on Writer to construct the output in place, which defeats output fusion. (But then again, I don't know that the previous implementation would have fused well on the output either.)
Adding @noinline to the hand-tiled function nerfs its performance back to ~40ms, presumably because it defeats LLVM's vectorizer (but I'm not sure why it does that).
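For readers unfamiliar with the technique, here is a minimal sketch of loop tiling in plain Python. It is not the Dex code from this change; the function names and the default tile sizes (`tile_i`, `tile_j`, `tile_k`) are hypothetical and chosen only for illustration. It shows the general shape: the output is updated in place one block at a time, and the tile sizes are plain constants that would need tuning per machine.

```python
def matmul_naive(a, b):
    """Reference triple loop: out[i][j] = sum over p of a[i][p] * b[p][j]."""
    n, k, m = len(a), len(b), len(b[0])
    out = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = 0.0
            for p in range(k):
                acc += a[i][p] * b[p][j]
            out[i][j] = acc
    return out


def matmul_tiled(a, b, tile_i=4, tile_j=4, tile_k=4):
    """Tiled matrix multiply (illustrative sketch, not the Dex implementation).

    The tile sizes are arbitrary assumptions, not tuned values.
    """
    n, k, m = len(a), len(b), len(b[0])
    # The output is allocated once and accumulated into in place.
    out = [[0.0] * m for _ in range(n)]
    # Outer loops walk over tiles of the i, j, and p index spaces.
    for i0 in range(0, n, tile_i):
        for j0 in range(0, m, tile_j):
            for p0 in range(0, k, tile_k):
                # Inner loops stay within one tile; min() handles ragged edges.
                for i in range(i0, min(i0 + tile_i, n)):
                    row_a = a[i]
                    row_out = out[i]
                    for p in range(p0, min(p0 + tile_k, k)):
                        a_ip = row_a[p]
                        row_b = b[p]
                        for j in range(j0, min(j0 + tile_j, m)):
                            row_out[j] += a_ip * row_b[j]
    return out
```

In pure Python this will not be faster than the naive loop; the benefit of tiling comes from cache reuse and vectorization in compiled code, which is what the timings above reflect.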