Open yurydelendik opened 3 months ago
@llvm/issue-subscribers-backend-webassembly
Author: Yury Delendik (yurydelendik)
It looks like we already do some optimization of unused/undefined shuffle indices to generate better patterns: https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/WebAssembly/WebAssemblyISelLowering.cpp#L2321-L2330. That's probably the best place to add more such optimizations.
Should we try to disable horizontal reductions for small lane sizes? An 8-bit reduction would result in four byte shuffles, which are still pretty slow.
Disable as in error out if the frontend tries to use them, or disable as in bail out to some other lowering scheme?
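To make the shuffle count for 8-bit lanes concrete, here is a minimal Python model of a log-step horizontal add, assuming the reduction halves the live part of the vector with one shuffle per step. The names `reduction_shuffle_masks` and `horizontal_add` are made up for this sketch and are not LLVM APIs; `None` marks an output lane whose value is never used.

```python
from typing import List, Optional, Sequence


def reduction_shuffle_masks(lanes: int) -> List[List[Optional[int]]]:
    """Return one shuffle mask per halving step of a log-step reduction.

    Assumes `lanes` is a power of two. None marks a don't-care output lane.
    """
    masks = []
    width = lanes // 2
    while width >= 1:
        # Lane i (for i < width) reads lane i + width; the rest is unused.
        masks.append([i + width if i < width else None for i in range(lanes)])
        width //= 2
    return masks


def horizontal_add(values: Sequence[int]) -> int:
    """Model an 8-bit wrapping horizontal add built from those shuffles."""
    acc = list(values)
    for mask in reduction_shuffle_masks(len(acc)):
        # Don't-care lanes are filled with 0 here; any value would do,
        # because those lanes never feed into lane 0.
        shuffled = [acc[m] if m is not None else 0 for m in mask]
        acc = [(a + b) & 0xFF for a, b in zip(acc, shuffled)]
    return acc[0]


if __name__ == "__main__":
    for mask in reduction_shuffle_masks(16):
        print(mask)
    print(horizontal_add(range(16)))
```

In this model, 16 lanes of 8 bits need four shuffles while 4 lanes of 32 bits need only two, and after the first step at least half of every mask is don't-care.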
While working on https://bugzilla.mozilla.org/show_bug.cgi?id=1887312, I discovered shuffles that are inefficient/high-cost from the WebAssembly compilation point of view. Each of these shuffles generates more than 4 native instructions, which may also read a constant from memory.
Example is:
I think it is the result of something like:
From a logical point of view, the shuffles in the example above do not really need to specify a particular lane (0) for the lanes that do not matter. If other values were chosen for these lanes instead of 0, the Wasm compiler could produce far more efficient instructions, although the best choice may depend on the specific CPU.
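As a rough illustration of that idea, the sketch below fills don't-care lanes so that the completed mask keeps a simple shape (a splat or a contiguous byte shift) instead of defaulting to 0. The helper `fill_dont_care` and its two patterns are assumptions made for this example; it is not what LLVM or SpiderMonkey implement, and whether the resulting mask is actually cheaper depends on the engine and the CPU. Indices 16..31 refer to the second shuffle operand, as in `i8x16.shuffle`.

```python
from typing import List, Optional, Sequence

LANES = 16  # i8x16.shuffle takes 16 byte indices in the range 0..31


def fill_dont_care(mask: Sequence[Optional[int]]) -> List[int]:
    """Replace None (don't-care) lanes with indices that keep a simple pattern.

    Hypothetical heuristic for illustration only.
    """
    defined = [(i, m) for i, m in enumerate(mask) if m is not None]
    if not defined:
        return list(range(LANES))  # nothing is needed, so use the identity

    # Pattern 1: every defined lane reads the same source byte (a splat).
    first = defined[0][1]
    if all(m == first for _, m in defined):
        return [first] * LANES

    # Pattern 2: every defined lane i reads byte i + shift, so the mask can
    # be completed as one contiguous run (a whole-vector byte shift).
    shift = defined[0][1] - defined[0][0]
    if 0 <= shift <= LANES and all(m - i == shift for i, m in defined):
        return [i + shift for i in range(LANES)]

    # Fallback: fill the don't-care lanes with 0.
    return [0 if m is None else m for m in mask]


if __name__ == "__main__":
    first_step = [8, 9, 10, 11, 12, 13, 14, 15] + [None] * 8
    print(fill_dont_care(first_step))
```

For a mask whose first eight lanes read bytes 8..15 and whose remaining lanes are don't-care, this produces the contiguous run 8..23 rather than 8..15 followed by eight zeros.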
I opened an issue about masked lanes https://github.com/WebAssembly/flexible-vectors/issues/66 with more examples.
In SpiderMonkey, we already match some shuffle patterns to better select CPU instructions (https://searchfox.org/mozilla-central/source/js/src/jit/ShuffleAnalysis.h), but these shuffles are hard to match.
This is more of an RFC issue, to learn about the internals of the WebAssembly target and auto-vectorization, and to ask whether it is possible to improve the generation of shuffles in Wasm when some lanes do not matter.
/cc @ppenzin @tlively