llvm / llvm-project

[WebAssembly] Masked lanes in auto-vectorization #101725

Open · yurydelendik opened 1 month ago

yurydelendik commented 1 month ago

While working on https://bugzilla.mozilla.org/show_bug.cgi?id=1887312 I discovered shuffles that are inefficient/high-cost from the WebAssembly compilation point of view: they expand to more than 4 native instructions and may also read a constant from memory.

An example:


    local.get 3
    local.get 3
    local.get 3
    i8x16.shuffle 8 9 10 11 12 13 14 15 0 0 0 0 0 0 0 0
    i8x16.min_u
    local.tee 3
    local.get 3
    local.get 3
    i8x16.shuffle 4 5 6 7 0 0 0 0 0 0 0 0 0 0 0 0
    i8x16.min_u
    local.tee 3
    local.get 3
    local.get 3
    i8x16.shuffle 2 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    i8x16.min_u
    local.tee 3
    local.get 3
    local.get 3
    i8x16.shuffle 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    i8x16.min_u
    i8x16.extract_lane_u 0

I think it is the result of something like:

    int run(unsigned char* a) {
      unsigned char m = 255;
      for (int i = 0; i < 10000; i++)
        if (a[i] < m) m = a[i];
      return m;
    }
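
For reference, here is a rough sketch of what I assume the auto-vectorized loop looks like, written with the wasm_simd128.h intrinsics (illustrative only, not clang's exact output, and the function name is made up): the loop keeps 16 running minima in a v128, and the shuffle/min sequence in the wat above is the final horizontal fold of those 16 lanes.

    #include <wasm_simd128.h>

    /* Sketch only: assumed shape of the vectorized loop, not clang's exact output. */
    unsigned char run_vectorized(const unsigned char *a) {
      v128_t m = wasm_u8x16_splat(255);
      /* 10000 = 625 * 16, so the whole loop vectorizes without a scalar tail. */
      for (int i = 0; i < 10000; i += 16)
        m = wasm_u8x16_min(m, wasm_v128_load(a + i));
      /* Horizontal fold: four shuffle+min steps, then extract lane 0.
         After each step only the low 8/4/2/1 lanes of the intermediate are used. */
      m = wasm_u8x16_min(m, wasm_i8x16_shuffle(m, m, 8, 9, 10, 11, 12, 13, 14, 15,
                                               0, 0, 0, 0, 0, 0, 0, 0));
      m = wasm_u8x16_min(m, wasm_i8x16_shuffle(m, m, 4, 5, 6, 7, 0, 0, 0, 0,
                                               0, 0, 0, 0, 0, 0, 0, 0));
      m = wasm_u8x16_min(m, wasm_i8x16_shuffle(m, m, 2, 3, 0, 0, 0, 0, 0, 0,
                                               0, 0, 0, 0, 0, 0, 0, 0));
      m = wasm_u8x16_min(m, wasm_i8x16_shuffle(m, m, 1, 0, 0, 0, 0, 0, 0, 0,
                                               0, 0, 0, 0, 0, 0, 0, 0));
      return wasm_u8x16_extract_lane(m, 0);
    }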

From a logical point of view, the shuffles in the example above do not really need to specify a particular lane (0) for the unused positions; those lanes are never read. If those lane values were chosen differently instead of 0, the Wasm compiler could produce far more efficient instructions, although the best choice may depend on the specific CPU.
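
To make "lanes that do not matter" concrete at the source/IR level, here is a minimal sketch using Clang's generic vector extensions (an illustration of the idea, not what the vectorizer actually emits): __builtin_shufflevector accepts -1 as a "don't care" index, and __builtin_elementwise_min (recent Clang) provides the unsigned byte minimum. With the upper lanes left undefined like this, the backend is free to materialize whatever concrete i8x16.shuffle indices are cheapest for the engine.

    /* Sketch with Clang vector extensions; -1 marks a lane whose value is irrelevant. */
    typedef unsigned char u8x16 __attribute__((vector_size(16)));

    unsigned char min_reduce(u8x16 v) {
      /* Only the low 8 lanes of the result are used after this step. */
      v = __builtin_elementwise_min(
          v, __builtin_shufflevector(v, v, 8, 9, 10, 11, 12, 13, 14, 15,
                                     -1, -1, -1, -1, -1, -1, -1, -1));
      /* Only the low 4 lanes are used. */
      v = __builtin_elementwise_min(
          v, __builtin_shufflevector(v, v, 4, 5, 6, 7, -1, -1, -1, -1,
                                     -1, -1, -1, -1, -1, -1, -1, -1));
      /* Only the low 2 lanes are used. */
      v = __builtin_elementwise_min(
          v, __builtin_shufflevector(v, v, 2, 3, -1, -1, -1, -1, -1, -1,
                                     -1, -1, -1, -1, -1, -1, -1, -1));
      /* Only lane 0 is used. */
      v = __builtin_elementwise_min(
          v, __builtin_shufflevector(v, v, 1, -1, -1, -1, -1, -1, -1, -1,
                                     -1, -1, -1, -1, -1, -1, -1, -1));
      return v[0];
    }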

I opened an issue about masked lanes https://github.com/WebAssembly/flexible-vectors/issues/66 with more examples.

In SpiderMonkey, we already match some shuffle patterns to select better CPU instructions (https://searchfox.org/mozilla-central/source/js/src/jit/ShuffleAnalysis.h), but these shuffles are hard to match.
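
As an illustration of the matching problem (assumed engine behavior, not a measured result): the two reduction steps below compute identical values in the low 8 lanes, so either is a valid first step. The first fills the don't-care lanes with 0, which in general only matches a generic byte shuffle (e.g. pshufb plus a constant load on x86, per the cost noted above). The second repeats the high half, a concrete "duplicate high 64 bits" mask that a pattern matcher like ShuffleAnalysis has a much better chance of mapping to a single native instruction.

    #include <wasm_simd128.h>

    /* Both steps are interchangeable for the reduction: only the low 8 lanes matter. */
    v128_t step_zero_fill(v128_t v) {
      /* Don't-care lanes forced to 0: generally needs a generic byte shuffle. */
      return wasm_u8x16_min(
          v, wasm_i8x16_shuffle(v, v, 8, 9, 10, 11, 12, 13, 14, 15,
                                0, 0, 0, 0, 0, 0, 0, 0));
    }

    v128_t step_dup_high(v128_t v) {
      /* Don't-care lanes filled by repeating the high half: a pattern an engine
         may lower to one instruction (e.g. punpckhqdq on x86). */
      return wasm_u8x16_min(
          v, wasm_i8x16_shuffle(v, v, 8, 9, 10, 11, 12, 13, 14, 15,
                                8, 9, 10, 11, 12, 13, 14, 15));
    }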

This is more of an RFC issue to learn about the internals of the WebAssembly target and auto-vectorization, and to ask whether it is possible to improve the generation of these shuffles in Wasm when some lanes do not matter.

/cc @ppenzin @tlively

llvmbot commented 1 month ago

@llvm/issue-subscribers-backend-webassembly

Author: Yury Delendik (yurydelendik)

tlively commented 1 month ago

It looks like we already do some optimization of unused/undefined shuffle indices to generate better patterns: https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/WebAssembly/WebAssemblyISelLowering.cpp#L2321-L2330. That's probably the best place to add more such optimizations.

ppenzin commented 1 week ago

Should we try to disable horizontal reductions for small lane sizes? An 8-bit reduction would still result in four byte shuffles, which are pretty slow.

tlively commented 1 week ago

Disable as in error out if the frontend tries to use them, or disable as in bail out to some other lowering scheme?