llvm / llvm-project

[WebAssembly] Masked lanes in auto-vectorization #101725

Open · yurydelendik opened 1 month ago

yurydelendik commented 1 month ago

While working on https://bugzilla.mozilla.org/show_bug.cgi?id=1887312 I discovered shuffles that are inefficient/high-cost from the WebAssembly compilation point of view: they expand to more than 4 native instructions and may also read a constant from memory.

An example:


    local.get 3
    local.get 3
    local.get 3
    i8x16.shuffle 8 9 10 11 12 13 14 15 0 0 0 0 0 0 0 0
    i8x16.min_u
    local.tee 3
    local.get 3
    local.get 3
    i8x16.shuffle 4 5 6 7 0 0 0 0 0 0 0 0 0 0 0 0
    i8x16.min_u
    local.tee 3
    local.get 3
    local.get 3
    i8x16.shuffle 2 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    i8x16.min_u
    local.tee 3
    local.get 3
    local.get 3
    i8x16.shuffle 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    i8x16.min_u
    i8x16.extract_lane_u 0

I think it is the result of something like:

    int run(unsigned char* a) {
      unsigned char m = 255;
      for (int i = 0; i < 10000; i++)
        if (a[i] < m) m = a[i];
      return m;
    }
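
For reference, here is a rough sketch of what I assume the auto-vectorized loop looks like, written with the wasm_simd128.h intrinsics (illustrative only, not clang's exact output, and the function name is made up): the loop keeps 16 running minima in a v128, and the shuffle/min sequence in the wat above is the final horizontal fold of those 16 lanes.

    #include <wasm_simd128.h>

    /* Sketch only: assumed shape of the vectorized loop, not clang's exact output. */
    unsigned char run_vectorized(const unsigned char *a) {
      v128_t m = wasm_u8x16_splat(255);
      /* 10000 = 625 * 16, so the whole loop vectorizes without a scalar tail. */
      for (int i = 0; i < 10000; i += 16)
        m = wasm_u8x16_min(m, wasm_v128_load(a + i));
      /* Horizontal fold: four shuffle+min steps, then extract lane 0.
         After each step only the low 8/4/2/1 lanes of the intermediate are used. */
      m = wasm_u8x16_min(m, wasm_i8x16_shuffle(m, m, 8, 9, 10, 11, 12, 13, 14, 15,
                                               0, 0, 0, 0, 0, 0, 0, 0));
      m = wasm_u8x16_min(m, wasm_i8x16_shuffle(m, m, 4, 5, 6, 7, 0, 0, 0, 0,
                                               0, 0, 0, 0, 0, 0, 0, 0));
      m = wasm_u8x16_min(m, wasm_i8x16_shuffle(m, m, 2, 3, 0, 0, 0, 0, 0, 0,
                                               0, 0, 0, 0, 0, 0, 0, 0));
      m = wasm_u8x16_min(m, wasm_i8x16_shuffle(m, m, 1, 0, 0, 0, 0, 0, 0, 0,
                                               0, 0, 0, 0, 0, 0, 0, 0));
      return wasm_u8x16_extract_lane(m, 0);
    }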

From a logical point of view, the shuffles in the example above do not really need to specify a particular lane (0) for the unused positions; those lanes are never read. If those lane values were chosen differently instead of 0, the Wasm compiler could produce far more efficient instructions, although the best choice may depend on the specific CPU.
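
To make "lanes that do not matter" concrete at the source/IR level, here is a minimal sketch using Clang's generic vector extensions (an illustration of the idea, not what the vectorizer actually emits): __builtin_shufflevector accepts -1 as a "don't care" index, and __builtin_elementwise_min (recent Clang) provides the unsigned byte minimum. With the upper lanes left undefined like this, the backend is free to materialize whatever concrete i8x16.shuffle indices are cheapest for the engine.

    /* Sketch with Clang vector extensions; -1 marks a lane whose value is irrelevant. */
    typedef unsigned char u8x16 __attribute__((vector_size(16)));

    unsigned char min_reduce(u8x16 v) {
      /* Only the low 8 lanes of the result are used after this step. */
      v = __builtin_elementwise_min(
          v, __builtin_shufflevector(v, v, 8, 9, 10, 11, 12, 13, 14, 15,
                                     -1, -1, -1, -1, -1, -1, -1, -1));
      /* Only the low 4 lanes are used. */
      v = __builtin_elementwise_min(
          v, __builtin_shufflevector(v, v, 4, 5, 6, 7, -1, -1, -1, -1,
                                     -1, -1, -1, -1, -1, -1, -1, -1));
      /* Only the low 2 lanes are used. */
      v = __builtin_elementwise_min(
          v, __builtin_shufflevector(v, v, 2, 3, -1, -1, -1, -1, -1, -1,
                                     -1, -1, -1, -1, -1, -1, -1, -1));
      /* Only lane 0 is used. */
      v = __builtin_elementwise_min(
          v, __builtin_shufflevector(v, v, 1, -1, -1, -1, -1, -1, -1, -1,
                                     -1, -1, -1, -1, -1, -1, -1, -1));
      return v[0];
    }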

I opened an issue about masked lanes https://github.com/WebAssembly/flexible-vectors/issues/66 with more examples.

In SpiderMonkey, we already match some shuffle patterns to select better CPU instructions (https://searchfox.org/mozilla-central/source/js/src/jit/ShuffleAnalysis.h), but these shuffles are hard to match.
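
As an illustration of the matching problem (assumed engine behavior, not a measured result): the two reduction steps below compute identical values in the low 8 lanes, so either is a valid first step. The first fills the don't-care lanes with 0, which in general only matches a generic byte shuffle (e.g. pshufb plus a constant load on x86, per the cost noted above). The second repeats the high half, a concrete "duplicate high 64 bits" mask that a pattern matcher like ShuffleAnalysis has a much better chance of mapping to a single native instruction.

    #include <wasm_simd128.h>

    /* Both steps are interchangeable for the reduction: only the low 8 lanes matter. */
    v128_t step_zero_fill(v128_t v) {
      /* Don't-care lanes forced to 0: generally needs a generic byte shuffle. */
      return wasm_u8x16_min(
          v, wasm_i8x16_shuffle(v, v, 8, 9, 10, 11, 12, 13, 14, 15,
                                0, 0, 0, 0, 0, 0, 0, 0));
    }

    v128_t step_dup_high(v128_t v) {
      /* Don't-care lanes filled by repeating the high half: a pattern an engine
         may lower to one instruction (e.g. punpckhqdq on x86). */
      return wasm_u8x16_min(
          v, wasm_i8x16_shuffle(v, v, 8, 9, 10, 11, 12, 13, 14, 15,
                                8, 9, 10, 11, 12, 13, 14, 15));
    }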

This is more of an RFC issue to learn about the internals of the WebAssembly target and auto-vectorization, and to ask whether it is possible to improve the generation of these shuffles in Wasm when some lanes do not matter.

/cc @ppenzin @tlively

llvmbot commented 1 month ago

@llvm/issue-subscribers-backend-webassembly

Author: Yury Delendik (yurydelendik)

tlively commented 1 month ago

It looks like we already do some optimization of unused/undefined shuffle indices to generate better patterns: https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/WebAssembly/WebAssemblyISelLowering.cpp#L2321-L2330. That's probably the best place to add more such optimizations.

ppenzin commented 1 week ago

Should we try to disable horizontal reductions for small lane sizes? An 8-bit reduction would still result in four byte shuffles, which are pretty slow.

tlively commented 1 week ago

Disable as in error out if the frontend tries to use them, or disable as in bail out to some other lowering scheme?