llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

Static array indices in function parameter declarations cause poor SIMD code-generation #59120

Open Maratyszcza opened 1 year ago

Maratyszcza commented 1 year ago

Static array indices in function parameter declarations interfere with the lowering of SIMD load-and-splat instructions, causing Clang/LLVM to generate a separate full-vector load followed by a vector-to-vector broadcast instruction. Here's the simplest example (repro in Compiler Explorer):

#include <stddef.h>

#include <wasm_simd128.h>

struct minmax_params {
  float min[2];
  float max[2];
};

v128_t f(const struct minmax_params params[1])
{
    // Generates v128.load64_splat as expected
    return wasm_v128_load64_splat(&params->min);
}

v128_t g(const struct minmax_params params[static 1])
{
    // Generates two instructions: v128.load + i8x16.shuffle
    return wasm_v128_load64_splat(&params->min);
}

The example above is for WebAssembly SIMD, but this issue is not specific to this backend and can be reproduced at least on ARM as well.
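
For trying this on other backends, a target-independent variant may be handy. The sketch below assumes the GCC/Clang `vector_size` extension; `i64x2`, `load64_splat_generic`, `f_generic` and `g_generic` are hypothetical names that do not come from any header, and whether this variant shows the same code-generation difference on a given target has not been verified here.

```c
#include <assert.h>
#include <string.h>

struct minmax_params {
  float min[2];
  float max[2];
};

/* GCC/Clang vector extension: two 64-bit lanes in one 128-bit vector. */
typedef long long i64x2 __attribute__((vector_size(16)));
static_assert(sizeof(i64x2) == 16, "expected a 128-bit vector type");

/* Hypothetical helper: 8-byte load, then copy the value into both lanes. */
static inline i64x2 load64_splat_generic(const void *p) {
  long long x;
  memcpy(&x, p, sizeof x); /* alignment-agnostic 64-bit load */
  return (i64x2){x, x};
}

/* Plain array parameter: no promise about how many elements exist. */
i64x2 f_generic(const struct minmax_params params[1]) {
  return load64_splat_generic(&params->min);
}

/* `static 1`: the caller promises at least one valid element. */
i64x2 g_generic(const struct minmax_params params[static 1]) {
  return load64_splat_generic(&params->min);
}
```

Both functions return the same value; the only difference is the promise that `static 1` makes to the compiler, so any difference in the generated code comes from that promise.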

lukel97 commented 1 year ago

It looks like the use of static there makes the pointer argument dereferenceable:

define hidden <4 x i32> @f(ptr noundef %params) local_unnamed_addr {
entry:
  %params.val = load i64, ptr %params, align 1
  %vecinit.i = insertelement <2 x i64> undef, i64 %params.val, i64 0
  %vecinit2.i = shufflevector <2 x i64> %vecinit.i, <2 x i64> poison, <2 x i32> zeroinitializer
  %0 = bitcast <2 x i64> %vecinit2.i to <4 x i32>
  ret <4 x i32> %0
}

define hidden <4 x i32> @g(ptr nocapture noundef readonly align 4 dereferenceable(16) %params) local_unnamed_addr {
entry:
  %params.val = load i64, ptr %params, align 4
  %vecinit.i = insertelement <2 x i64> undef, i64 %params.val, i64 0
  %vecinit2.i = shufflevector <2 x i64> %vecinit.i, <2 x i64> poison, <2 x i32> zeroinitializer
  %0 = bitcast <2 x i64> %vecinit2.i to <4 x i32>
  ret <4 x i32> %0
}

Which, in turn, causes vector-combine to kick in and vectorise the load in g (all 16 bytes are known to be dereferenceable, so widening the 8-byte load is safe), so that it now looks like this:

define hidden <4 x i32> @g(ptr nocapture noundef readonly align 4 dereferenceable(16) %params) local_unnamed_addr {
entry:
  %0 = load <2 x i64>, ptr %params, align 4
  %vecinit.i = shufflevector <2 x i64> %0, <2 x i64> poison, <2 x i32> <i32 0, i32 undef>
  %vecinit2.i = shufflevector <2 x i64> %vecinit.i, <2 x i64> poison, <2 x i32> zeroinitializer
  %1 = bitcast <2 x i64> %vecinit2.i to <4 x i32>
  ret <4 x i32> %1
}
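
The two shapes of the load can be modelled in portable C to show that they compute the same value and differ only in how many bytes they read. This is a rough sketch, not LLVM's implementation: `narrow_load_splat` and `wide_load_shuffle` are hypothetical names, and the static assertions assume a 4-byte `float` (as on wasm32), which matches the `align 4 dereferenceable(16)` attributes on `%params`.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

struct minmax_params {
  float min[2];
  float max[2];
};

/* Assumption: 4-byte float, so one element is 16 bytes with 4-byte alignment. */
static_assert(sizeof(struct minmax_params) == 16, "expected 16-byte struct");
static_assert(_Alignof(struct minmax_params) == 4, "expected 4-byte alignment");

/* Shape of f: load 8 bytes (only `min`), then splat into both lanes. */
void narrow_load_splat(const struct minmax_params *p, uint64_t out[2]) {
  uint64_t x;
  memcpy(&x, p, sizeof x);
  out[0] = x;
  out[1] = x;
}

/* Shape of g after vector-combine: load all 16 bytes (`min` and `max`),
   then shuffle <0,0> to broadcast lane 0. Reading the extra 8 bytes is
   only valid because the whole element is known to be dereferenceable. */
void wide_load_shuffle(const struct minmax_params *p, uint64_t out[2]) {
  uint64_t v[2];
  memcpy(v, p, sizeof v);
  out[0] = v[0];
  out[1] = v[0];
}
```

The results are identical, so the transform is legal; the concern in this issue is that the backend no longer sees the narrow-load-plus-splat shape it knows how to match.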

During dag-combine, the vector shuffle and vector insert in f get combined into a BUILD_VECTOR:

Initial selection DAG: %bb.0 'f:entry'
SelectionDAG has 12 nodes:
  t0: ch,glue = EntryToken
  t7: i64 = Constant<0>
            t2: i32 = WebAssemblyISD::ARGUMENT TargetConstant:i32<0>
          t5: i64,ch = load<(load (s64) from %ir.params, align 4)> t0, t2, undef:i32
        t8: v2i64 = insert_vector_elt undef:v2i64, t5, Constant:i32<0>
      t9: v2i64 = vector_shuffle<0,0> t8, undef:v2i64
    t10: v4i32 = bitcast t9
  t11: ch = WebAssemblyISD::RETURN t0, t10

...

Optimized lowered selection DAG: %bb.0 'f:entry'
SelectionDAG has 8 nodes:
  t0: ch,glue = EntryToken
    t2: i32 = WebAssemblyISD::ARGUMENT TargetConstant:i32<0>
  t5: i64,ch = load<(load (s64) from %ir.params, align 4)> t0, t2, undef:i32
      t13: v2i64 = BUILD_VECTOR t5, t5
    t10: v4i32 = bitcast t13
  t11: ch = WebAssemblyISD::RETURN t0, t10

This comes from this part of DAGCombiner: https://github.com/llvm/llvm-project/blob/06ca5c81a4d88d9c33018d5a33e38c449109e5d6/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp#L23000-L23011

And this BUILD_VECTOR is ultimately what the WebAssembly backend uses to select v128.load64_splat. But because g never produces a BUILD_VECTOR, the instruction doesn't get selected. Its selection DAG looks like this:

Initial selection DAG: %bb.0 'g:entry'
SelectionDAG has 10 nodes:
  t0: ch,glue = EntryToken
  t3: i32 = Constant<0>
          t2: i32 = WebAssemblyISD::ARGUMENT TargetConstant:i32<0>
        t5: v2i64,ch = load<(dereferenceable load (s128) from %ir.params, align 4)> t0, t2, undef:i32
      t7: v2i64 = vector_shuffle<0,0> t5, undef:v2i64
    t8: v4i32 = bitcast t7
  t9: ch = WebAssemblyISD::RETURN t0, t8