Open Maratyszcza opened 1 year ago
It looks like the use of static there makes the pointer argument dereferenceable:
define hidden <4 x i32> @f(ptr noundef %params) local_unnamed_addr {
entry:
%params.val = load i64, ptr %params, align 1
%vecinit.i = insertelement <2 x i64> undef, i64 %params.val, i64 0
%vecinit2.i = shufflevector <2 x i64> %vecinit.i, <2 x i64> poison, <2 x i32> zeroinitializer
%0 = bitcast <2 x i64> %vecinit2.i to <4 x i32>
ret <4 x i32> %0
}
define hidden <4 x i32> @g(ptr nocapture noundef readonly align 4 dereferenceable(16) %params) local_unnamed_addr {
entry:
%params.val = load i64, ptr %params, align 4
%vecinit.i = insertelement <2 x i64> undef, i64 %params.val, i64 0
%vecinit2.i = shufflevector <2 x i64> %vecinit.i, <2 x i64> poison, <2 x i32> zeroinitializer
%0 = bitcast <2 x i64> %vecinit2.i to <4 x i32>
ret <4 x i32> %0
}
This in turn causes vector-combine to kick in and vectorise the load in g, so that it now looks like this:
define hidden <4 x i32> @g(ptr nocapture noundef readonly align 4 dereferenceable(16) %params) local_unnamed_addr {
entry:
%0 = load <2 x i64>, ptr %params, align 4
%vecinit.i = shufflevector <2 x i64> %0, <2 x i64> poison, <2 x i32> <i32 0, i32 undef>
%vecinit2.i = shufflevector <2 x i64> %vecinit.i, <2 x i64> poison, <2 x i32> zeroinitializer
%1 = bitcast <2 x i64> %vecinit2.i to <4 x i32>
ret <4 x i32> %1
}
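To make the legality argument concrete, here is a rough scalar model in C of the two load shapes. It is only a sketch: the v2u64 struct and all three function names are made up for illustration, and memcpy stands in for the IR loads. The wide version is valid only because static 4 promises 16 readable bytes, which is what dereferenceable(16) records.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a <2 x i64> vector. */
typedef struct { uint64_t lane[2]; } v2u64;

/* Narrow shape: load 8 bytes, then duplicate them into both lanes. */
v2u64 narrow_load_splat(const uint32_t params[static 4]) {
    uint64_t x;
    memcpy(&x, params, sizeof x);
    v2u64 r = { { x, x } };
    return r;
}

/* Wide shape: load all 16 bytes, then broadcast lane 0.
   Reading 16 bytes is safe only because the caller guarantees them. */
v2u64 wide_load_shuffle(const uint32_t params[static 4]) {
    v2u64 whole;
    memcpy(&whole, params, sizeof whole);
    v2u64 r = { { whole.lane[0], whole.lane[0] } };
    return r;
}

/* Sanity check: both shapes must agree on any 16-byte input. */
void check_equivalence(const uint32_t params[static 4]) {
    v2u64 a = narrow_load_splat(params);
    v2u64 b = wide_load_shuffle(params);
    assert(a.lane[0] == b.lane[0] && a.lane[1] == b.lane[1]);
}
```

Both shapes compute the same value, so the widening is correct; the problem is only that the wide shape no longer matches what the backend looks for.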
During dag-combine, the vector shuffle and vector insert in f get combined into a BUILD_VECTOR:
Initial selection DAG: %bb.0 'f:entry'
SelectionDAG has 12 nodes:
t0: ch,glue = EntryToken
t7: i64 = Constant<0>
t2: i32 = WebAssemblyISD::ARGUMENT TargetConstant:i32<0>
t5: i64,ch = load<(load (s64) from %ir.params, align 4)> t0, t2, undef:i32
t8: v2i64 = insert_vector_elt undef:v2i64, t5, Constant:i32<0>
t9: v2i64 = vector_shuffle<0,0> t8, undef:v2i64
t10: v4i32 = bitcast t9
t11: ch = WebAssemblyISD::RETURN t0, t10
...
Optimized lowered selection DAG: %bb.0 'f:entry'
SelectionDAG has 8 nodes:
t0: ch,glue = EntryToken
t2: i32 = WebAssemblyISD::ARGUMENT TargetConstant:i32<0>
t5: i64,ch = load<(load (s64) from %ir.params, align 4)> t0, t2, undef:i32
t13: v2i64 = BUILD_VECTOR t5, t5
t10: v4i32 = bitcast t13
t11: ch = WebAssemblyISD::RETURN t0, t10
This comes from the following part of DAGCombiner: https://github.com/llvm/llvm-project/blob/06ca5c81a4d88d9c33018d5a33e38c449109e5d6/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp#L23000-L23011
And this is ultimately what the WebAssembly backend uses to select v128.load64_splat. But because g doesn't use BUILD_VECTOR, the instruction doesn't get selected. Its selection DAG looks like this:
Initial selection DAG: %bb.0 'g:entry'
SelectionDAG has 10 nodes:
t0: ch,glue = EntryToken
t3: i32 = Constant<0>
t2: i32 = WebAssemblyISD::ARGUMENT TargetConstant:i32<0>
t5: v2i64,ch = load<(dereferenceable load (s128) from %ir.params, align 4)> t0, t2, undef:i32
t7: v2i64 = vector_shuffle<0,0> t5, undef:v2i64
t8: v4i32 = bitcast t7
t9: ch = WebAssemblyISD::RETURN t0, t8
Static array indices in function parameter declarations interfere with the lowering of SIMD load-and-splat instructions, causing Clang/LLVM to generate a separate full-vector load followed by a vector-to-vector broadcast instruction. Here's the simplest example (repro in Compiler Explorer):
The example above is for WebAssembly SIMD, but this issue is not specific to this backend and can be reproduced at least on ARM as well.
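For readers without the Compiler Explorer link at hand, the difference between the two parameter declarations can be sketched in portable C. This is not the original repro: the splat64 struct and both function names are hypothetical, the element type uint32_t is an assumption (any 4-byte type would fit the align 4 dereferenceable(16) attributes), and a scalar memcpy stands in for the load-and-splat intrinsic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a vector of two 64-bit lanes. */
typedef struct { uint64_t lane[2]; } splat64;

/* Plain pointer: nothing is promised about how many bytes are readable,
   so the compiler must keep the 8-byte load as it is. */
splat64 splat_from_pointer(const void *params) {
    assert(params != NULL);
    uint64_t x;
    memcpy(&x, params, sizeof x);
    splat64 r = { { x, x } };
    return r;
}

/* static 4: the caller promises at least 4 elements, i.e. 16 readable
   bytes, which gives the optimizer room to widen the load. */
splat64 splat_from_static_array(const uint32_t params[static 4]) {
    uint64_t x;
    memcpy(&x, params, sizeof x);
    splat64 r = { { x, x } };
    return r;
}
```

The two bodies are identical; only the declaration differs, which is what makes the difference in generated code surprising.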