DAGCombiner is too late to make a decision about this; we need a TTI cost-model-driven scalarization stage. Not sure if this is the kind of thing we'd want in SLP?
Extended Description
With few vector lanes and a large lane width (e.g. v2i64), LLVM's vector lowering of the ctlz intrinsic can actually be slower than a scalar implementation.
The following code produces a complicated divide-and-conquer algorithm (with -O3 -mcpu=sandybridge):
Which produces the following assembly:
The scalar algorithm is merely something like:
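As a rough illustration of the per-lane scalar approach (not the original snippet from this report), a minimal C sketch might look like the following. The names ctlz64 and ctlz_v2i64_scalar are hypothetical, the v2i64 vector is modelled as a plain two-element array, and the GCC/Clang builtin __builtin_clzll is assumed to be available. The zero input is handled explicitly because the builtin's result is undefined for 0.

```c
#include <stdint.h>

/* Hypothetical helper: count leading zeros of a single 64-bit lane.
   __builtin_clzll is undefined for 0, so return the lane width instead. */
static inline uint64_t ctlz64(uint64_t x) {
    return x ? (uint64_t)__builtin_clzll(x) : 64;
}

/* Hypothetical per-lane scalar ctlz for a 2 x i64 vector,
   modelled here as a two-element array. */
void ctlz_v2i64_scalar(const uint64_t in[2], uint64_t out[2]) {
    for (int i = 0; i < 2; ++i)
        out[i] = ctlz64(in[i]);
}
```

On targets with a hardware count-leading-zeros or bit-scan instruction, each lane would presumably lower to a handful of scalar instructions, which is the comparison this report is making against the vector expansion.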
Compare: https://godbolt.org/z/ZdW3Yj
I have verified on my machine (Sandy Bridge) that the scalar algorithm is significantly faster (up to 2x) for v2i64 and v2i32, with similar results for v4i32 and v4i64.