Open llvmbot opened 10 years ago
This appears to be fixed for 32-bit ARM, but this code (copied from attachment):
#include <stdint.h>

typedef uint8_t uchar4 __attribute__((__vector_size__(4)));
void dut0(uchar4 * restrict out, uchar4 const * restrict in, int count, uint8_t const * restrict tab) {
    /* four 256-entry tables, one per lane */
    uint8_t const *t0 = tab, *t1 = tab + 256, *t2 = tab + 512, *t3 = tab + 768;
    while (--count >= 0) {
        uchar4 tmp = *in++;
        *out++ = (uchar4){ t0[tmp[0]], t1[tmp[1]], t2[tmp[2]], t3[tmp[3]] };
    }
}
Compiled:
clang --target=aarch64-linux-gnu -Ofast -S foo.c -o-
Gives this loop body:
ldrb w12, [x1]
ldrb w13, [x1, #1]
ldrb w14, [x1, #2]
ldrb w15, [x1, #3]
ins v0.h[0], w12
ins v0.h[1], w13
ins v0.h[2], w14
ins v0.h[3], w15
umov w12, v0.h[0]
umov w13, v0.h[1]
umov w14, v0.h[2]
umov w15, v0.h[3]
and x12, x12, #0xff
and x13, x13, #0xff
and x14, x14, #0xff
and x15, x15, #0xff
ldrb w15, [x10, x15]
ldrb w14, [x9, x14]
ldrb w13, [x8, x13]
ldrb w12, [x3, x12]
add x1, x1, #4 // =4
strb w15, [x0, #3]
strb w14, [x0, #2]
strb w13, [x0, #1]
strb w12, [x0], #4
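Twelve of those instructions (the four ins, the four umov, and the four and) could simply be deleted: each ldrb already zero-extends its byte into the register that is later used as the table index.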
Extended Description
Given a simple LUT loop which operates independently on each element of a uchar4, trunk presses ahead with a bunch of inserts and extracts through a temporary vector, along with a flurry of type conversion operations.
In contrast, Clang 3.4 appears to abandon the vector pretense at the outset and takes the scalar values directly from the source pointer -- completing the work in half the time.
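For reference, the scalar form described above can be sketched in C. This is only an illustration of the intended code shape, not the output of any compiler; the name dut0_scalar is hypothetical, and it takes plain uint8_t pointers instead of uchar4 pointers so that no vector type is involved at all.

```c
#include <stdint.h>

/* Hypothetical scalar equivalent of dut0 (name and signature are assumptions).
 * Each input byte is loaded straight from memory and used as a table index,
 * so there is no round trip through a temporary vector. */
void dut0_scalar(uint8_t *restrict out, uint8_t const *restrict in,
                 int count, uint8_t const *restrict tab) {
    /* four 256-entry tables, one per lane, as in dut0 */
    uint8_t const *t0 = tab, *t1 = tab + 256, *t2 = tab + 512, *t3 = tab + 768;
    while (--count >= 0) {
        out[0] = t0[in[0]];
        out[1] = t1[in[1]];
        out[2] = t2[in[2]];
        out[3] = t3[in[3]];
        in += 4;   /* advance by one 4-byte element */
        out += 4;
    }
}
```

Per iteration this needs only four byte loads from the input, four indexed table loads, and four byte stores, which matches the instructions that would remain in the AArch64 loop once the inserts, extracts, and masks are removed.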
GCC 4.8 appears to behave like Clang 3.4, but additionally packs the output into a single 32-bit scalar register before writing.
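The packing behaviour can likewise be sketched in C. Again this is an assumption-laden illustration rather than GCC's actual output: the name dut0_packed is hypothetical, the pointers are plain uint8_t, and the shift order assumes a little-endian target so that lane 0 lands in the lowest-addressed byte.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical packed-store variant of dut0 (name and signature are
 * assumptions). The four looked-up bytes are combined in one 32-bit
 * scalar and written with a single 4-byte store. Assumes little-endian. */
void dut0_packed(uint8_t *restrict out, uint8_t const *restrict in,
                 int count, uint8_t const *restrict tab) {
    uint8_t const *t0 = tab, *t1 = tab + 256, *t2 = tab + 512, *t3 = tab + 768;
    while (--count >= 0) {
        uint32_t packed = ((uint32_t)t0[in[0]])
                        | ((uint32_t)t1[in[1]] << 8)
                        | ((uint32_t)t2[in[2]] << 16)
                        | ((uint32_t)t3[in[3]] << 24);
        /* memcpy avoids alignment and aliasing problems; compilers
         * typically lower a fixed 4-byte copy to one store */
        memcpy(out, &packed, sizeof packed);
        in += 4;
        out += 4;
    }
}
```

This trades four byte stores for three shift-or steps and one word store; whether that is a win depends on the target.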
This affects both amd64 and ARM. To reproduce, compile with -Ofast on amd64; on ARM, additionally pass -mfpu=neon to ensure that SIMD operations are available.