uchar4 vector element-by-element LUT handled worse than 3.4.

Bugzilla Link	20006
Version	trunk
OS	Linux
Attachments	Function demonstrating regression.
Reporter	LLVM Bugzilla Contributor
CC	@zygoloid,@stephenhines

Extended Description

Given a simple LUT loop which operates independently on each element of a uchar4, trunk presses ahead with a bunch of inserts and extracts through a temporary vector, along with a flurry of type conversion operations.

In contrast, Clang 3.4 appears to abandon the vector pretense at the outset and takes the scalar values directly from the source pointer -- completing the work in half the time.

GCC 4.8 appears to behave like Clang 3.4, but additionally packs the output into a single 32-bit scalar register before writing.

This affects both amd64 and ARM. To reproduce on amd64, compile with -Ofast; on ARM, additionally pass -mfpu=neon to ensure that SIMD operations are available.

This appears to be fixed for 32-bit ARM, but not for AArch64. This code (copied from the attachment):

#include <stdint.h>

typedef uint8_t uchar4 __attribute__((__vector_size__(4)));

void dut0(uchar4 * restrict out, uchar4 const * restrict in, int count, uint8_t const * restrict tab) {
  uint8_t const *t0 = tab, *t1 = tab + 256, *t2 = tab + 512, *t3 = tab + 768;
  while (--count >= 0) {
    uchar4 tmp = *in++;
    *out++ = (uchar4){ t0[tmp[0]], t1[tmp[1]], t2[tmp[2]], t3[tmp[3]] };
  }
}

Compiled:

clang --target=aarch64-linux-gnu -Ofast -S foo.c -o-

Gives this loop body:

    ldrb    w12, [x1]
    ldrb    w13, [x1, #1]
    ldrb    w14, [x1, #2]
    ldrb    w15, [x1, #3]
    ins     v0.h[0], w12
    ins     v0.h[1], w13
    ins     v0.h[2], w14
    ins     v0.h[3], w15
    umov    w12, v0.h[0]
    umov    w13, v0.h[1]
    umov    w14, v0.h[2]
    umov    w15, v0.h[3]
    and     x12, x12, #0xff
    and     x13, x13, #0xff
    and     x14, x14, #0xff
    and     x15, x15, #0xff
    ldrb    w15, [x10, x15]
    ldrb    w14, [x9, x14]
    ldrb    w13, [x8, x13]
    ldrb    w12, [x3, x12]
    add     x1, x1, #4              // =4
    strb    w15, [x0, #3]
    strb    w14, [x0, #2]
    strb    w13, [x0, #1]
    strb    w12, [x0], #4

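Twelve of those instructions (the four ins, the four umov, and the four and) could simply be deleted: ldrb already zero-extends each byte, so the round trip through v0 and the masking change nothing.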
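For comparison, here is a hand-written sketch of the scalar form that the description attributes to Clang 3.4 and GCC 4.8: each byte is read straight from the source pointer, and the four results are packed into one 32-bit value before a single store. The names dut0_scalar and dut0_scalar_matches_reference are hypothetical and not part of the attachment. The packing order assumes a little-endian target, which holds for amd64 and for typical ARM configurations. It illustrates the expected shape of the loop and is not output from either compiler.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar rewrite of dut0. Same table layout as the original
   (four consecutive 256-byte tables), but operating on plain byte pointers. */
void dut0_scalar(uint8_t *restrict out, uint8_t const *restrict in,
                 int count, uint8_t const *restrict tab) {
  uint8_t const *t0 = tab, *t1 = tab + 256, *t2 = tab + 512, *t3 = tab + 768;
  while (--count >= 0) {
    /* Look up each byte directly; no temporary vector is involved. */
    uint32_t packed = (uint32_t)t0[in[0]]
                    | (uint32_t)t1[in[1]] << 8
                    | (uint32_t)t2[in[2]] << 16
                    | (uint32_t)t3[in[3]] << 24;
    /* Assumption: little-endian, so the low byte lands in out[0]. */
    memcpy(out, &packed, sizeof packed);
    in += 4;
    out += 4;
  }
}

/* Returns 1 if dut0_scalar agrees with a byte-by-byte reference lookup. */
int dut0_scalar_matches_reference(void) {
  static uint8_t tab[1024];
  uint8_t const in[8] = {0, 1, 2, 3, 255, 128, 64, 7};
  uint8_t out[8] = {0};
  int ok = 1;

  assert(sizeof tab == 4 * 256); /* four 256-entry tables */
  for (int i = 0; i < 1024; ++i)
    tab[i] = (uint8_t)(i * 7 + (i >> 8)); /* arbitrary, differs per table */

  dut0_scalar(out, in, 2, tab);

  for (int i = 0; i < 8; ++i)
    ok &= (out[i] == tab[(i % 4) * 256 + in[i]]);
  return ok;
}
```

Compiling dut0_scalar with the same flags as above and comparing its loop body against the one for dut0 gives a rough baseline for what the vector version should reduce to.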
