llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

uchar4 vector element-by-element LUT handled worse than 3.4. #20380

Open llvmbot opened 10 years ago

llvmbot commented 10 years ago
Bugzilla Link: 20006
Version: trunk
OS: Linux
Attachments: Function demonstrating regression.
Reporter: LLVM Bugzilla Contributor
CC: @zygoloid, @stephenhines

Extended Description

Given a simple LUT loop which operates independently on each element of a uchar4, trunk presses ahead with a bunch of inserts and extracts through a temporary vector, along with a flurry of type conversion operations.

In contrast, Clang 3.4 appears to abandon the vector pretense at the outset and takes the scalar values directly from the source pointer -- completing the work in half the time.
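For illustration, the scalar form that this behaviour corresponds to can be sketched in C roughly as follows. This is only a sketch, not compiler output: the name lut_scalar is hypothetical, and the layout of four consecutive 256-entry tables is assumed from the attached function.

```c
#include <stdint.h>

/* Hypothetical scalar rewrite of the per-lane lookup: each byte is read
   straight from the source pointer, so no vector temporary is built.
   Assumes tab holds four consecutive 256-entry tables, one per lane. */
void lut_scalar(uint8_t *restrict out, const uint8_t *restrict in,
                int count, const uint8_t *restrict tab) {
  const uint8_t *t0 = tab, *t1 = tab + 256, *t2 = tab + 512, *t3 = tab + 768;
  while (--count >= 0) {
    out[0] = t0[in[0]];  /* lane 0 uses the first table */
    out[1] = t1[in[1]];  /* lane 1 uses the second table */
    out[2] = t2[in[2]];
    out[3] = t3[in[3]];
    in += 4;             /* advance by one 4-byte element */
    out += 4;
  }
}
```

Here count is the number of 4-byte elements, matching the count argument of the original loop.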

GCC 4.8 appears to behave like Clang 3.4, but additionally packs the output into a single 32-bit scalar register before writing.
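The packing step can be sketched in C along these lines. This is an approximation of the described behaviour, not GCC's actual output: lut_packed is a hypothetical name, the four-table layout is assumed from the attached function, and the shift order assumes a little-endian target.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: look up each lane with scalar loads, combine the
   four result bytes in one 32-bit value, then write it with a single
   4-byte store. The shift order assumes a little-endian target. */
void lut_packed(uint8_t *restrict out, const uint8_t *restrict in,
                int count, const uint8_t *restrict tab) {
  const uint8_t *t0 = tab, *t1 = tab + 256, *t2 = tab + 512, *t3 = tab + 768;
  while (--count >= 0) {
    uint32_t packed = (uint32_t)t0[in[0]]
                    | ((uint32_t)t1[in[1]] << 8)
                    | ((uint32_t)t2[in[2]] << 16)
                    | ((uint32_t)t3[in[3]] << 24);
    memcpy(out, &packed, sizeof packed);  /* alignment-safe 32-bit write */
    in += 4;
    out += 4;
  }
}
```

The memcpy keeps the 32-bit write free of alignment and aliasing problems; compilers typically lower it to a single store.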

This affects both amd64 and ARM. To reproduce, compile with -Ofast on amd64; on ARM, additionally pass -mfpu=neon so that SIMD operations are available.

llvmbot commented 8 years ago

This appears to be fixed for 32-bit ARM, but this code (copied from attachment):

#include <stdint.h>

typedef uint8_t uchar4 __attribute__((__vector_size__(4)));

void dut0(uchar4 * restrict out, uchar4 const * restrict in, int count, uint8_t const * restrict tab) {
  uint8_t const *t0 = tab, *t1 = tab + 256, *t2 = tab + 512, *t3 = tab + 768;
  while (--count >= 0) {
    uchar4 tmp = *in++;
    *out++ = (uchar4){ t0[tmp[0]], t1[tmp[1]], t2[tmp[2]], t3[tmp[3]] };
  }
}

Compiled:

clang --target=aarch64-linux-gnu -Ofast -S foo.c -o-

Gives this loop body:

    ldrb    w12, [x1]
    ldrb    w13, [x1, #1]
    ldrb    w14, [x1, #2]
    ldrb    w15, [x1, #3]
    ins     v0.h[0], w12
    ins     v0.h[1], w13
    ins     v0.h[2], w14
    ins     v0.h[3], w15
    umov    w12, v0.h[0]
    umov    w13, v0.h[1]
    umov    w14, v0.h[2]
    umov    w15, v0.h[3]
    and     x12, x12, #0xff
    and     x13, x13, #0xff
    and     x14, x14, #0xff
    and     x15, x15, #0xff
    ldrb    w15, [x10, x15]
    ldrb    w14, [x9, x14]
    ldrb    w13, [x8, x13]
    ldrb    w12, [x3, x12]
    add     x1, x1, #4              // =4
    strb    w15, [x0, #3]
    strb    w14, [x0, #2]
    strb    w13, [x0, #1]
    strb    w12, [x0], #4

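Twelve of those instructions (the four ins, the four umov, and the four and) could simply be deleted: each ldrb already zero-extends its byte into the destination register, so the round trip through v0 and the masking accomplish nothing.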