llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

aarch64: SLP of 4-byte add should not extend to half-word, add, and then truncate; just do the add #81374

Open pinskia opened 7 months ago

pinskia commented 7 months ago

Take:

void f(unsigned char * __restrict a, unsigned char * __restrict b)
{
 a[0] += b[0];
 a[1] += b[1];
 a[2] += b[2];
 a[3] += b[3];
}

Right now aarch64 produces:

f: 
        ldr     s0, [x1]
        ldr     s1, [x0]
        uaddl   v0.8h, v1.8b, v0.8b
        xtn     v0.8b, v0.8h
        str     s0, [x0]
        ret

But it could just use a plain byte add, without the extra truncate at the end:

f:
        ldr     s0, [x1]
        ldr     s1, [x0]
        add   v0.8b, v1.8b, v0.8b
        str     s0, [x0]
        ret

Reductions, multiply, popcount, compares, and many more have a similar issue: they extend to half-word and then truncate at the end, instead of treating the other 4 entries in the vector as don't-care. I found this while implementing the same optimization in the GCC backend and thought I would file it.
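
As a hypothetical companion case (my own, not from the report), a 4-byte multiply should show the same shape: I would expect the current codegen to widen through a half-word multiply and then truncate back down, even though NEON has an 8-bit lane multiply and the low 8 bits of an 8x8->16 product equal the 8-bit result. The function g below is only an illustration:

void g(unsigned char * __restrict a, unsigned char * __restrict b)
{
 /* Same shape as f() above, but with a multiply; the low 8 bits of a
    widened 16-bit product equal the 8-bit product, so the widening and
    the final truncate should again be avoidable. */
 a[0] *= b[0];
 a[1] *= b[1];
 a[2] *= b[2];
 a[3] *= b[3];
}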

llvmbot commented 7 months ago

@llvm/issue-subscribers-backend-aarch64

Author: Andrew Pinski (pinskia)

davemgreen commented 7 months ago

Thanks for the bug report. This is because we widen all small integer vectors in SelectionDAG. It can make sense in some cases, but it leads to worse performance in others. I'm hoping that we can at least make GlobalISel pick the better size.
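
For contrast, here is a sketch of my own (not from the thread) of the same pattern over 8 bytes, where the vectorized add maps onto the already-legal v8i8 type. I would expect this version to get a single 8-lane byte add with no uaddl/xtn pair, which is what makes the 4-byte case stand out:

void f8(unsigned char * __restrict a, unsigned char * __restrict b)
{
 /* 8 lanes fill a d-register exactly (v8i8 is a legal NEON type), so
    the SLP-vectorized add should not need to widen to half-words. */
 a[0] += b[0];
 a[1] += b[1];
 a[2] += b[2];
 a[3] += b[3];
 a[4] += b[4];
 a[5] += b[5];
 a[6] += b[6];
 a[7] += b[7];
}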