herd / herdtools7

The Herd toolsuite to deal with .cat memory models (version 7.xx)

More SVE instructions #856

Closed maranget closed 2 months ago

maranget commented 3 months ago

Add the AArch64 SVE instructions RDVL and ADDVL.

murzinv commented 3 months ago

LGTM :+1:

maranget commented 3 months ago

Hi @relokin, I am currently adding the CNT and INC instructions, which look sufficient to write vector length independent code.

maranget commented 3 months ago

Hi @murzinv and @relokin. I'd be glad to have your opinion on the last two commits. Then, we can consider merging.

murzinv commented 3 months ago

which look sufficient to write vector length independent code

VL independent code is usually written with loop partitioning, something like:

AArch64 T
{
 uint32_t x[8] = {1,2,3,4,5,6,7,8};
 0:X0=x;
}
P0 ;
MOV X1,#0                      ;
MOV X2,#8                      ;
WHILELT P0.S,X1,X2             ;
L0:                            ;
LD1W {Z0.S},P0/Z,[X0,X1,LSL #2];
ADD Z0.S,Z0.S,Z0.S             ;
ST1W {Z0.S},P0,[X0,X1,LSL #2]  ;
INCW X1                        ;
WHILELT P0.S,X1,X2             ;
B.FIRST L0                     ;

forall x={2,4,6,8,10,12,14,16}

This requires only INCW to be implemented.
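To make the partitioning explicit, here is a minimal scalar sketch in C of what the loop above computes. The helper name double_in_place and the lanes parameter are hypothetical and not part of herdtools7; lanes stands in for the number of 32-bit elements per vector, which is the amount INCW adds to X1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of herdtools7: a scalar model of the
   WHILELT/INCW loop in the test above. `lanes` models the number of
   32-bit lanes per vector; it is a parameter because the result must
   not depend on it. */
void double_in_place(uint32_t *x, size_t n, size_t lanes) {
    assert(lanes > 0);                       /* a vector has at least one lane */
    for (size_t i = 0; i < n; i += lanes) {  /* INCW X1, then WHILELT + B.FIRST */
        for (size_t l = 0; l < lanes; l++) {
            if (i + l < n) {                 /* lane is active in P0 */
                x[i + l] += x[i + l];        /* LD1W, ADD, ST1W on that lane */
            }
        }
    }
}
```

Under this model, calling double_in_place on {1,2,3,4,5,6,7,8} with any positive lanes value should yield the values in the forall condition: the last iteration simply runs with fewer active lanes.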

I'd be glad to have your opinion on the last two commits. Then, we can consider merging.

I'll have a closer look after the weekend.

maranget commented 3 months ago

Hi @murzinv, I have added your test as V26. It works perfectly. Test V25 was inspired by compiler output...

murzinv commented 3 months ago

I could not find any issues with this PR other than a few nitpicks. LGTM :+1:

murzinv commented 3 months ago

Test V25 was inspired by compiler output...

Out of curiosity, could you share compiler input? :smile:

maranget commented 2 months ago

Test V25 was inspired by compiler output...

Out of curiosity, could you share compiler input? 😄

I'll look for it.

maranget commented 2 months ago

Thanks for your review @murzinv. Let us delay the merge, as I'd like to simplify the history.

maranget commented 2 months ago

Test V25 was inspired by compiler output...

Out of curiosity, could you share compiler input? 😄

Compiler input:

#include <arm_sve.h>

void saxpy_c(int32_t *x, int32_t *y) {
        int i;
        for (i=0; i<1023; i++) {
                y[i] = x[i] + y[i];
        }
}

Original code is from this page. Compilation command is gcc -S -march=armv8-a+sve -O2 a.c on a MacBook M1.

% gcc --version
Apple clang version 15.0.0 (clang-1500.1.0.2.5)
Target: arm64-apple-darwin23.2.0
...

murzinv commented 2 months ago

Thanks! Interestingly, the assembly output on that page uses the while* / inc* idiom, and I can see that GCC can generate some form of it as well once the optimization level is raised to -O3. So both V25 and V26 reflect real-world codegen.
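For reference, the while* / inc* idiom can also be written by hand with the ACLE SVE intrinsics. The following is a sketch only, not the code from that page and not compiler output; the function name vla_add is hypothetical. It assumes a toolchain that defines __ARM_FEATURE_SVE when SVE is enabled, and falls back to a plain scalar loop otherwise so that it builds anywhere.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#endif

/* Hypothetical helper: y[i] = x[i] + y[i] for 0 <= i < n. */
void vla_add(const int32_t *x, int32_t *y, int64_t n) {
    assert(n >= 0);
#if defined(__ARM_FEATURE_SVE)
    int64_t i = 0;
    svbool_t pg = svwhilelt_b32(i, n);            /* WHILELT: lanes with i + lane < n */
    while (svptest_any(svptrue_b32(), pg)) {      /* continue while any lane is active */
        svint32_t vx = svld1(pg, x + i);          /* predicated loads */
        svint32_t vy = svld1(pg, y + i);
        svst1(pg, y + i, svadd_x(pg, vx, vy));    /* predicated add and store */
        i += (int64_t)svcntw();                   /* INCW: advance by 32-bit lanes per vector */
        pg = svwhilelt_b32(i, n);
    }
#else
    /* Scalar fallback when SVE is not available at compile time. */
    for (int64_t i = 0; i < n; i++) {
        y[i] += x[i];
    }
#endif
}
```

The vector length never appears as a constant: the predicate from svwhilelt_b32 handles the tail, so the same binary should work for any SVE vector length.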

maranget commented 2 months ago

Merged, thanks for the review @murzinv.