RISCV64 miscompile at -O2

patrick-rivos commented 9 months ago

Testcase:

int a;
int b[8][6];
int c;
int main() {
  signed char f = 0;
  for (signed char g = 0; g <= 5; g++) {
    f = b[g + 2][0] % 8;
    b[0][g] = f + (long)1;
  }
  for (int d = 0; d < 6; d++)
    a = b[c][d];
  if (a == 1)
    return 0;
  else
    return 1;
}

Commands:

> /scratch/tc-testing/llvm-feb-5/build/bin/clang -O2 -march=rv64gcv red.c -o user-config.out
> QEMU_CPU=rv64,vlen=128,v=true,vext_spec=v1.0,Zve32f=true,Zve64f=true /scratch/tc-testing/llvm-feb-5/build/bin/qemu-riscv64 user-config.out
> echo $?
1
> /scratch/tc-testing/llvm-feb-5/build/bin/clang -O1 -march=rv64gcv red.c -o user-config.out
> QEMU_CPU=rv64,vlen=128,v=true,vext_spec=v1.0,Zve32f=true,Zve64f=true /scratch/tc-testing/llvm-feb-5/build/bin/qemu-riscv64 user-config.out
> echo $?
0

Bugpoint/asm: bugpoint-and-asm.zip

opt-bisect-limit points to InferAlignmentPass

Discovered/tested using a7bc9cb6ffa91ff0ebabc45c0c7263c7c2c3a4de (not bisected) Found using fuzzer.

preames commented 9 months ago

I think this is some kind of bug in our gep lowering.

This: %0 = tail call <4 x i32> @llvm.masked.gather.v4i32.v4p0(<4 x ptr> <ptr getelementptr inbounds ([8 x [6 x i32]], ptr @b, i64 0, i64 2), ptr getelementptr inbounds ([8 x [6 x i32]], ptr @b, i64 0, i64 3), ptr getelementptr inbounds ([8 x [6 x i32]], ptr @b, i64 0, i64 4), ptr getelementptr inbounds ([8 x [6 x i32]], ptr @b, i64 0, i64 5)>, i32 4, <4 x i1> <i1 true, i1 true, i1 true, i1 true>, <4 x i32> poison), !tbaa !7

Is becoming this after initial SDAG construction:

    t3: v4i1 = BUILD_VECTOR Constant:i1<-1>, Constant:i1<-1>, Constant:i1<-1>, Constant:i1<-1>
      t7: i64 = add nuw GlobalAddress:i64<ptr @b> 0, Constant:i64<48>
      t9: i64 = add nuw GlobalAddress:i64<ptr @b> 0, Constant:i64<72>
      t11: i64 = add nuw GlobalAddress:i64<ptr @b> 0, Constant:i64<96>
      t13: i64 = add nuw GlobalAddress:i64<ptr @b> 0, Constant:i64<120>
    t14: v4i64 = BUILD_VECTOR t7, t9, t11, t13
  t16: v4i32,ch = masked_gather<(load unknown-size, align 4, !tbaa !7), signed unscaled offset> t0, undef:v4i32, t3, Constant:i64<0>

Unless I'm missing something here, those offsets are wrong. ptr getelementptr inbounds ([8 x [6 x i32]], ptr @b, i64 0, i64 2) should be @b + 8 bytes not @b + 48 bytes. It almost seems like we're scaling by the inner array size (4 x 6 = 24) here.

I'm a bit distrustful of my own claim here. Am I misreading something? This seems too badly broken to have survived in tree for long?

topperc commented 9 months ago

I think 48 is correct. We have a global variable pointing to an array of arrays. The first 0 index has to step over the global variable. https://llvm.org/docs/GetElementPtr.html#why-is-the-extra-0-index-required The second index indexes into the outer array which has [6 x i32] elements.

preames commented 9 months ago

I think 48 is correct. We have a global variable pointing to an array of arrays. The first 0 index has to step over the global variable. https://llvm.org/docs/GetElementPtr.html#why-is-the-extra-0-index-required The second index indexes into the outer array which has [6 x i32] elements.

You're right, I was forgetting we had to index "in to" the global. What confused me was that the pointer type being returned was a pointer to the inner array, and there's an implicit [0] from the source code. Another case where pointer types just confuse things. :)

Back to square zero, my analysis above is completely off.

topperc commented 9 months ago

I'm unable to reproduce this with the prebuilt copies of qemu I have around. So I don't know if I'm doing something wrong or there's some qemu bug we have fixed.

patrick-rivos commented 9 months ago

I used qemu v8.1.2. I can try again with a more tip-of-tree QEMU.

patrick-rivos commented 9 months ago

I tried with both v8.2.1 and tip-of-tree QEMU and could reproduce the failure on both.

topperc commented 9 months ago

I can't reproduce the error, but I've looked at the opt-bisect right. It looks like the only difference in the final assembly is the p2align directive on b?

patrick-rivos commented 9 months ago

Yep that's what I'm seeing too. Fails with .p2align 2, 0x0 but passes with .p2align 4, 0x0

> /scratch/tc-testing/llvm-feb-5/build/bin/clang exit-code-0-SROAPass.S -o user-config.out
> QEMU_CPU=rv64,vlen=128,v=true,vext_spec=v1.0,Zve32f=true,Zve64f=true /scratch/tc-testing/llvm-feb-5/build/bin/qemu-riscv64 user-config.out
> echo $?
0
> /scratch/tc-testing/llvm-feb-5/build/bin/clang exit-code-1-InferAlignmentPass.S -o user-config.out
> QEMU_CPU=rv64,vlen=128,v=true,vext_spec=v1.0,Zve32f=true,Zve64f=true /scratch/tc-testing/llvm-feb-5/build/bin/qemu-riscv64 user-config.out
> echo $?
1

asm.zip

preames commented 9 months ago

An interesting datapoint

$ CC -fuse-ld=lld repo-O2-patched.s -o repo-O2-patched
$ qemu-riscv64 repo-O2-patched
$ echo $?
0
$CC -fuse-ld=ld repo-O2-patched.s -o repo-O2-patched
$ qemu-riscv64 repo-O2-patched
$ echo $?
1

repo-O2-patched.s is the .s output of O2 per the above reproducer with the alignment of b reduced to 4.

The LLD in question is a somewhat recently build upstream LLD copy. The LD is some downstream fork of uncertain heritage. It's entirely possible it's a downstream LD bug.

Here's another data point.

repo-O0-ld
0
repo-O0-lld
0
repo-O1-ld
0
repo-O1-lld
0
repo-O2-ld
1
repo-O2-lld
0
repo-O3-ld
1
repo-O3-lld
0

repo.sh is:

for LEVEL in "O0" "O1" "O2" "O3"
do
    for LINKER in "ld" "lld"
    do
        STEM="repo-$LEVEL-$LINKER"
        echo $STEM
        $CC -fuse-ld=$LINKER -$LEVEL repo.c -o $STEM
        $CC -$LEVEL repo.c -o $STEM.s -S
        $CC -$LEVEL repo.c -o $STEM.ll -S -emit-llvm
        qemu-riscv64 $STEM
        echo $?
    done
done
exit 0

That paints a pretty clear story that it's only the high alignment on LD which causes the wrong result here. This could be a linker bug (possibly downstream!) or an odd interaction with relocation semantics in some non-obvious (to me) way.

llvm / llvm-project

RISCV64 miscompile at -O2 #80792