loongson-community / discussions

Cross-community issue tracker & discussions

Heisenbug likely in kernel: GCC tests builtin-fp-int-inexact.c and builtin-fp-int-inexact-c2x.c intermittently fail #7

Closed: xry111 closed this 7 months ago

xry111 commented 9 months ago

When bootstrapping and regtesting GCC on Gentoo, the tests builtin-fp-int-inexact.c and builtin-fp-int-inexact-c2x.c intermittently fail.

The log is not very helpful:

PASS: gcc.dg/torture/builtin-fp-int-inexact.c   -O3 -g  (test for excess errors)
Setting LD_LIBRARY_PATH to :/home/xry111/git-repos/gcc-build/gcc:/home/xry111/git-repos/gcc-build/loongarch64-unknown-linux-gnu/./libatomic/.libs::/home/xry111/git-repos/gcc-build/gcc:/home/xry111/git-repos/gcc-build/loongarch64-unknown-linux-gnu/./libatomic/.libs:/home/xry111/git-repos/gcc-build/loongarch64-unknown-linux-gnu/libstdc++-v3/src/.libs:/home/xry111/git-repos/gcc-build/loongarch64-unknown-linux-gnu/libsanitizer/.libs:/home/xry111/git-repos/gcc-build/loongarch64-unknown-linux-gnu/libvtv/.libs:/home/xry111/git-repos/gcc-build/loongarch64-unknown-linux-gnu/libssp/.libs:/home/xry111/git-repos/gcc-build/loongarch64-unknown-linux-gnu/libgomp/.libs:/home/xry111/git-repos/gcc-build/loongarch64-unknown-linux-gnu/libitm/.libs:/home/xry111/git-repos/gcc-build/loongarch64-unknown-linux-gnu/libatomic/.libs:/home/xry111/git-repos/gcc-build/./gcc:/home/xry111/git-repos/gcc-build/./prev-gcc:/home/xry111/git-repos/gcc-build/loongarch64-unknown-linux-gnu/libstdc++-v3/src/.libs:/home/xry111/git-repos/gcc-build/loongarch64-unknown-linux-gnu/libsanitizer/.libs:/home/xry111/git-repos/gcc-build/loongarch64-unknown-linux-gnu/libvtv/.libs:/home/xry111/git-repos/gcc-build/loongarch64-unknown-linux-gnu/libssp/.libs:/home/xry111/git-repos/gcc-build/loongarch64-unknown-linux-gnu/libgomp/.libs:/home/xry111/git-repos/gcc-build/loongarch64-unknown-linux-gnu/libitm/.libs:/home/xry111/git-repos/gcc-build/loongarch64-unknown-linux-gnu/libatomic/.libs:/home/xry111/git-repos/gcc-build/./gcc:/home/xry111/git-repos/gcc-build/./prev-gcc
Execution timeout is: 300
spawn [open ...]
FAIL: gcc.dg/torture/builtin-fp-int-inexact.c   -O3 -g  execution test

I don't believe this is a GCC bug (with 99% confidence), so I'm creating an issue here instead of in GCC Bugzilla.

xen0n commented 9 months ago

Edited the issue title for you (it's named after a different scientist in English expression, interestingly ;-)

xry111 commented 9 months ago

The coredump contains FCSR = 0x10000, which indicates an inexact exception, and this test is designed to abort on any inexact exception. I'm wondering if the kernel fails to restore FCSR during a context switch under some rare condition...

xry111 commented 9 months ago

There seems to be a general stability issue on the Gentoo dev machine. Today I've seen another non-reproducible test failure in libstdc++, which should be completely unrelated to FP, and a non-reproducible build failure where the assembler fails to assemble the generated .s file in /tmp (the file seems truncated for some reason).

xry111 commented 9 months ago

@xen0n could we try the latest stable release of the kernel? Currently we have "6.6.0-rc3-next-20230928", which is both somewhat outdated and an unstable release.

xry111 commented 8 months ago

Still happening with 6.6.7-gentoo-dist.

xry111 commented 7 months ago

Happy new year folks.

I can reproduce the builtin-fp-int-inexact-c2x.c failure with a high probability by running a loop

while true; do echo xx; ./builtin-fp-int-inexact-c2x.exe || break; done

when "some parts" (I assume they perform some FP rounding operations) of the GCC test suite are running in the background.

The loop breaks within seconds (otherwise the background processes are not performing enough FP rounding operations, I guess). Sometimes only about ten lines of "xx" are output.

FWIW the kernel is 6.7.0-rc7+.

xry111 commented 7 months ago

How to reproduce:

$ cat noise.c
int main()
{
    while (1) {
        volatile float x = 114.514;
        volatile int y = x;
    }
}
$ cc noise.c -O2 -o noise
$ cat measure.c
#define _GNU_SOURCE
#include <fenv.h>
#include <stdio.h>

int main()
{
    return fetestexcept(FE_INEXACT) || fetestexcept(FE_INEXACT);
}
$ cc measure.c -O2 -o measure -lm
$ ./noise & while ./measure; do echo ok; done

The while loop will eventually stop, indicating that FE_INEXACT has come out of nowhere. After that, remember to run killall noise!

An interesting aspect: if I only call fetestexcept(FE_INEXACT) once in measure.c, I cannot reproduce the issue at all.

xry111 commented 7 months ago

And phew, it still reproduces even w/o "noise" running with some good (or bad?) luck.

xry111 commented 7 months ago

To rule out Glibc I've rewritten measure in assembly:

.globl _start
_start:
    .align      4
    movfcsr2gr  $a0, $fcsr0
    bstrpick.w  $a0, $a0, 16, 16
    li.w        $a7, 93
    syscall     0

and the issue still reproduces.

xry111 commented 7 months ago

Whoa. This seems to happen when execve'ing an executable from a process where FCSR is already dirty. When I run the loop

while ./measure; do; done

(measure assembled from the assembly above) in a "fresh" shell with FCSR = 0 (proven via gdb attach), the issue does not reproduce. But when I run the loop in a shell with FCSR = 0x10000 (i.e. an inexact exception has already happened in this shell; again proven via gdb attach), it reproduces easily. From a clean shell it also reproduces with a wrapper that dirties FCSR first and then execve's the "measure" program:

#include <unistd.h>

int main()
{
    volatile double x = 114.514;
    volatile int y = x;
    execl("./measure", "measure", NULL);
    __builtin_abort();
}

xry111 commented 7 months ago

It seems we are relying on SET_PERSONALITY2 to clear FCSR on execve. I'd say this is highly suspicious, and it seems only MIPS and LoongArch clear FCSR on execve this way.

xry111 commented 7 months ago

I've made this:

diff --git a/arch/loongarch/kernel/process.c b/arch/loongarch/kernel/process.c
index 767d94cce0de..caed58770650 100644
--- a/arch/loongarch/kernel/process.c
+++ b/arch/loongarch/kernel/process.c
@@ -92,6 +92,7 @@ void start_thread(struct pt_regs *regs, unsigned long pc, unsigned long sp)
        clear_used_math();
        regs->csr_era = pc;
        regs->regs[3] = sp;
+       current->thread.fpu.fcsr = 0;
 }

 void flush_thread(void)

With this line, the command

echo $((1.0/3)); while ./measure; do ; done

survived five minutes; without it, the same command stopped within seconds.

I'm not sure if this is a proper fix, or just reducing the probability of a dirty FCSR after execve, or just papering over the real issue. But to me we have enough evidence to say this is a kernel bug.

xry111 commented 7 months ago

http://lore.kernel.org/loongarch/20240101172143.14530-2-xry111@xry111.site/

xry111 commented 7 months ago

It seems we are relying on SET_PERSONALITY2 to clear FCSR on execve. I'd say this is highly suspicious and it seems ~no other architectures (even MIPS) clears~ only MIPS and LoongArch clear FCSR on execve this way.

As I guessed, MIPS is buggy too :(.

xry111 commented 7 months ago

Fixed for mainline kernel.