rr-debugger / rr

Record and Replay Framework
http://rr-project.org/
Other
9.09k stars 580 forks source link

CI: rseq-32 test intermittently does not finish #3118

Open Keno opened 2 years ago

Keno commented 2 years ago

E.g. https://buildkite.com/julialang/rr/builds/220#7dd3d2e4-d59f-4d14-a731-24ccf24ff78e

@DilumAluthge On timeout, can we SIGTERM all running processes and collect the TMPDIR, like we do for Julia CI?

DilumAluthge commented 2 years ago

On timeout, can we SIGTERM all running processes and collect the TMPDIR, like we do for Julia CI?

3121 (but I want to test that PR a bit more before merging it)

Keno commented 2 years ago

We saw several of these in #3117 - assertion is:

[FATAL /cache/build/exclusive-amdci1-0/julialang/rr/src/PerfCounters.cc:630:read_ticks()]
 (task 47041 (rec:65972) at time 355)
 -> Assertion `!counting_period || interrupt_val <= adjusted_counting_period' failed to hold. Detected 18232731 ticks, expected no more than 18231119
Tail of trace dump:

Though the first order of business is probably to figure out why this isn't triggering the usual failure mechanism with its own internal timeout, but instead running into the buildkite timeout.

Keno commented 2 years ago

The end of debug.err is:

====== gdb '-l' '10000' '-ex' 'set sysroot /' '-ex' 'target extended-remote 127.0.0.1:47041' -ex 'set confirm off' -ex 'set height 0' -ex 'info registers' -ex 'thread apply all bt' -ex q /cache/build/exclusive-amdci1-0/julialang/rr/obj/my_temp_parent_dir/jl_ey3nuQ/rr-test-rseq-vUumlAtPj/rseq-vUumlAtPj-0/mmap_hardlink_4_rseq-vUumlAtPj
 </dev/null 2>&1
GNU gdb (Debian 10.1-2) 10.1.90.20210103-git
Copyright (C) 2021 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /cache/build/exclusive-amdci1-0/julialang/rr/obj/my_temp_parent_dir/jl_ey3nuQ/rr-test-rseq-vUumlAtPj/rseq-vUumlAtPj-0/mmap_hardlink_4_rseq-vUumlAtPj...

so it never actually ran those commands.

Interestingly the gdb_cmd that got tar'd up with the TMPDIR has a different port:

gdb '-l' '10000' '-ex' 'set sysroot /' '-ex' 'target extended-remote 127.0.0.1:33273' /cache/build/exclusive-amdci1-0/julialang/rr/obj/my_temp_parent_dir/jl_ey3nuQ/rr-test-rseq-vUumlAtPj/rseq-vUumlAtPj-0/mmap_hardlink_4_rseq-vUumlAtPj

Not sure what happened there.

rocallahan commented 2 years ago

E.g. https://buildkite.com/julialang/rr/builds/220#7dd3d2e4-d59f-4d14-a731-24ccf24ff78e

Was this the correct link? There I see

2747/2761 Test #2448: rseq-32 ...........................................................   Passed   58.58 sec
Keno commented 2 years ago

Was this the correct link? There I see

I believe there it was rseq-32-no-syscallbuf.

The log I was looking at was from build #26 of the new run though, because we failed to capture the logs the first time around: https://buildkite.com/julialang/rr/builds/239#a19e09e6-5670-4a0b-9d71-3bf90f1dcbf8

In this case, the failing test is the non-32 variant, so it seems to be a general issue with rseq.

rocallahan commented 2 years ago
[FATAL /cache/build/exclusive-amdci1-0/julialang/rr/src/PerfCounters.cc:630:read_ticks()] 
 (task 47041 (rec:65972) at time 355)
 -> Assertion `!counting_period || interrupt_val <= adjusted_counting_period' failed to hold. Detected 18232731 ticks, expected no more than 18231119
Tail of trace dump:

but it's not obvious to me why this would happen here. Event 355 is

{
  real_time:45370664.294208 global_time:355, event:`SCHED' tid:65972, ticks:728323976
rax:0x7fdd56fcc000 rbx:0x55968a9009b0 rcx:0x1 rdx:0x55968a903100 rsi:0x55968a901034 rdi:0x1 rbp:0x7fdd56cd5fd0 rsp:0x7fdd56cd5f90 r8:0x0 r9:0x7fdd57012040 r10:0x7fdd56dd6b80 r11:0x7fdd57013050 r12:0x55968a900150 r13:0x0 r14:0x0 r15:0x0 rip:0x55968a900732 eflags:0x297 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xffffffffffffff13 fs_base:0x7fdd56dd6b80 gs_base:0x0
}

That rip is at 1732 in main_child, here:

    1722:       e8 e4 fc ff ff          call   140b <do_section>
    1727:       83 45 fc 01             addl   $0x1,-0x4(%rbp)
    172b:       81 7d fc ff a2 e1 11    cmpl   $0x11e1a2ff,-0x4(%rbp)
    1732:       7e ee                   jle    1722 <main_child+0x1da>
rocallahan commented 2 years ago

i.e. this SCHED is not in the rseq critical section so I don't know why this would trigger anything abnormal.

rocallahan commented 2 years ago

I think f2faffc50d351b14718e47abffd4ac6f75b61ea3 should have fixed this.

rocallahan commented 2 years ago

Er no, not this assertion failure.