Open Keno opened 2 years ago
On timeout, can we SIGTERM all running processes and collect the TMPDIR, like we do for Julia CI?
We saw several of these in #3117 - assertion is:
[FATAL /cache/build/exclusive-amdci1-0/julialang/rr/src/PerfCounters.cc:630:read_ticks()]
(task 47041 (rec:65972) at time 355)
-> Assertion `!counting_period || interrupt_val <= adjusted_counting_period' failed to hold. Detected 18232731 ticks, expected no more than 18231119
Tail of trace dump:
Though the first order of business is probably to figure out why this isn't triggering the usual failure mechanism with its own internal timeout, but instead running into the buildkite timeout.
The end of debug.err is:
====== gdb '-l' '10000' '-ex' 'set sysroot /' '-ex' 'target extended-remote 127.0.0.1:47041' -ex 'set confirm off' -ex 'set height 0' -ex 'info registers' -ex 'thread apply all bt' -ex q /cache/build/exclusive-amdci1-0/julialang/rr/obj/my_temp_parent_dir/jl_ey3nuQ/rr-test-rseq-vUumlAtPj/rseq-vUumlAtPj-0/mmap_hardlink_4_rseq-vUumlAtPj
</dev/null 2>&1
GNU gdb (Debian 10.1-2) 10.1.90.20210103-git
Copyright (C) 2021 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /cache/build/exclusive-amdci1-0/julialang/rr/obj/my_temp_parent_dir/jl_ey3nuQ/rr-test-rseq-vUumlAtPj/rseq-vUumlAtPj-0/mmap_hardlink_4_rseq-vUumlAtPj...
so it never actually ran those commands.
Interestingly the gdb_cmd that got tar'd up with the TMPDIR has a different port:
gdb '-l' '10000' '-ex' 'set sysroot /' '-ex' 'target extended-remote 127.0.0.1:33273' /cache/build/exclusive-amdci1-0/julialang/rr/obj/my_temp_parent_dir/jl_ey3nuQ/rr-test-rseq-vUumlAtPj/rseq-vUumlAtPj-0/mmap_hardlink_4_rseq-vUumlAtPj
Not sure what happened there.
E.g. https://buildkite.com/julialang/rr/builds/220#7dd3d2e4-d59f-4d14-a731-24ccf24ff78e
Was this the correct link? There I see
2747/2761 Test #2448: rseq-32 ........................................................... Passed 58.58 sec
Was this the correct link? There I see
I believe there it was rseq-32-no-syscallbuf
.
The log I was looking at was from build #26 of the new run though, because we failed to capture the logs the first time around: https://buildkite.com/julialang/rr/builds/239#a19e09e6-5670-4a0b-9d71-3bf90f1dcbf8
In this case, the failing test is the non-32 variant, so it seems to be a general issue with rseq
.
[FATAL /cache/build/exclusive-amdci1-0/julialang/rr/src/PerfCounters.cc:630:read_ticks()]
(task 47041 (rec:65972) at time 355)
-> Assertion `!counting_period || interrupt_val <= adjusted_counting_period' failed to hold. Detected 18232731 ticks, expected no more than 18231119
Tail of trace dump:
but it's not obvious to me why this would happen here. Event 355 is
{
real_time:45370664.294208 global_time:355, event:`SCHED' tid:65972, ticks:728323976
rax:0x7fdd56fcc000 rbx:0x55968a9009b0 rcx:0x1 rdx:0x55968a903100 rsi:0x55968a901034 rdi:0x1 rbp:0x7fdd56cd5fd0 rsp:0x7fdd56cd5f90 r8:0x0 r9:0x7fdd57012040 r10:0x7fdd56dd6b80 r11:0x7fdd57013050 r12:0x55968a900150 r13:0x0 r14:0x0 r15:0x0 rip:0x55968a900732 eflags:0x297 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xffffffffffffff13 fs_base:0x7fdd56dd6b80 gs_base:0x0
}
That rip
is at 1732 in main_child
, here:
1722: e8 e4 fc ff ff call 140b <do_section>
1727: 83 45 fc 01 addl $0x1,-0x4(%rbp)
172b: 81 7d fc ff a2 e1 11 cmpl $0x11e1a2ff,-0x4(%rbp)
1732: 7e ee jle 1722 <main_child+0x1da>
i.e. this SCHED
is not in the rseq critical section so I don't know why this would trigger anything abnormal.
I think f2faffc50d351b14718e47abffd4ac6f75b61ea3 should have fixed this.
Er no, not this assertion failure.
E.g. https://buildkite.com/julialang/rr/builds/220#7dd3d2e4-d59f-4d14-a731-24ccf24ff78e
@DilumAluthge On timeout, can we SIGTERM all running processes and collect the TMPDIR, like we do for Julia CI?