Closed rocallahan closed 1 year ago
I do not see any errors in record.err
other than timeout. The trace looks complete; it ends after rr_detach_teleport
:
{
real_time:3351.467504 global_time:255, event:`SYSCALL: rrcall_detach_teleport' (state:ENTERING_SYSCALL) tid:376689, ticks:131445
rax:0xffffffffffffffda rbx:0x681fffa0 rcx:0xffffffffffffffff rdx:0x0 rsi:0x0 rdi:0x0 rbp:0x681fffa0 rsp:0x681ffd40 r8:0x0 r9:0x0 r10:0x0 r11:0x246 r12:0x564618afcb70 r13:0x7ffd9d5d7390 r14:0x0 r15:0x0 rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0x3f1 fs_base:0x7f3de3ce6740 gs_base:0x0
}
{
real_time:3351.467796 global_time:256, event:`EXIT' tid:376689, ticks:131445
}
and rr is waiting for the detached task to finish:
Thread 1 (Thread 0x7f05c50c4740 (LWP 373610)):
#0 0x00007f05c51abd39 in __waitid (idtype=P_ALL, id=4294967295, infop=0x7ffd9a023d90, options=1073741830) at ../sysdeps/unix/sysv/linux/waitid.c:29
#1 0x0000561d0a57f98e in rr::WaitState::do_wait (this=0x561d0a8cdfa0 <rr::wait_state()::static_state>, tid=-1, consume=true, type=6, block_seconds=315360000, status=...) at /home/ubuntu/rr/src/WaitManager.cc:58
#2 0x0000561d0a57fe2f in rr::WaitState::wait (this=0x561d0a8cdfa0 <rr::wait_state()::static_state>, options=..., type=6) at /home/ubuntu/rr/src/WaitManager.cc:110
#3 0x0000561d0a580591 in rr::WaitManager::wait_stop_or_exit (options=...) at /home/ubuntu/rr/src/WaitManager.cc:214
#4 0x0000561d0a4e73fb in rr::wait_any (tid=@0x7ffd9a024060: 385000, status=..., timeout=0) at /home/ubuntu/rr/src/Scheduler.cc:528
#5 0x0000561d0a4e8fae in rr::Scheduler::reschedule (this=0x561d0ae35220, switchable=rr::ALLOW_SWITCH) at /home/ubuntu/rr/src/Scheduler.cc:870
#6 0x0000561d0a3f07a8 in rr::RecordSession::record_step (this=0x561d0ae34fd0) at /home/ubuntu/rr/src/RecordSession.cc:2497
#7 0x0000561d0a3e3085 in rr::record (args=..., flags=...) at /home/ubuntu/rr/src/RecordCommand.cc:697
#8 0x0000561d0a3e3e71 in rr::RecordCommand::run (this=0x561d0a8cc0b0 <rr::RecordCommand::singleton>, args=...) at /home/ubuntu/rr/src/RecordCommand.cc:860
#9 0x0000561d0a3a2fe8 in main (argc=11, argv=0x7ffd9a024568) at /home/ubuntu/rr/src/main.cc:278
I don't see how do_detach_teleport
can complete successfully without the new task leaving its SIGSTOP
. We have to do an exec in it for example.
Logging shows that the detached task is in the signal-stop after do_detach_teleport
has completed new_t->dup_from(t)
.
Sending SIGCONT
to the detached task after PTRACE_DETACH
ing it fixes the problem. But I'd like to know how the task manages to be in the stopped state after we've done all the remote syscalls to set it up.
OK this is 100% reproducible if I insert sleep(1);
in Task::spawn
before we call ptrace_seize()
, so that the task SIGSTOP
s itself before we take ptrace control, i.e. it goes into group-stop state.
I still expect the PTRACE_SYSCALL
etc that we do to exec the spawned task would clear the group-stop, but apparently it doesn't.
Yeah, it's pretty clear SIGNAL_STOP_STOPPED
remains pending in the task until a SIGCONT
happens. The group stop is just not effective while the task is ptraced.
These happen in CI, e.g. https://buildkite.com/julialang/rr/builds/1367#018a6c18-26e2-4496-8eb5-3a7c0e12dcc8
I believe I can also reproduce the same or a similar issue in a test VM. E.g. I run
and usually 1-4 of these jobs will hang.
In the hanging state, the new task created by
do_detach_teleport
is stuck in a signal-stop:gives
i.e. that task sent SIGSTOP to itself (which makes sense) but we don't see any other signal being sent to it. The
perf record
has lost some events so it's not entirely reliable though.