rr-debugger / rr

Record and Replay Framework
http://rr-project.org/
Other
9.16k stars 587 forks source link

Intermittent failures in tests involving detaching #3597

Closed rocallahan closed 1 year ago

rocallahan commented 1 year ago

These happen in CI, e.g. https://buildkite.com/julialang/rr/builds/1367#018a6c18-26e2-4496-8eb5-3a7c0e12dcc8

I believe I can also reproduce the same or a similar issue in a test VM. E.g. I run

for i in `seq 1 1000`; do bash nested_detach.run & done

and usually 1-4 of these jobs will hang.

In the hanging state, the new task created by do_detach_teleport is stuck in a signal-stop:

ubuntu@ip-172-31-34-64:~/rr/src/test$ sudo cat /proc/385000/stack
[<0>] do_signal_stop+0xfa/0x250
[<0>] get_signal+0x4f8/0x900
[<0>] arch_do_signal_or_restart+0xf7/0x290
[<0>] exit_to_user_mode_prepare+0x130/0x1c0
[<0>] syscall_exit_to_user_mode+0x27/0x50
[<0>] do_syscall_64+0x69/0xc0
[<0>] entry_SYSCALL_64_after_hwframe+0x61/0xcb
sudo perf record -c 1 --call-graph dwarf -e signal:signal_generate
...
sudo perf script |grep 385000

gives

:385000 385000 [005]  3350.998085: signal:signal_generate: sig=19 errno=0 code=0 comm=rr pid=385000 grp=1 res=0
:385000 385000 [005]  3350.998099: signal:signal_generate: sig=17 errno=0 code=5 comm=rr pid=373610 grp=1 res=0

i.e. that task sent SIGSTOP to itself (which makes sense) but we don't see any other signal being sent to it. The perf record has lost some events so it's not entirely reliable though.

rocallahan commented 1 year ago

I do not see any errors in record.err other than timeout. The trace looks complete; it ends after rr_detach_teleport:

{
  real_time:3351.467504 global_time:255, event:`SYSCALL: rrcall_detach_teleport' (state:ENTERING_SYSCALL) tid:376689, ticks:131445
rax:0xffffffffffffffda rbx:0x681fffa0 rcx:0xffffffffffffffff rdx:0x0 rsi:0x0 rdi:0x0 rbp:0x681fffa0 rsp:0x681ffd40 r8:0x0 r9:0x0 r10:0x0 r11:0x246 r12:0x564618afcb70 r13:0x7ffd9d5d7390 r14:0x0 r15:0x0 rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0x3f1 fs_base:0x7f3de3ce6740 gs_base:0x0
}
{
  real_time:3351.467796 global_time:256, event:`EXIT' tid:376689, ticks:131445
}

and rr is waiting for the detached task to finish:

Thread 1 (Thread 0x7f05c50c4740 (LWP 373610)):
#0  0x00007f05c51abd39 in __waitid (idtype=P_ALL, id=4294967295, infop=0x7ffd9a023d90, options=1073741830) at ../sysdeps/unix/sysv/linux/waitid.c:29
#1  0x0000561d0a57f98e in rr::WaitState::do_wait (this=0x561d0a8cdfa0 <rr::wait_state()::static_state>, tid=-1, consume=true, type=6, block_seconds=315360000, status=...) at /home/ubuntu/rr/src/WaitManager.cc:58
#2  0x0000561d0a57fe2f in rr::WaitState::wait (this=0x561d0a8cdfa0 <rr::wait_state()::static_state>, options=..., type=6) at /home/ubuntu/rr/src/WaitManager.cc:110
#3  0x0000561d0a580591 in rr::WaitManager::wait_stop_or_exit (options=...) at /home/ubuntu/rr/src/WaitManager.cc:214
#4  0x0000561d0a4e73fb in rr::wait_any (tid=@0x7ffd9a024060: 385000, status=..., timeout=0) at /home/ubuntu/rr/src/Scheduler.cc:528
#5  0x0000561d0a4e8fae in rr::Scheduler::reschedule (this=0x561d0ae35220, switchable=rr::ALLOW_SWITCH) at /home/ubuntu/rr/src/Scheduler.cc:870
#6  0x0000561d0a3f07a8 in rr::RecordSession::record_step (this=0x561d0ae34fd0) at /home/ubuntu/rr/src/RecordSession.cc:2497
#7  0x0000561d0a3e3085 in rr::record (args=..., flags=...) at /home/ubuntu/rr/src/RecordCommand.cc:697
#8  0x0000561d0a3e3e71 in rr::RecordCommand::run (this=0x561d0a8cc0b0 <rr::RecordCommand::singleton>, args=...) at /home/ubuntu/rr/src/RecordCommand.cc:860
#9  0x0000561d0a3a2fe8 in main (argc=11, argv=0x7ffd9a024568) at /home/ubuntu/rr/src/main.cc:278

I don't see how do_detach_teleport can complete successfully without the new task leaving its SIGSTOP. We have to do an exec in it for example.

rocallahan commented 1 year ago

Logging shows that the detached task is in the signal-stop after do_detach_teleport has completed new_t->dup_from(t).

rocallahan commented 1 year ago

Sending SIGCONT to the detached task after PTRACE_DETACHing it fixes the problem. But I'd like to know how the task manages to be in the stopped state after we've done all the remote syscalls to set it up.

rocallahan commented 1 year ago

OK this is 100% reproducible if I insert sleep(1); in Task::spawn before we call ptrace_seize(), so that the task SIGSTOPs itself before we take ptrace control, i.e. it goes into group-stop state.

I still expect the PTRACE_SYSCALL etc that we do to exec the spawned task would clear the group-stop, but apparently it doesn't.

rocallahan commented 1 year ago

Yeah, it's pretty clear SIGNAL_STOP_STOPPED remains pending in the task until a SIGCONT happens. The group stop is just not effective while the task is ptraced.