rr-debugger / rr

Record and Replay Framework
http://rr-project.org/
Other
9.08k stars 578 forks source link

Intermittent failure of `nested_detach_wait` #3564

Closed rocallahan closed 1 year ago

rocallahan commented 1 year ago

See https://buildkite.com/julialang/rr/builds/1276#0189659b-e196-4e59-b7f9-759015ae4e07

Test 'nested_detach_wait' FAILED: : error during recording:
--------------------------------------------------
[FATAL src/Task.cc:2133:did_waitpid()]
 (task 249677 (rec:249847) at time 482)
 -> Assertion `!registers_dirty' failed to hold. Registers shouldn't already be dirty (status is 0x137f (STOP-SIGSTOP))
Tail of trace dump:
{
  real_time:7119504.405343 global_time:462, event:`SYSCALLBUF_RESET' tid:249677, ticks:701720
}
{
  real_time:7119504.405494 global_time:463, event:`SYSCALL: rrcall_check_presence' (state:ENTERING_SYSCALL) tid:249677, ticks:701729
rax:0xffffffffffffffda rbx:0x681fffa0 rcx:0xffffffffffffffff rdx:0x0 rsi:0x0 rdi:0x0 rbp:0x681fffa0 rsp:0x681ffdc0 r8:0x0 r9:0x0 r10:0x0 r11:0x246 r12:0x5 r13:0x7ffc550e7428 r14:0x7ffc550e7458 r15:0x5645f5d59c08 rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0x3f0 fs_base:0x7f374a0d9740 gs_base:0x0
}
{
  real_time:7119504.406159 global_time:464, event:`SYSCALL: rrcall_check_presence' (state:EXITING_SYSCALL) tid:249677, ticks:701729
rax:0x0 rbx:0x681fffa0 rcx:0xffffffffffffffff rdx:0x0 rsi:0x0 rdi:0x0 rbp:0x681fffa0 rsp:0x681ffdc0 r8:0x0 r9:0x0 r10:0x0 r11:0x246 r12:0x5 r13:0x7ffc550e7428 r14:0x7ffc550e7458 r15:0x5645f5d59c08 rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0x3f0 fs_base:0x7f374a0d9740 gs_base:0x0
}
{
  real_time:7119504.406666 global_time:465, event:`PATCH_SYSCALL' tid:249677, ticks:708408
rax:0x101 rbx:0x7ffc550e90f0 rcx:0xffffffffffffffff rdx:0x0 rsi:0x5645f5be5bac rdi:0xffffff9c rbp:0x5645f5be5bac rsp:0x7ffc550e7180 r8:0x0 r9:0x7f374a2acc00 r10:0x0 r11:0x246 r12:0x0 r13:0x0 r14:0x0 r15:0x0 rip:0x7f374a6d5825 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xffffffffffffffff fs_base:0x7f374a0d9740 gs_base:0x0
  { tid:249677, addr:0x7f374a7bb0bc, length:0x5e }
  { tid:249677, addr:0x7f374a6d5825, length:0x8 }
}
{
  real_time:7119504.406907 global_time:466, event:`SYSCALLBUF_FLUSH' tid:249677, ticks:708904
  { syscall:'openat', ret:0x7, size:0x10, desched:1 }
  { syscall:'readlinkat', ret:0xc, size:0x1c }
}
{
  real_time:7119504.407019 global_time:467, event:`PATCH_SYSCALL' tid:249677, ticks:708904
rax:0x0 rbx:0x7ffc550e90f0 rcx:0xffffffffffffffff rdx:0x4 rsi:0x7ffc550e728c rdi:0x7 rbp:0x7ffc550e7270 rsp:0x7ffc550e7228 r8:0x0 r9:0x7f374a2acc00 r10:0x5645f5792a57 r11:0x246 r12:0x5645f58b6750 r13:0x0 r14:0x0 r15:0x0 rip:0x7f374a6d510c eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xffffffffffffffff fs_base:0x7f374a0d9740 gs_base:0x0
  { tid:249677, addr:0x7f374a7bb11a, length:0x5e }
  { tid:249677, addr:0x7f374a6d510c, length:0x8 }
}
{
  real_time:7119504.407039 global_time:468, event:`SYSCALLBUF_RESET' tid:249677, ticks:708904
}
{
  real_time:7119504.407214 global_time:469, event:`SYSCALLBUF_FLUSH' tid:249677, ticks:709040
  { syscall:'read', ret:0x4, size:0x14, desched:1 }
}
{
  real_time:7119504.407324 global_time:470, event:`PATCH_SYSCALL' tid:249677, ticks:709040
rax:0x3 rbx:0x7ffc550e90f0 rcx:0xffffffffffffffff rdx:0x4 rsi:0x7ffc550e728c rdi:0x7 rbp:0x7ffc550e7200 rsp:0x7ffc550e71c8 r8:0x0 r9:0x7f374a2acc00 r10:0x7f374a7f8670 r11:0x246 r12:0x5645f58b6750 r13:0x0 r14:0x0 r15:0x0 rip:0x7f374a6d51b1 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xffffffffffffffff fs_base:0x7f374a0d9740 gs_base:0x0
  { tid:249677, addr:0x7f374a7bb178, length:0x5e }
  { tid:249677, addr:0x7f374a6d51b1, length:0x8 }
}
{
  real_time:7119504.407344 global_time:471, event:`SYSCALLBUF_RESET' tid:249677, ticks:709040
}
{
  real_time:7119504.407544 global_time:472, event:`SYSCALLBUF_FLUSH' tid:249677, ticks:711419
  { syscall:'close', ret:0x0, size:0x10 }
}
{
  real_time:7119504.407662 global_time:473, event:`PATCH_SYSCALL' tid:249677, ticks:711419
rax:0x12e rbx:0x7ffc550e90f0 rcx:0xffffffffffffffff rdx:0x0 rsi:0x7 rdi:0x0 rbp:0x7ffc550e7290 rsp:0x7ffc550e7248 r8:0x7f374a2ac1c4 r9:0xaacc6754 r10:0x5645f5d61b50 r11:0x246 r12:0x5645f58b6750 r13:0x0 r14:0x0 r15:0x0 rip:0x7f374a1cddfe eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xffffffffffffffff fs_base:0x7f374a0d9740 gs_base:0x0
  { tid:249677, addr:0x7f374a7bb1d6, length:0x5e }
  { tid:249677, addr:0x7f374a1cddfe, length:0x8 }
}
{
  real_time:7119504.407680 global_time:474, event:`SYSCALLBUF_RESET' tid:249677, ticks:711419
}
{
  real_time:7119504.408338 global_time:475, event:`SYSCALL: prlimit64' (state:ENTERING_SYSCALL) tid:249677, ticks:711429
rax:0xffffffffffffffda rbx:0x681fffa0 rcx:0xffffffffffffffff rdx:0x0 rsi:0x7 rdi:0x0 rbp:0x681fffa0 rsp:0x681ffdc0 r8:0x7f374a2ac1c4 r9:0xaacc6754 r10:0x5645f5d61b50 r11:0x246 r12:0x5645f58b6750 r13:0x0 r14:0x0 r15:0x0 rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0x12e fs_base:0x7f374a0d9740 gs_base:0x0
}
{
  real_time:7119504.409072 global_time:476, event:`SYSCALL: prlimit64' (state:EXITING_SYSCALL) tid:249677, ticks:711429
rax:0x0 rbx:0x681fffa0 rcx:0xffffffffffffffff rdx:0x0 rsi:0x7 rdi:0x0 rbp:0x681fffa0 rsp:0x681ffdc0 r8:0x7f374a2ac1c4 r9:0xaacc6754 r10:0x5645f5d61b50 r11:0x246 r12:0x5645f58b6750 r13:0x0 r14:0x0 r15:0x0 rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0x12e fs_base:0x7f374a0d9740 gs_base:0x0
  { tid:249677, addr:0x5645f5d61b50, length:0x10 }
}
{
  real_time:7119504.409399 global_time:477, event:`PATCH_SYSCALL' tid:249677, ticks:711592
rax:0x12e rbx:0x7ffc550e90f0 rcx:0xffffffffffffffff rdx:0x7ffc550e7250 rsi:0x7 rdi:0x0 rbp:0x7ffc550e7290 rsp:0x7ffc550e7248 r8:0x7f374a2ac1c4 r9:0xaacc6754 r10:0x0 r11:0x246 r12:0x5645f58b6750 r13:0x0 r14:0x0 r15:0x0 rip:0x7f374a1cde3f eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xffffffffffffffff fs_base:0x7f374a0d9740 gs_base:0x0
  { tid:249677, addr:0x7f374a7bb234, length:0x5e }
  { tid:249677, addr:0x7f374a1cde3f, length:0x8 }
}
{
  real_time:7119504.409557 global_time:478, event:`SYSCALL: prlimit64' (state:ENTERING_SYSCALL) tid:249677, ticks:711602
rax:0xffffffffffffffda rbx:0x681fffa0 rcx:0xffffffffffffffff rdx:0x7ffc550e7250 rsi:0x7 rdi:0x0 rbp:0x681fffa0 rsp:0x681ffdc0 r8:0x7f374a2ac1c4 r9:0xaacc6754 r10:0x0 r11:0x246 r12:0x5645f58b6750 r13:0x0 r14:0x0 r15:0x0 rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0x12e fs_base:0x7f374a0d9740 gs_base:0x0
}
{
  real_time:7119504.410234 global_time:479, event:`SYSCALL: prlimit64' (state:EXITING_SYSCALL) tid:249677, ticks:711602
rax:0x0 rbx:0x681fffa0 rcx:0xffffffffffffffff rdx:0x7ffc550e7250 rsi:0x7 rdi:0x0 rbp:0x681fffa0 rsp:0x681ffdc0 r8:0x7f374a2ac1c4 r9:0xaacc6754 r10:0x0 r11:0x246 r12:0x5645f58b6750 r13:0x0 r14:0x0 r15:0x0 rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0x12e fs_base:0x7f374a0d9740 gs_base:0x0
}
{
  real_time:7119504.485541 global_time:480, event:`SYSCALL: rrcall_detach_teleport' (state:ENTERING_SYSCALL) tid:249677, ticks:714841
rax:0xffffffffffffffda rbx:0x681fffa0 rcx:0xffffffffffffffff rdx:0x0 rsi:0x0 rdi:0x0 rbp:0x681fffa0 rsp:0x681ffdc0 r8:0x0 r9:0x0 r10:0x0 r11:0x246 r12:0x5645f58b6750 r13:0x0 r14:0x0 r15:0x0 rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0x3f1 fs_base:0x7f374a0d9740 gs_base:0x0
}
{
  real_time:7119504.485949 global_time:481, event:`EXIT' tid:249677, ticks:714841
}
=== Start rr backtrace:
rr(_ZN2rr13dump_rr_stackEv+0x38)[0x558563678563]
rr(_ZN2rr9GdbServer15emergency_debugEPNS_4TaskE+0xe8)[0x55856348b6b8]
rr(+0x40c415)[0x5585634c0415]
rr(_ZN2rr21EmergencyDebugOstreamD1Ev+0x5f)[0x5585634c0655]
rr(_ZN2rr4Task11did_waitpidENS_10WaitStatusE+0x4cd)[0x55856363894f]
rr(+0x541f4f)[0x5585635f5f4f]
rr(_ZN2rr9Scheduler10rescheduleENS_10SwitchableE+0x11b7)[0x5585635f73c5]
rr(_ZN2rr13RecordSession11record_stepEv+0x127)[0x55856351817b]
rocallahan commented 1 year ago

It looks like after detaching a SIGSTOP in the detached task was reported to rr. I wonder what sent that signal.

rocallahan commented 1 year ago

Anyway I can reproduce this by sending SIGSTOP to the detached task, which I guess could happen in real life, so I'll fix that and hope that that also prevents this intermittent failure.