rr-debugger / rr

Record and Replay Framework
http://rr-project.org/
Other
9.21k stars 585 forks source link

test `pid_ns_reap` fails, crashes replaying rr since b04220a in "slow" VM. #2928

Closed bernhardu closed 3 years ago

bernhardu commented 3 years ago

Because my system capable of running rr capable VMs was already up, I gave the rr tests yesterday a try on a 32-bit kernel. There I found the pid_ns_reap test just hanging. With the latest commit b04220a recording now finishes in 32-bit, but crashes rr on replay. Following crash I could then find also on the same VM host with a 64-bit VM with 64-bit rr. This crash I can not see on my faster main machine.

benutzer@debian:/.../x86_64/obj$ bin/rr --check-cached-mmaps record bin/pid_ns_reap                        
rr: Saving execution to trace directory `/home/benutzer/.local/share/rr/pid_ns_reap-1'.
EXIT-SUCCESS
benutzer@debian:/.../x86_64/obj$ bin/rr --check-cached-mmaps replay -a pid_ns_reap-1                      
Segmentation fault (core dumped)
benutzer@debian:/.../x86_64/obj$ gdb -q -ex run --args bin/rr --check-cached-mmaps replay -a pid_ns_reap-1
Reading symbols from bin/rr...
Starting program: /.../x86_64/obj/bin/rr --check-cached-mmaps replay -a pid_ns_reap-1
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

Program received signal SIGSEGV, Segmentation fault.
0x0000555555a860ea in rr::cpuid (code=0, subrequest=0) at /.../rr/src/util.cc:846
846       asm volatile("cpuid"
(gdb) cont
Continuing.
[Detaching after fork from child process 34204]

Program received signal SIGSEGV, Segmentation fault.
0x0000555555857650 in std::__shared_ptr<rr::AddressSpace, (__gnu_cxx::_Lock_policy)2>::get (this=0x108) at /usr/include/c++/10/bits/shared_ptr_base.h:1325
1325          { return _M_ptr; }
(gdb) bt
#0  0x0000555555857650 in std::__shared_ptr<rr::AddressSpace, (__gnu_cxx::_Lock_policy)2>::get (this=0x108) at /usr/include/c++/10/bits/shared_ptr_base.h:1325
#1  0x0000555555851916 in std::__shared_ptr_access<rr::AddressSpace, (__gnu_cxx::_Lock_policy)2, false, false>::_M_get (this=0x108) at /usr/include/c++/10/bits/shared_ptr_base.h:1024
#2  0x000055555584bfb0 in std::__shared_ptr_access<rr::AddressSpace, (__gnu_cxx::_Lock_policy)2, false, false>::operator-> (this=0x108) at /usr/include/c++/10/bits/shared_ptr_base.h:1018
#3  0x0000555555a506de in rr::Task::write_bytes_helper_no_notifications (this=0x0, addr=..., buf_size=1, buf=0x555555cb9df0, ok=0x0, flags=0)
    at /.../rr/src/Task.cc:2881
#4  0x0000555555a5058f in rr::Task::write_bytes_helper (this=0x0, addr=..., buf_size=1, buf=0x555555cb9df0, ok=0x0, flags=0) at /.../rr/src/Task.cc:2868
#5  0x00005555559f1a77 in rr::ReplayTask::apply_all_data_records_from_trace (this=0x555555cba720) at /.../rr/src/ReplayTask.cc:131
#6  0x00005555559da684 in rr::ReplaySession::exit_syscall (this=0x555555c9d980, t=0x555555cba720) at /.../rr/src/ReplaySession.cc:647
#7  0x00005555559de6ec in rr::ReplaySession::try_one_trace_step (this=0x555555c9d980, t=0x555555cba720, constraints=...) at /.../rr/src/ReplaySession.cc:1487
#8  0x00005555559dfb0c in rr::ReplaySession::replay_step (this=0x555555c9d980, constraints=...) at /.../rr/src/ReplaySession.cc:1769
#9  0x00005555559d661e in rr::ReplaySession::replay_step (this=0x555555c9d980, command=rr::RUN_CONTINUE) at /.../rr/src/ReplaySession.h:286
#10 0x00005555559d5037 in rr::serve_replay_no_debugger (trace_dir="pid_ns_reap-1", flags=...) at /.../rr/src/ReplayCommand.cc:370
#11 0x00005555559d55a1 in rr::replay (trace_dir="pid_ns_reap-1", flags=...) at /.../rr/src/ReplayCommand.cc:458
#12 0x00005555559d6303 in rr::ReplayCommand::run (this=0x555555c89840 <rr::ReplayCommand::singleton>, args=std::vector of length 0, capacity 4)
    at /.../rr/src/ReplayCommand.cc:624
#13 0x0000555555aa2c31 in main (argc=5, argv=0x7fffffffe518) at /.../rr/src/main.cc:249
(gdb) display/i $pc
1: x/i $pc
=> 0x555555857650 <_ZNKSt12__shared_ptrIN2rr12AddressSpaceELN9__gnu_cxx12_Lock_policyE2EE3getEv+12>:    mov    (%rax),%rax
(gdb) print/x $rax
$1 = 0x108
(gdb) up
#1  0x0000555555851916 in std::__shared_ptr_access<rr::AddressSpace, (__gnu_cxx::_Lock_policy)2, false, false>::_M_get (this=0x108) at /usr/include/c++/10/bits/shared_ptr_base.h:1024
1024          { return static_cast<const __shared_ptr<_Tp, _Lp>*>(this)->get(); }
(gdb) 
#2  0x000055555584bfb0 in std::__shared_ptr_access<rr::AddressSpace, (__gnu_cxx::_Lock_policy)2, false, false>::operator-> (this=0x108) at /usr/include/c++/10/bits/shared_ptr_base.h:1018
1018            return _M_get();
(gdb) 
#3  0x0000555555a506de in rr::Task::write_bytes_helper_no_notifications (this=0x0, addr=..., buf_size=1, buf=0x555555cb9df0, ok=0x0, flags=0)
    at /.../rr/src/Task.cc:2881
2881      if (uint8_t* local_addr = as->local_mapping(addr, buf_size)) {
(gdb) 
#4  0x0000555555a5058f in rr::Task::write_bytes_helper (this=0x0, addr=..., buf_size=1, buf=0x555555cb9df0, ok=0x0, flags=0) at /.../rr/src/Task.cc:2868
2868      ssize_t nwritten = write_bytes_helper_no_notifications(addr, buf_size, buf, ok, flags);
(gdb) up
#5  0x00005555559f1a77 in rr::ReplayTask::apply_all_data_records_from_trace (this=0x555555cba720) at /.../rr/src/ReplayTask.cc:131
131           t->write_bytes_helper(buf.addr, buf.data.size(), buf.data.data());
(gdb) print t
$2 = (rr::ReplayTask *) 0x0
(gdb) print this
$3 = (rr::ReplayTask * const) 0x555555cba720
(gdb) print *this
$4 = {<rr::Task> = {_vptr.Task = 0x555555c85ab8 <vtable for rr::ReplayTask+16>, scratch_ptr = {ptr = 140039639130112}, scratch_size = 2097152, desched_fd_child = -1, cloned_file_data_fd_child = -1, 
    cloned_file_data_fname = "", hpc = {counting_period = 1152921504606846976, tid = 34224, fd_ticks_measure = {fd = 14}, fd_minus_ticks_measure = {fd = -1}, fd_ticks_interrupt = {fd = 7}, 
      fd_useless_counter = {fd = -1}, fd_ticks_in_transaction = {fd = -1}, fd_strex_counter = {fd = -1}, ticks_semantics_ = rr::TICKS_RETIRED_CONDITIONAL_BRANCHES, started = true, counting = false}, 
    tid = 34224, rec_tid = 32737, own_namespace_rec_tid = 4, syscallbuf_size = 1048576, syscallbuf_child = {ptr = 0}, preload_globals = {ptr = 140039655916096}, 
    thread_locals = "\000\340\021\205]\177", '\000' <repeats 97 times>, serial = 6, as = std::shared_ptr<rr::AddressSpace> (use count 1, weak count 0) = {get() = 0x555555c9e950}, 
    fds = std::shared_ptr<rr::FdTable> (use count 1, weak count 0) = {get() = 0x555555c9ec60}, prname = "pid_ns_reap", ticks = 822, registers = {arch_ = rr::x86_64, u = {x86regs = {ebx = 0, ecx = 0, 
          edx = -2047968536, esi = 32605, edi = 4, ebp = 0, eax = -2062426704, xds = 32605, xes = 3, xfs = 0, xgs = -2062426208, orig_eax = 32605, eip = 582, xcs = 0, eflags = 0, esp = 0, xss = 0}, 
        x64regs = {r15 = 0, r14 = 140039655684840, r13 = 4, r12 = 140039641226672, rbp = 3, rbx = 140039641227168, r11 = 582, r10 = 0, r9 = 0, r8 = 0, rax = 18446744073709551578, 
          rcx = 18446744073709551615, rdx = 100, rsi = 1030, rdi = 3, orig_rax = 72, rip = 1879048197, cs = 51, eflags = 582, rsp = 140039641226496, ss = 43, fs_base = 140039649617664, gs_base = 0, 
          ds = 0, es = 0, fs = 0, gs = 0}, arm64regs = {{_ptrace = {x = {0, 140039655684840, 4, 140039641226672, 3, 140039641227168, 582, 0, 0, 0, 18446744073709551578, 18446744073709551615, 100, 1030, 3, 
                72, 1879048197, 51, 582, 140039641226496, 43, 140039649617664, 0, 0, 0, 0, 0, 0, 0, 0, 0}, sp = 0, pc = 0, pstate = 0}, {x = {0, 140039655684840, 4, 140039641226672, 3, 140039641227168, 
                582, 0, 0, 0, 18446744073709551578, 18446744073709551615, 100, 1030, 3, 72, 1879048197, 51, 582, 140039641226496, 43, 140039649617664, 0, 0, 0, 0, 0, 0, 0, 0, 0}, sp = 0, pc = 0, 
              pstate = 0}}, orig_x0 = 0, orig_syscall = 0}}}, address_of_last_execution_resume = {ptr = 1879048197}, how_last_execution_resumed = rr::RESUME_SYSEMU, last_resume_orig_cx = 0, 
    singlestepping_instruction = rr::TrappedInstruction::NONE, did_set_breakpoint_after_cpuid = false, is_stopped = true, seccomp_bpf_enabled = false, detected_unexpected_exit = false, 
    registers_dirty = true, orig_syscallno_dirty = true, extra_registers = {format_ = rr::ExtraRegisters::NONE, arch_ = rr::x86_64, data_ = std::vector of length 0, capacity 0}, 
    extra_registers_known = false, session_ = 0x555555c9d980, tg = std::shared_ptr<rr::ThreadGroup> (use count 2, weak count 0) = {get() = 0x555555cb2110}, 
    thread_areas_ = std::vector of length 0, capacity 0, top_of_stack = {ptr = 140039649615792}, wait_status = {status = 1407}, pending_siginfo = {si_signo = 5, si_errno = 0, si_code = 128, __pad0 = 0, 
      _sifields = {_pad = {0 <repeats 28 times>}, _kill = {si_pid = 0, si_uid = 0}, _timer = {si_tid = 0, si_overrun = 0, si_sigval = {sival_int = 0, sival_ptr = 0x0}}, _rt = {si_pid = 0, si_uid = 0, 
          si_sigval = {sival_int = 0, sival_ptr = 0x0}}, _sigchld = {si_pid = 0, si_uid = 0, si_status = 0, si_utime = 0, si_stime = 0}, _sigfault = {si_addr = 0x0, si_addr_lsb = 0, _bounds = {
            _addr_bnd = {_lower = 0x0, _upper = 0x0}, _pkey = 0}}, _sigpoll = {si_band = 0, si_fd = 0}, _sigsys = {_call_addr = 0x0, _syscall = 0, _arch = 0}}}, seen_ptrace_exit_event = false, 
    handled_ptrace_exit_event = false, properties_ = {values = std::vector of length 0, capacity 0}, expecting_ptrace_interrupt_stop = 0, was_reaped = false}, <No data fields>}
rocallahan commented 3 years ago

Can you pack the trace and send it to me?

bernhardu commented 3 years ago

The recording from above I don't have anymore because these VMs are just temporary ones. I retried a new recording with git 6fb44fbeb. The first few attempts did not succeed. When I put some load on the system by running tests in the background it started to fail again. pid_ns_reap-2.tar.gz

$ gdb -q -ex run --args bin/rr replay -a pid_ns_reap-2
Reading symbols from bin/rr...
Starting program: /home/bernhard/data/entwicklung/2021/rr/2021-04-25/x86_64/obj/bin/rr replay -a pid_ns_reap-2
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

Program received signal SIGSEGV, Segmentation fault.
0x0000555555a86136 in rr::cpuid (code=0, subrequest=0) at /home/bernhard/data/entwicklung/2021/rr/2021-04-25/rr/src/util.cc:846
846       asm volatile("cpuid"
(gdb) cont
Continuing.
[Detaching after fork from child process 72281]

Program received signal SIGSEGV, Segmentation fault.
0x0000555555857650 in std::__shared_ptr<rr::AddressSpace, (__gnu_cxx::_Lock_policy)2>::get (this=0x108) at /usr/include/c++/10/bits/shared_ptr_base.h:1325
1325          { return _M_ptr; }
(gdb) bt
#0  0x0000555555857650 in std::__shared_ptr<rr::AddressSpace, (__gnu_cxx::_Lock_policy)2>::get (this=0x108) at /usr/include/c++/10/bits/shared_ptr_base.h:1325
#1  0x0000555555851916 in std::__shared_ptr_access<rr::AddressSpace, (__gnu_cxx::_Lock_policy)2, false, false>::_M_get (this=0x108) at /usr/include/c++/10/bits/shared_ptr_base.h:1024
#2  0x000055555584bfb0 in std::__shared_ptr_access<rr::AddressSpace, (__gnu_cxx::_Lock_policy)2, false, false>::operator-> (this=0x108) at /usr/include/c++/10/bits/shared_ptr_base.h:1018
#3  0x0000555555a5072a in rr::Task::write_bytes_helper_no_notifications (this=0x0, addr=..., buf_size=1, buf=0x555555caff70, ok=0x0, flags=0) at /home/bernhard/data/entwicklung/2021/rr/2021-04-25/rr/src/Task.cc:2881
#4  0x0000555555a505db in rr::Task::write_bytes_helper (this=0x0, addr=..., buf_size=1, buf=0x555555caff70, ok=0x0, flags=0) at /home/bernhard/data/entwicklung/2021/rr/2021-04-25/rr/src/Task.cc:2868
#5  0x00005555559f1ac3 in rr::ReplayTask::apply_all_data_records_from_trace (this=0x555555cba420) at /home/bernhard/data/entwicklung/2021/rr/2021-04-25/rr/src/ReplayTask.cc:131
#6  0x00005555559da6d0 in rr::ReplaySession::exit_syscall (this=0x555555c9d980, t=0x555555cba420) at /home/bernhard/data/entwicklung/2021/rr/2021-04-25/rr/src/ReplaySession.cc:647
#7  0x00005555559de738 in rr::ReplaySession::try_one_trace_step (this=0x555555c9d980, t=0x555555cba420, constraints=...) at /home/bernhard/data/entwicklung/2021/rr/2021-04-25/rr/src/ReplaySession.cc:1487
#8  0x00005555559dfb58 in rr::ReplaySession::replay_step (this=0x555555c9d980, constraints=...) at /home/bernhard/data/entwicklung/2021/rr/2021-04-25/rr/src/ReplaySession.cc:1769
#9  0x00005555559d666a in rr::ReplaySession::replay_step (this=0x555555c9d980, command=rr::RUN_CONTINUE) at /home/bernhard/data/entwicklung/2021/rr/2021-04-25/rr/src/ReplaySession.h:286
#10 0x00005555559d5083 in rr::serve_replay_no_debugger (trace_dir="pid_ns_reap-2", flags=...) at /home/bernhard/data/entwicklung/2021/rr/2021-04-25/rr/src/ReplayCommand.cc:370
#11 0x00005555559d55ed in rr::replay (trace_dir="pid_ns_reap-2", flags=...) at /home/bernhard/data/entwicklung/2021/rr/2021-04-25/rr/src/ReplayCommand.cc:458
#12 0x00005555559d634f in rr::ReplayCommand::run (this=0x555555c89840 <rr::ReplayCommand::singleton>, args=std::vector of length 0, capacity 4) at /home/bernhard/data/entwicklung/2021/rr/2021-04-25/rr/src/ReplayCommand.cc:624
#13 0x0000555555aa2c81 in main (argc=4, argv=0x7fffffffe528) at /home/bernhard/data/entwicklung/2021/rr/2021-04-25/rr/src/main.cc:249

And with current git head e8331b9f it happens also, but could observe it also just with some load on the system. pid_ns_reap-5.tar.gz

rocallahan commented 3 years ago

I see what the problem is. Can you try this patch? https://gist.github.com/rocallahan/eee63245a75b6f27b5dce4548e446191

bernhardu commented 3 years ago

With your patch on top of e8331b9fe I could not reproduce the crash in 40 attempts in my VM under load. Before the issue was visible in nearly every run. Currently the test suite is still running ...

rocallahan commented 3 years ago

Great!

rocallahan commented 3 years ago

Fixed by 9814e444da4cd3e858143dfcd00dd93abb873f94

bernhardu commented 3 years ago

... and the tests for 64-bit rr finished with all succeding (two on a second attempt). And now 32-bit rr at a 32-bit kernel currently running ... and did finish all tests successfully. Thank you for fixing this.