bernhardu closed this issue 3 years ago
Can you pack the trace and send it to me?
I no longer have the recording from above because these VMs are only temporary. I made a new recording with git 6fb44fbeb. The first few attempts did not reproduce the failure. When I put some load on the system by running tests in the background, it started to fail again. Trace: pid_ns_reap-2.tar.gz
$ gdb -q -ex run --args bin/rr replay -a pid_ns_reap-2
Reading symbols from bin/rr...
Starting program: /home/bernhard/data/entwicklung/2021/rr/2021-04-25/x86_64/obj/bin/rr replay -a pid_ns_reap-2
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Program received signal SIGSEGV, Segmentation fault.
0x0000555555a86136 in rr::cpuid (code=0, subrequest=0) at /home/bernhard/data/entwicklung/2021/rr/2021-04-25/rr/src/util.cc:846
846 asm volatile("cpuid"
(gdb) cont
Continuing.
[Detaching after fork from child process 72281]
Program received signal SIGSEGV, Segmentation fault.
0x0000555555857650 in std::__shared_ptr<rr::AddressSpace, (__gnu_cxx::_Lock_policy)2>::get (this=0x108) at /usr/include/c++/10/bits/shared_ptr_base.h:1325
1325 { return _M_ptr; }
(gdb) bt
#0 0x0000555555857650 in std::__shared_ptr<rr::AddressSpace, (__gnu_cxx::_Lock_policy)2>::get (this=0x108) at /usr/include/c++/10/bits/shared_ptr_base.h:1325
#1 0x0000555555851916 in std::__shared_ptr_access<rr::AddressSpace, (__gnu_cxx::_Lock_policy)2, false, false>::_M_get (this=0x108) at /usr/include/c++/10/bits/shared_ptr_base.h:1024
#2 0x000055555584bfb0 in std::__shared_ptr_access<rr::AddressSpace, (__gnu_cxx::_Lock_policy)2, false, false>::operator-> (this=0x108) at /usr/include/c++/10/bits/shared_ptr_base.h:1018
#3 0x0000555555a5072a in rr::Task::write_bytes_helper_no_notifications (this=0x0, addr=..., buf_size=1, buf=0x555555caff70, ok=0x0, flags=0) at /home/bernhard/data/entwicklung/2021/rr/2021-04-25/rr/src/Task.cc:2881
#4 0x0000555555a505db in rr::Task::write_bytes_helper (this=0x0, addr=..., buf_size=1, buf=0x555555caff70, ok=0x0, flags=0) at /home/bernhard/data/entwicklung/2021/rr/2021-04-25/rr/src/Task.cc:2868
#5 0x00005555559f1ac3 in rr::ReplayTask::apply_all_data_records_from_trace (this=0x555555cba420) at /home/bernhard/data/entwicklung/2021/rr/2021-04-25/rr/src/ReplayTask.cc:131
#6 0x00005555559da6d0 in rr::ReplaySession::exit_syscall (this=0x555555c9d980, t=0x555555cba420) at /home/bernhard/data/entwicklung/2021/rr/2021-04-25/rr/src/ReplaySession.cc:647
#7 0x00005555559de738 in rr::ReplaySession::try_one_trace_step (this=0x555555c9d980, t=0x555555cba420, constraints=...) at /home/bernhard/data/entwicklung/2021/rr/2021-04-25/rr/src/ReplaySession.cc:1487
#8 0x00005555559dfb58 in rr::ReplaySession::replay_step (this=0x555555c9d980, constraints=...) at /home/bernhard/data/entwicklung/2021/rr/2021-04-25/rr/src/ReplaySession.cc:1769
#9 0x00005555559d666a in rr::ReplaySession::replay_step (this=0x555555c9d980, command=rr::RUN_CONTINUE) at /home/bernhard/data/entwicklung/2021/rr/2021-04-25/rr/src/ReplaySession.h:286
#10 0x00005555559d5083 in rr::serve_replay_no_debugger (trace_dir="pid_ns_reap-2", flags=...) at /home/bernhard/data/entwicklung/2021/rr/2021-04-25/rr/src/ReplayCommand.cc:370
#11 0x00005555559d55ed in rr::replay (trace_dir="pid_ns_reap-2", flags=...) at /home/bernhard/data/entwicklung/2021/rr/2021-04-25/rr/src/ReplayCommand.cc:458
#12 0x00005555559d634f in rr::ReplayCommand::run (this=0x555555c89840 <rr::ReplayCommand::singleton>, args=std::vector of length 0, capacity 4) at /home/bernhard/data/entwicklung/2021/rr/2021-04-25/rr/src/ReplayCommand.cc:624
#13 0x0000555555aa2c81 in main (argc=4, argv=0x7fffffffe528) at /home/bernhard/data/entwicklung/2021/rr/2021-04-25/rr/src/main.cc:249
It also happens with current git head e8331b9f, but again I could observe it only with some load on the system. Trace: pid_ns_reap-5.tar.gz
I see what the problem is. Can you try this patch? https://gist.github.com/rocallahan/eee63245a75b6f27b5dce4548e446191
With your patch on top of e8331b9fe I could not reproduce the crash in 40 attempts in my VM under load. Before the patch, the issue was visible in nearly every run. The test suite is still running ...
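For anyone repeating this kind of check, a minimal sketch of a retry loop that counts failing runs could look like the following. The helper name `count_failures` is hypothetical and not part of rr; the exact command used for the 40 attempts is not stated in this thread, so the command to run is left as a parameter.

```shell
# count_failures N CMD [ARGS...]
# Hypothetical helper (not part of rr): runs CMD N times and prints
# how many of those runs exited with a non-zero status. A process
# killed by a signal such as SIGSEGV also counts as a failure, since
# the shell reports it as a non-zero status.
count_failures() {
  attempts=$1
  shift
  failures=0
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # Discard the command's output; only the exit status matters here.
    if ! "$@" >/dev/null 2>&1; then
      failures=$((failures + 1))
    fi
    i=$((i + 1))
  done
  echo "$failures"
}
```

It would be called as `count_failures 40 ./record-and-replay.sh`, where `record-and-replay.sh` is an assumed wrapper script that records the test and then replays the new trace, while the background load runs separately. A result of 0 corresponds to "could not reproduce in 40 attempts".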
Great!
Fixed by 9814e444da4cd3e858143dfcd00dd93abb873f94
... and the tests for 64-bit rr finished with all succeeding (two on a second attempt). The tests for 32-bit rr on a 32-bit kernel have now also finished, all successfully. Thank you for fixing this.
Because my system for running rr-capable VMs was already up, I gave the rr tests a try yesterday on a 32-bit kernel. There I found the pid_ns_reap test just hanging. With the latest commit b04220a, recording now finishes on 32-bit, but rr crashes on replay. I could then also reproduce the following crash on the same VM host in a 64-bit VM with 64-bit rr. I cannot see this crash on my faster main machine.