rr-debugger / rr

Record and Replay Framework
http://rr-project.org/
Other
9.13k stars 583 forks source link

Use rr for randomized fault injection #778

Open joneschrisg opened 10 years ago

joneschrisg commented 10 years ago

Complementary to @rocallahan's gecko chaos mode. There are bazillion other tools that do this, but implementing this in rr is particularly compelling because

This could be done with relatively little pain by adding a fault injection hook to rec_process_event.cc, and then recording pseudo-events when faults are injected so that they can be replayed.

amccreight commented 8 years ago

I'm sort of interested in looking at this. What is the current equivalent of rec_process_event.cc?

rocallahan commented 8 years ago

record_syscall.cc

zouyonghao commented 9 months ago

I'm working on this. I modified the function process_syscall_results based on my need like:

   if (should_emulate_result) {
     Registers r = t->regs();
     r.set_syscall_result(emulated_result);
     t->set_regs(r);
   }

+  Registers r = t->regs();
+  r.set_syscall_result(random_values);
+  t->set_regs(r);

   for (auto& action : after_syscall_actions) {
     action(t);
   }

It works for recording, but when I tried to replay, I sometimes got

[ERROR src/Registers.cc:299:maybe_print_reg_mismatch()] r12 0x1 != 0xfffffffffffffff7 (replaying vs. recorded)
[FATAL src/Registers.cc:409:compare_register_files()]
 (task 138097 (rec:46669) at time 11761)
 -> Assertion `!bail_error || match' failed to hold. Fatal register mismatch (ticks/rec:2391/2384)

Any suggestions for this, where should I modify correspondingly @rocallahan ?

Many thanks!

khuey commented 9 months ago

Which syscalls are you messing with? I could see this causing problems for syscalls that have special handling during replay.

zouyonghao commented 9 months ago

several file/network related:

case __NR_read:
case __NR_write:
case __NR_readv:
case __NR_preadv:
case __NR_preadv2:
case __NR_writev:
case __NR_pwritev:
case __NR_pwritev2:
case __NR_recvfrom:
case __NR_sendto:
case __NR_recvmsg:
case __NR_sendmsg:
case __NR_recvmmsg:
case __NR_sendmmsg:
rocallahan commented 9 months ago

Push the actual branch somewhere so we can see the details?

zouyonghao commented 9 months ago

Thanks for the reply! The code is inside a research project that will only be made opensource after acceptance, so sorry for that.

But the code logic is like: In record_syscall.cc:

  if (should_emulate_result) {
    Registers r = t->regs();
    r.set_syscall_result(emulated_result);
    t->set_regs(r);
  }

  + call_other_func(t);

  for (auto& action : after_syscall_actions) {
    action(t);
  }

In the file of call_other_func:

switch (t->regs().original_syscallno()) {
  case __NR_write:
  case __NR_readv:
  case __NR_preadv:
  case __NR_preadv2:
  case __NR_writev:
  case __NR_pwritev:
  case __NR_pwritev2:
  case __NR_recvfrom:
  case __NR_sendto:
  case __NR_recvmsg:
  case __NR_sendmsg:
  case __NR_recvmmsg:
  case __NR_sendmmsg:
    if (condition_value) {
      ret_val = error_code;
      r.set_syscall_result(-ret_val);
      t->set_regs(r);
    }
  break;
}

I also try to know which syscall that causes the mismatch and print the syscallno in the enter_syscall in replay session, it seems it is related to the recvfrom syscall:

enter syscall: 14 original syscall no:-1
enter syscall: -9 original syscall no:45
[ERROR src/Registers.cc:299:maybe_print_reg_mismatch()] r12 0x1 != 0xfffffffffffffff7 (replaying vs. recorded)
[FATAL src/Registers.cc:417:compare_register_files()]
 (task 2576894 (rec:46669) at time 11761)
 -> Assertion `!bail_error || match' failed to hold. reg1:-38 reg2:-38
Fatal register mismatch (ticks/rec:2391/2384)
Tail of trace dump:
{
  real_time:16138237.420949 global_time:11741, event:`SYSCALL: sysinfo' (state:ENTERING_SYSCALL) tid:46589, ticks:106526269
rax:0xffffffffffffffda rbx:0x7ffff459afa0 rcx:0xffffffffffffffff rdx:0x1d0909b38 rsi:0x7ffff469aa40 rdi:0x7ffff469aa90 rbp:0x7ffff459afa0 rsp:0x7ffff459ad40 r8:0x7ffff5e0ffa8 r9:0x7ffff0064680 r10:0x31 r11:0x246 r12:0x7ffff0063800 r13:0x7ffff469ad40 r14:0x7ffff009db20 r15:0x7ffff5e0ffa8 rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0x63 fs_base:0x7ffff469b700 gs_base:0x0
}
{
  real_time:16138237.420961 global_time:11742, event:`SYSCALLBUF_RESET' tid:46589, ticks:106526269
}
{
  real_time:16138237.421032 global_time:11743, event:`SYSCALL: sysinfo' (state:EXITING_SYSCALL) tid:46589, ticks:106526269
rax:0x0 rbx:0x7ffff459afa0 rcx:0xffffffffffffffff rdx:0x1d0909b38 rsi:0x7ffff469aa40 rdi:0x7ffff469aa90 rbp:0x7ffff459afa0 rsp:0x7ffff459ad40 r8:0x7ffff5e0ffa8 r9:0x7ffff0064680 r10:0x31 r11:0x246 r12:0x7ffff0063800 r13:0x7ffff469ad40 r14:0x7ffff009db20 r15:0x7ffff5e0ffa8 rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0x63 fs_base:0x7ffff469b700 gs_base:0x0
  { tid:46589, addr:0x7ffff469aa90, length:0x70 }
}
{
  real_time:16138237.421243 global_time:11744, event:`SYSCALLBUF_FLUSH' tid:46589, ticks:106526432
  { syscall:'clock_gettime', ret:0x0, size:0x20 }
}
{
  real_time:16138237.421253 global_time:11745, event:`SYSCALL: futex' (state:ENTERING_SYSCALL) tid:46589, ticks:106526432
rax:0xffffffffffffffda rbx:0x7ffff459afa0 rcx:0xffffffffffffffff rdx:0x0 rsi:0x89 rdi:0x7ffff0064c78 rbp:0x7ffff459afa0 rsp:0x7ffff459ad40 r8:0x0 r9:0xffffffff r10:0x7ffff469ab90 r11:0x246 r12:0x7ffff0064c74 r13:0x7ffff0064c28 r14:0x7ffff0064c78 r15:0x7ffff469ab00 rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xca fs_base:0x7ffff469b700 gs_base:0x0
}
{
  real_time:16138237.421265 global_time:11746, event:`SYSCALLBUF_RESET' tid:46589, ticks:106526432
}
{
  real_time:16138237.421371 global_time:11747, event:`SYSCALL: futex' (state:EXITING_SYSCALL) tid:46584, ticks:110259083
rax:0x0 rbx:0x7ffff489bfa0 rcx:0xffffffffffffffff rdx:0x0 rsi:0x89 rdi:0x7ffff0062b78 rbp:0x7ffff489bfa0 rsp:0x7ffff489bd40 r8:0x0 r9:0xffffffff r10:0x7ffff499bb90 r11:0x246 r12:0x7ffff0062b74 r13:0x7ffff0062b28 r14:0x7ffff0062b78 r15:0x7ffff499bb00 rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xca fs_base:0x7ffff499c700 gs_base:0x0
}
{
  real_time:16138237.421566 global_time:11748, event:`SYSCALLBUF_FLUSH' tid:46584, ticks:110259238
  { syscall:'futex', ret:0x0, size:0x14 }
  { syscall:'clock_gettime', ret:0x0, size:0x20 }
}
{
  real_time:16138237.421577 global_time:11749, event:`SYSCALL: futex' (state:ENTERING_SYSCALL) tid:46584, ticks:110259238
rax:0xffffffffffffffda rbx:0x7ffff489bfa0 rcx:0xffffffffffffffff rdx:0x0 rsi:0x89 rdi:0x7ffff0062b7c rbp:0x7ffff489bfa0 rsp:0x7ffff489bd40 r8:0x0 r9:0xffffffff r10:0x7ffff499bb90 r11:0x246 r12:0x7ffff0062b74 r13:0x7ffff0062b28 r14:0x7ffff0062b7c r15:0x7ffff499bb00 rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xca fs_base:0x7ffff499c700 gs_base:0x0
}
{
  real_time:16138237.421588 global_time:11750, event:`SYSCALLBUF_RESET' tid:46584, ticks:110259238
}
{
  real_time:16138237.421680 global_time:11751, event:`SYSCALL: futex' (state:EXITING_SYSCALL) tid:46596, ticks:44243
rax:0xffffffffffffff92 rbx:0x7fffd22dcfa0 rcx:0xffffffffffffffff rdx:0x0 rsi:0x89 rdi:0x7ffff00a427c rbp:0x7fffd22dcfa0 rsp:0x7fffd22dcd40 r8:0x0 r9:0xffffffff r10:0x7fffd23dccb0 r11:0x246 r12:0x7ffff00a4274 r13:0x7ffff00a4228 r14:0x7ffff00a427c r15:0x7fffd23dcc20 rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xca fs_base:0x7fffd23dd700 gs_base:0x0
}
{
  real_time:16138237.421876 global_time:11752, event:`SYSCALLBUF_FLUSH' tid:46596, ticks:44508
  { syscall:'futex', ret:0x0, size:0x14 }
  { syscall:'clock_gettime', ret:0x0, size:0x20 }
  { syscall:'clock_gettime', ret:0x0, size:0x20 }
  { syscall:'clock_gettime', ret:0x0, size:0x20 }
  { syscall:'clock_gettime', ret:0x0, size:0x20 }
}
{
  real_time:16138237.421886 global_time:11753, event:`SYSCALL: futex' (state:ENTERING_SYSCALL) tid:46596, ticks:44508
rax:0xffffffffffffffda rbx:0x7fffd22dcfa0 rcx:0xffffffffffffffff rdx:0x0 rsi:0x89 rdi:0x7ffff00a427c rbp:0x7fffd22dcfa0 rsp:0x7fffd22dcd40 r8:0x0 r9:0xffffffff r10:0x7fffd23dccb0 r11:0x246 r12:0x7ffff00a4274 r13:0x7ffff00a4228 r14:0x7ffff00a427c r15:0x7fffd23dcc20 rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xca fs_base:0x7fffd23dd700 gs_base:0x0
}
{
  real_time:16138237.421898 global_time:11754, event:`SYSCALLBUF_RESET' tid:46596, ticks:44508
}
{
  real_time:16138237.428152 global_time:11755, event:`SYSCALL: poll' (state:EXITING_SYSCALL) tid:46652, ticks:17337850
rax:0x1 rbx:0x7fffd0ab8fa0 rcx:0xffffffffffffffff rdx:0x190 rsi:0x1 rdi:0x7fffd0ba8258 rbp:0x7fffd0ab8fa0 rsp:0x7fffd0ab8d40 r8:0x0 r9:0x89aea888 r10:0x12e r11:0x246 r12:0x0 r13:0x7fffd0ba8258 r14:0x712040de r15:0x712040e6 rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0x7 fs_base:0x7fffd0bb9700 gs_base:0x0
  { tid:46652, addr:0x7fffd0ba8258, length:0x8 }
}
{
  real_time:16138237.428493 global_time:11756, event:`SYSCALLBUF_FLUSH' tid:46652, ticks:17347329
  { syscall:'recvfrom', ret:0x14, size:0x24, desched:1 }
  { syscall:'sendto', ret:0x14, size:0x10, desched:1 }
  { syscall:'clock_gettime', ret:0x0, size:0x20 }
  { syscall:'poll', ret:0x0, size:0x18 }
}
{
  real_time:16138237.428526 global_time:11757, event:`SYSCALL: poll' (state:ENTERING_SYSCALL) tid:46652, ticks:17347329
rax:0xffffffffffffffda rbx:0x7fffd0ab8fa0 rcx:0xffffffffffffffff rdx:0x190 rsi:0x1 rdi:0x7fffd0ba8278 rbp:0x7fffd0ab8fa0 rsp:0x7fffd0ab8d40 r8:0x0 r9:0x89aea888 r10:0x134 r11:0x246 r12:0x0 r13:0x7fffd0ba8278 r14:0x71204086 r15:0x7120408e rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0x7 fs_base:0x7fffd0bb9700 gs_base:0x0
}
{
  real_time:16138237.428539 global_time:11758, event:`SYSCALLBUF_RESET' tid:46652, ticks:17347329
}
{
  real_time:16138237.450134 global_time:11759, event:`SYSCALL: recvfrom' (state:EXITING_SYSCALL) tid:46669, ticks:2377
rax:0xfffffffffffffff7 rbx:0x7170404e rcx:0xffffffffffffffff rdx:0x2000 rsi:0x7170405e rdi:0x60 rbp:0x0 rsp:0x7fffb3afed30 r8:0x0 r9:0x0 r10:0x0 r11:0x246 r12:0x0 r13:0x7170405e r14:0x0 r15:0x2000 rip:0x7000000e eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0x2d fs_base:0x7fffb3bff700 gs_base:0x0
  { tid:46669, addr:0x7170404e, length:0x8 }
  { tid:46669, addr:0x7170405e, length:0x2000 }
}
{
  real_time:16138237.450155 global_time:11760, event:`SYSCALLBUF_ABORT_COMMIT' tid:46669, ticks:2377
}
{
  real_time:16138237.450609 global_time:11761, event:`SYSCALL: rrcall_notify_syscall_hook_exit' (state:ENTERING_SYSCALL) tid:46669, ticks:2384
rax:0xffffffffffffffda rbx:0x7fffb3afefa0 rcx:0xffffffffffffffff rdx:0x2000 rsi:0x7fffb3bee5b0 rdi:0x60 rbp:0x7fffb3afefa0 rsp:0x7fffb3afee20 r8:0x0 r9:0x0 r10:0x0 r11:0x246 r12:0xfffffffffffffff7 r13:0x7fffb3bee530 r14:0x7fffb3bff680 r15:0x7ffff403e210 rip:0x70000005 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0x3ea fs_base:0x7fffb3bff700 gs_base:0x0
}
{
  real_time:16138237.450634 global_time:11762, event:`SYSCALLBUF_RESET' tid:46669, ticks:2384
}
=== Start rr backtrace:
../../build/rr/bin/rr(_ZN2rr13dump_rr_stackEv+0x5d)[0x555555b348e5]
../../build/rr/bin/rr(_ZN2rr9GdbServer15emergency_debugEPNS_4TaskE+0xf8)[0x55555591d9a6]
../../build/rr/bin/rr(+0x402ec4)[0x555555956ec4]
../../build/rr/bin/rr(_ZN2rr21EmergencyDebugOstreamD1Ev+0x66)[0x55555595714c]
../../build/rr/bin/rr(_ZN2rr9Registers22compare_register_filesEPNS_10ReplayTaskEPKcRKS0_S4_S6_NS_16MismatchBehaviorE+0x1cf)[0x555555a578d9]
../../build/rr/bin/rr(_ZN2rr10ReplayTask13validate_regsEj+0x219)[0x555555a806b1]
../../build/rr/bin/rr(+0x51005e)[0x555555a6405e]
../../build/rr/bin/rr(_ZN2rr13ReplaySession13enter_syscallEPNS_10ReplayTaskERKNS0_15StepConstraintsE+0x55b)[0x555555a645d9]
../../build/rr/bin/rr(_ZN2rr13ReplaySession18try_one_trace_stepEPNS_10ReplayTaskERKNS0_15StepConstraintsE+0x127)[0x555555a69949]
../../build/rr/bin/rr(_ZN2rr13ReplaySession11replay_stepERKNS0_15StepConstraintsE+0x1d7)[0x555555a6b1a9]
../../build/rr/bin/rr(_ZN2rr13ReplaySession11replay_stepENS_10RunCommandE+0x4f)[0x555555a60227]
../../build/rr/bin/rr(+0x50a847)[0x555555a5e847]
../../build/rr/bin/rr(+0x50b036)[0x555555a5f036]
../../build/rr/bin/rr(_ZN2rr13ReplayCommand3runERSt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaIS7_EE+0x422)[0x555555a5fe88]
../../build/rr/bin/rr(main+0x289)[0x55555596a960]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7ffff792f083]
../../build/rr/bin/rr(_start+0x2e)[0x55555585e7ee]
=== End rr backtrace
Launch gdb with
  gdb '-l' '10000' '-ex' 'set sysroot /' '-ex' 'target extended-remote 127.0.0.1:20990' /usr/lib/jvm/java-11-openjdk-amd64/bin/java
rocallahan commented 9 months ago

Does the problem go away with rr record -n? You probably need to disable syscallbuf because record_syscall.cc won't see buffered syscalls and therefore won't be able to inject faults into them.

Otherwise, I don't know, but you should be able to debug it yourself.

rocallahan commented 9 months ago

Also, FWIW, instead of using __NR_write you should use Arch::write etc, otherwise your code won't work correctly when x86-64 rr records a 32-bit tracee. I assume this doesn't actually matter to you in practice.

zouyonghao commented 9 months ago

It works with rr record -n! Thank you so much @rocallahan !

I'll also follow your advice to write more compatible code.