rr-debugger / rr

Record and Replay Framework
http://rr-project.org/

Bug hard to hunt down under rr even with --chaos #2128

Open emilio opened 6 years ago

emilio commented 6 years ago

I remember @rocallahan was interested in this kind of thing, so I'm opening an issue.

See: https://github.com/servo/ipc-channel/pull/179

When I run the test on my machine, like:

```
$ true; while [ $? -eq 0 ]; do RUST_BACKTRACE=1 ./target/release/deps/ipc_channel-6611382eff9c2dbb stress; done
```

I manage to get something like:

```
failures:

---- test::stress_in_process stdout ----
thread 'test::stress_in_process' panicked at 'recv: Io(Error { repr: Os { code: 104, message: "Connection reset by peer" } })', /checkout/src/libcore/result.rs:906:4
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
stack backtrace:
   0: std::sys::imp::backtrace::tracing::imp::unwind_backtrace
             at /checkout/src/libstd/sys/unix/backtrace/tracing/gcc_s.rs:49
   1: std::sys_common::backtrace::_print
             at /checkout/src/libstd/sys_common/backtrace.rs:69
   2: std::panicking::default_hook::{{closure}}
             at /checkout/src/libstd/sys_common/backtrace.rs:58
             at /checkout/src/libstd/panicking.rs:381
   3: std::panicking::default_hook
             at /checkout/src/libstd/panicking.rs:391
   4: std::panicking::rust_panic_with_hook
             at /checkout/src/libstd/panicking.rs:577
   5: std::panicking::begin_panic
             at /checkout/src/libstd/panicking.rs:538
   6: std::panicking::begin_panic_fmt
             at /checkout/src/libstd/panicking.rs:522
   7: rust_begin_unwind
             at /checkout/src/libstd/panicking.rs:498
   8: core::panicking::panic_fmt
             at /checkout/src/libcore/panicking.rs:71
   9: core::result::unwrap_failed
  10: ipc_channel::test::stress_in_process
  11: _ZN42_$LT$F$u20$as$u20$test..FnBox$LT$T$GT$$GT$8call_box17hee77675d0689c9f5E.llvm.313E8C1E
             at /checkout/src/libtest/lib.rs:1480
             at /checkout/src/libcore/ops/function.rs:223
             at /checkout/src/libtest/lib.rs:141
  12: __rust_maybe_catch_panic
             at /checkout/src/libpanic_unwind/lib.rs:99

failures:
    test::stress_in_process

test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 69 filtered out
```

It fails consistently, after somewhere between 20 and 60 iterations.
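For what it's worth, the `code: 104` in that panic is ECONNRESET. As a minimal sketch of one way that can show up on a Linux AF_UNIX socketpair (my own illustration using std's `UnixStream::pair`, not the confirmed cause inside ipc-channel): if one end is closed while it still has unread data queued, the kernel flags the peer, and the peer's next recv fails with "Connection reset by peer" instead of getting a clean EOF.

```
use std::io::{Read, Write};
use std::os::unix::net::UnixStream;

fn main() -> std::io::Result<()> {
    // socketpair(AF_UNIX, SOCK_STREAM, ...) under the hood.
    let (mut a, b) = UnixStream::pair()?;

    // `a` sends a message that `b` never reads.
    a.write_all(b"ping")?;

    // `b` is closed while that message is still queued in its receive
    // buffer; on Linux the kernel marks the peer socket with ECONNRESET.
    drop(b);

    // The next read on `a` reports os error 104 ("Connection reset by
    // peer") rather than returning 0 bytes for EOF.
    let mut buf = [0u8; 16];
    match a.read(&mut buf) {
        Ok(n) => println!("read {} bytes", n),
        Err(e) => println!("recv failed: {}", e), // ECONNRESET here
    }
    Ok(())
}
```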

So the next logical step was to try hunting it with rr, using:

```
$ true; while [ $? -eq 0 ]; do RUST_BACKTRACE=1 rr record --chaos ./target/debug/deps/ipc_channel-0a671e7e49351578 stress && rm -rf ~/.local/share/rr/ipc_channel-*; done
```

So far I haven't succeeded (it's been 90 iterations already, each of which takes more than a minute when recording, in both debug and release).

Just filing this to see if it rings a bell or you find it interesting; the workload is mostly a bunch of socketpair / sendmsg / recvmsg syscalls.
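To give a rough idea of the shape of that workload (my own simplification, not the actual ipc-channel stress test): each round trip is a blocking send followed by a blocking receive on a socketpair, so every message costs at least two syscalls per side plus a switch to the peer thread and back.

```
use std::io::{Read, Write};
use std::os::unix::net::UnixStream;
use std::thread;

fn main() -> std::io::Result<()> {
    let (mut client, mut server) = UnixStream::pair()?;

    // "Server" thread: echo every byte back, one recv/send pair per message.
    let echo = thread::spawn(move || -> std::io::Result<()> {
        let mut buf = [0u8; 1];
        while server.read_exact(&mut buf).is_ok() {
            server.write_all(&buf)?;
        }
        Ok(())
    });

    // "Client" side: each iteration is a send plus a blocking receive, so
    // every message forces a context switch to the other thread and back.
    let mut buf = [0u8; 1];
    for i in 0..10_000u32 {
        client.write_all(&[i as u8])?;
        client.read_exact(&mut buf)?;
    }
    drop(client); // close -> the echo thread's read fails and it exits
    echo.join().unwrap()?;
    Ok(())
}
```

Every iteration of that loop hits the kernel several times, which is presumably why recording it is so much slower than running it natively.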

I'm running rr as of 244151f05a0a0c9315675c6278580a98cb97db69.

Also, something I've noticed about this test case is that recording it under rr is much slower than running it without rr, even in release, but that's kinda expected given how syscall-intensive it is.

rocallahan commented 6 years ago

If you're doing a lot of ping-ponging between processes then rr's slowness is probably because we make context switches a lot more expensive, and that is very difficult to address.

emilio commented 6 years ago

It should only be ping-ponging across threads there, not processes (though the same cost likely still applies).

rocallahan commented 6 years ago

OK, threads have the same issue.

Let me know when you've figured out the bug and maybe I can determine why chaos mode couldn't reproduce it, and fix it.