[TSAN] pthread_atfork callbacks are causing deadlocks.

tau0 commented 5 years ago

Some context, we have an alarm which kills the binary after 240s with std::exit(1), but it seems that the binary is being stuck during this kill.

It looks very suspicious that we were stuck in trying to acquire some lock in ReportRace, maybe it is even a reason why we hit a timeout, see trace:

1) std::exit(1) after 240s 2) Wait for some time ~400s 3) get a stack trace and kill it with SIGKILL.

   Timing out. Killing processes in state: S (sleeping)
    Attempting to pull stack traces: 
    [Thread debugging using libthread_db enabled]
    Using host libthread_db library "/usr/local/fbcode/platform007/lib/libthread_db.so.1".
    0x000056399741cdd6 in __sanitizer::BlockingMutex::Lock() ()

    Thread 1 (Thread 0x7fb1459001c0 (LWP 2314618)):
    #0  0x000056399741cdd6 in __sanitizer::BlockingMutex::Lock() ()
    #1  0x0000563997487350 in __tsan::ReportRace(__tsan::ThreadState*) ()
    #2  0x000056399748c22d in __tsan_report_race_thunk ()
    #3  0x000056399747ff11 in __tsan_write8 ()
    #4  0x00005639958f2868 in folly::threadlocal_detail::StaticMeta<void, void>::onForkChild() ()
    #5  0x00005639958e7aac in void folly::detail::function::FunctionTraits<void ()>::callSmall<void (*)()>(folly::detail::function::Data&) ()
    #6  0x00005639970e120c in folly::detail::(anonymous namespace)::AtForkList::child() ()
    #7  0x00007fb14333a602 in fork () from /usr/local/fbcode/platform007/lib/libc.so.6
    #8  0x00005639974339b3 in fork ()
    #9  0x000056399704c90a in folly::Subprocess::spawnInternal(std::unique_ptr<char const* [], std::default_delete<char const* []> >, char const*, folly::Subprocess::Options&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const*, int) ()
    #10 0x000056399704bb96 in folly::Subprocess::spawn(std::unique_ptr<char const* [], std::default_delete<char const* []> >, char const*, folly::Subprocess::Options const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const*) ()
    #11 0x000056399704b9e4 in folly::Subprocess::Subprocess(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, folly::Subprocess::Options const&, char const*, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const*) ()
    #12 0x0000563995e90162 in facebook::logdevice::IntegrationTestUtils::Node::start() ()
    #13 0x0000563995e8562e in facebook::logdevice::IntegrationTestUtils::Cluster::start(std::vector<short, std::allocator<short> >) ()
    #14 0x0000563995e4ecde in (anonymous namespace)::TailRecordIntegrationTest_SequencerFailOver_Test::TestBody() ()
    #15 0x00005639974b8ebf in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) ()
    #16 0x00005639974aa6ab in testing::Test::Run() ()
    #17 0x00005639974aa801 in testing::TestInfo::Run() ()
    #18 0x00005639974aa914 in testing::TestCase::Run() ()
    #19 0x00005639974ab034 in testing::internal::UnitTestImpl::RunAllTests() ()
    #20 0x00005639974ab2b5 in testing::UnitTest::Run() ()
    #21 0x0000563996f94c97 in main ()

PS: Compiled with clang 8.0.

tau0 commented 5 years ago

For me it seems to be very similar to https://github.com/google/sanitizers/issues/788, but I'm not sure. I.e. we killed the binary during data race report.

dvyukov commented 5 years ago

I think I see a potential problem in the stack trace.

In the fork interceptor tsan locks report_mtx:

TSAN_INTERCEPTOR(int, fork, int fake) {
  if (in_symbolizer())
    return REAL(fork)(fake);
  SCOPED_INTERCEPTOR_RAW(fork, fake);
  ForkBefore(thr, pc);
  ...

void ForkBefore(ThreadState *thr, uptr pc) {
  ctx->thread_registry->Lock();
  ctx->report_mtx.Lock();
}

It was assumed that fork does not call any instrumented user code.

But in this case fork calls folly::detail::(anonymous namespace)::AtForkList::child() which is instrumented and triggers a race, which tries to lock report_mtx again, which deadlocks.

There may be several potential solutions. We may try to intercept pthread_atfork and run the callbacks ourselves. Or register own callback and try to make tsan wrappers around the fork the innermost ones. Or we may try to set after_multithreaded_fork and other ignores earlier, so that we don't try to report the race in the folly callbacks. I am not sure which one is better.

tau0 commented 5 years ago

EDIT: Ignore this comment. Incorrectly read the code.

matt-c-clark commented 4 years ago

I am seeing this issue as well. @dvyukov do you have any ideas on how to proceed? Is there any data I could provide to help?

dvyukov commented 4 years ago

Maybe this https://github.com/llvm/llvm-project/commit/be41a98ac222f33ed5558d86e1cede67249e99b5 fixes it? Looks similar.

google / sanitizers

[TSAN] pthread_atfork callbacks are causing deadlocks. #1116