google / sanitizers

AddressSanitizer, ThreadSanitizer, MemorySanitizer
Other
11.46k stars 1.03k forks source link

Does ThreadSanitizer affects signals? #1179

Open psionic12 opened 4 years ago

psionic12 commented 4 years ago

I have a project which uses a user signal to "interrupt" other signals(for details, please check this garbage collection library, I use signals to stop the world and collect).

And I have some test cases, and it all works fine.

However, I use -fsanitize=thread recently to make a further thread-safety test, and one of my case cannot receive my signal any more, and if I disable the sanitizer, the case can receive the signal again.

And since the project is complicated, I tried to make a simple reproducible program but it will capture the signal even when -fsanitize=thread is enabled.

So does -fsanitize=thread have any influence on signals? Or any one can give me some clue on how to debug this?

Here's my sigact:

  struct sigaction act;
  memset(&act, '\0', sizeof(act));
  act.sa_handler = stop_handler;
  sigemptyset(&act.sa_mask);
  act.sa_flags = 0;
  sigaction (MYGC_STOP_SIGNAL, &act, nullptr);

p.s. I tried both Clang and GCC and the result are the same(signal captured when fsanitize disabled, signal missed when fsanitize enabled)

kcc commented 4 years ago

Yes, TSAN plays some tricks with signals. If you can prepare a minimal reproducer we may be able to investigate what's not working.

@dvyukov can you provide a more detailed answer on this?

Clang vs GCC should not matter since the signal handling is in the run-time library, which is shared.

psionic12 commented 4 years ago

There's a gTest case in my project which can 100% reproduce this issue, but I cannot provide a minimal reproducer(tried many ways, but just cannot write a reproducer), any other clues I can offer to help you?

dvyukov commented 4 years ago

What is the piece of code that is supposed to receive the signal but is not receiving it? Is the code actively looping? Or blocked in some syscall?

psionic12 commented 4 years ago

What is the piece of code that is supposed to receive the signal but is not receiving it?

just a signal handler. I'm not sure if you suggest to paste the handler code here.

Is the code actively looping?

I don't understand

Or blocked in some syscall?

Most cases the thread which should receive the signal blocked at a mutex lock, is this the root case?

dvyukov commented 4 years ago

I meant the code that is supposed to be preempted by the signal. What kind of mutex lock? If you show the code and a gdb backtrace, maybe something simpler will pop up. Generally tsan delivers signals only at designated points, which are some/most syscalls and atomic operations. If code loops in an infinite loop, it may never receive the signal.

psionic12 commented 4 years ago

Here's the calling stack when the program hangs.

thread-1
__lll_lock_wait 0x00007fd4b3b1f10d
__GI___pthread_mutex_lock 0x00007fd4b3b18023
pthread_mutex_lock 0x000000000043a9ac
__gthread_mutex_lock gthr-default.h:748
std::mutex::lock std_mutex.h:103
std::lock_guard<std::mutex>::lock_guard std_mutex.h:162
mygc::GarbageCollector::isCompletedDescriptor GarbageCollector.cpp:222
mygc::GcReference::isCompletedDescriptor GcReference.cpp:39
mygc::make_gc<Tester<3, 512>>() gc_ptr.h:292
GCTest_gcTest_Test::TestBody GcTest.cpp:123
void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) move.h:99
void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) move.h:99
testing::Test::Run() move.h:99
testing::TestInfo::Run() move.h:99
testing::TestCase::Run() move.h:99
testing::internal::UnitTestImpl::RunAllTests() move.h:99
bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) move.h:99
bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) move.h:99
testing::UnitTest::Run() move.h:99
RUN_ALL_TESTS() move.h:99
main move.h:99
__libc_start_main 0x00007fd4b29f3b97
_start 0x0000000000421e3a

thread-2
__GI___nanosleep 0x00007fd4b2ab69d0
usleep 0x00007fd4b2ae9877
usleep 0x00000000004462b1
__tsan::BackgroundThread(void*) 0x0000000000485838
start_thread 0x00007fd4b3b156db
clone 0x00007fd4b2af388f

thread-3
futex_wait_cancelable 0x00007fd4b3b1b9f3
__pthread_cond_wait_common 0x00007fd4b3b1b9f3
__pthread_cond_wait 0x00007fd4b3b1b9f3
__tsan::call_pthread_cancel_with_cleanup(int (*)(void*, void*, void*), void*, void*, void*, void (*)(void*), void*) 0x000000000049b148
cond_wait(__tsan::ThreadState*, unsigned long, __tsan::ScopedInterceptor*, int (*)(void*, void*, void*), void*, void*, void*) [clone .constprop.134] 0x0000000000444c1c
pthread_cond_wait 0x0000000000444e89
std::condition_variable::wait(std::unique_lock<std::mutex>&) 0x00007fd4b383c86c
stop_the_world stop_the_world.cpp:53
mygc::GarbageCollector::stopTheWorldLocked GarbageCollector.cpp:78
mygc::GarbageCollector::New GarbageCollector.cpp:41
mygc::GcReference::gcAllocate GcReference.cpp:12
mygc::make_gc<Tester<3, 512>>() gc_ptr.h:294
GCTest_gcTest_Test::TestBody()::$_0::operator()() const GcTest.cpp:105
_ZSt13__invoke_implIvZN18GCTest_gcTest_Test8TestBodyEvE3$_0JEET_St14__invoke_otherOT0_DpOT1_ invoke.h:60
_ZSt8__invokeIZN18GCTest_gcTest_Test8TestBodyEvE3$_0JEENSt15__invoke_resultIT_JDpT0_EE4typeEOS3_DpOS4_ invoke.h:95
std::thread::_Invoker<std::tuple<GCTest_gcTest_Test::TestBody()::$_0> >::_M_invoke<0ul> thread:234
std::thread::_Invoker<std::tuple<GCTest_gcTest_Test::TestBody()::$_0> >::operator()() thread:243
std::thread::_State_impl<std::thread::_Invoker<std::tuple<GCTest_gcTest_Test::TestBody()::$_0> > >::_M_run() thread:186
<unknown> 0x00007fd4b384266f
__tsan_thread_start_func 0x000000000042a7de
start_thread 0x00007fd4b3b156db
clone 0x00007fd4b2af388f

thread-4
futex_wait_cancelable 0x00007fd4b3b1b9f3
__pthread_cond_wait_common 0x00007fd4b3b1b9f3
__pthread_cond_wait 0x00007fd4b3b1b9f3
__tsan::call_pthread_cancel_with_cleanup(int (*)(void*, void*, void*), void*, void*, void*, void (*)(void*), void*) 0x000000000049b148
cond_wait(__tsan::ThreadState*, unsigned long, __tsan::ScopedInterceptor*, int (*)(void*, void*, void*), void*, void*, void*) [clone .constprop.134] 0x0000000000444c1c
pthread_cond_wait 0x0000000000444e89
std::condition_variable::wait(std::unique_lock<std::mutex>&) 0x00007fd4b383c86c
mygc::OldGeneration::scavenge OldGeneration.cpp:96
std::__invoke_impl<void, void (mygc::OldGeneration::*)(), mygc::OldGeneration*> invoke.h:73
std::__invoke<void (mygc::OldGeneration::*)(), mygc::OldGeneration*> invoke.h:95
std::thread::_Invoker<std::tuple<void (mygc::OldGeneration::*)(), mygc::OldGeneration*> >::_M_invoke<0ul, 1ul> thread:234
std::thread::_Invoker<std::tuple<void (mygc::OldGeneration::*)(), mygc::OldGeneration*> >::operator() thread:243
std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (mygc::OldGeneration::*)(), mygc::OldGeneration*> > >::_M_run thread:186
<unknown> 0x00007fd4b384266f
__tsan_thread_start_func 0x000000000042a7de
start_thread 0x00007fd4b3b156db
clone 0x00007fd4b2af388f

thread-5
futex_wait_cancelable 0x00007fd4b3b1b9f3
__pthread_cond_wait_common 0x00007fd4b3b1b9f3
__pthread_cond_wait 0x00007fd4b3b1b9f3
__tsan::call_pthread_cancel_with_cleanup(int (*)(void*, void*, void*), void*, void*, void*, void (*)(void*), void*) 0x000000000049b148
cond_wait(__tsan::ThreadState*, unsigned long, __tsan::ScopedInterceptor*, int (*)(void*, void*, void*), void*, void*, void*) [clone .constprop.134] 0x0000000000444c1c
pthread_cond_wait 0x0000000000444e89
std::condition_variable::wait(std::unique_lock<std::mutex>&) 0x00007fd4b383c86c
mygc::LargeObjects::scavenge LargeObjects.cpp:23
std::__invoke_impl<void, void (mygc::LargeObjects::*)(), mygc::LargeObjects*> invoke.h:73
std::__invoke<void (mygc::LargeObjects::*)(), mygc::LargeObjects*> invoke.h:95
std::thread::_Invoker<std::tuple<void (mygc::LargeObjects::*)(), mygc::LargeObjects*> >::_M_invoke<0ul, 1ul> thread:234
std::thread::_Invoker<std::tuple<void (mygc::LargeObjects::*)(), mygc::LargeObjects*> >::operator() thread:243
std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (mygc::LargeObjects::*)(), mygc::LargeObjects*> > >::_M_run thread:186
<unknown> 0x00007fd4b384266f
__tsan_thread_start_func 0x000000000042a7de
start_thread 0x00007fd4b3b156db
clone 0x00007fd4b2af388f

thread-6
futex_wait_cancelable 0x00007fd4b3b1b9f3
__pthread_cond_wait_common 0x00007fd4b3b1b9f3
__pthread_cond_wait 0x00007fd4b3b1b9f3
__tsan::call_pthread_cancel_with_cleanup(int (*)(void*, void*, void*), void*, void*, void*, void (*)(void*), void*) 0x000000000049b148
cond_wait(__tsan::ThreadState*, unsigned long, __tsan::ScopedInterceptor*, int (*)(void*, void*, void*), void*, void*, void*) [clone .constprop.134] 0x0000000000444c1c
pthread_cond_wait 0x0000000000444e89
std::condition_variable::wait(std::unique_lock<std::mutex>&) 0x00007fd4b383c86c
mygc::YoungGenerationPool::scavenge YoungGenerationPool.cpp:14
std::__invoke_impl<void, void (mygc::YoungGenerationPool::*)(), mygc::YoungGenerationPool*> invoke.h:73
std::__invoke<void (mygc::YoungGenerationPool::*)(), mygc::YoungGenerationPool*> invoke.h:95
std::thread::_Invoker<std::tuple<void (mygc::YoungGenerationPool::*)(), mygc::YoungGenerationPool*> >::_M_invoke<0ul, 1ul> thread:234
std::thread::_Invoker<std::tuple<void (mygc::YoungGenerationPool::*)(), mygc::YoungGenerationPool*> >::operator() thread:243
std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (mygc::YoungGenerationPool::*)(), mygc::YoungGenerationPool*> > >::_M_run thread:186
<unknown> 0x00007fd4b384266f
__tsan_thread_start_func 0x000000000042a7de
start_thread 0x00007fd4b3b156db
clone 0x00007fd4b2af388f

I think you can ignore thread 4,5,6, these threads are supposed to wait.

Thread 1 is the thread which should receive a signal but not, when I pause the program, it waits at std::lock_guard<std::mutex> guard(mGcMutex);, here's the code:

bool mygc::GarbageCollector::isCompletedDescriptor(size_t typeId) {
  std::lock_guard<std::mutex> guard(mGcMutex);
  return getTypeById(typeId)->isCompleted();
}

thread 3 is the thread who send signals, here's the code:

void stop_the_world(const std::set<pthread_t> &threads) {
  auto self = pthread_self();
  for (auto thread : threads) {
    if (thread != self) {
      std::unique_lock<std::mutex> lk(sThreadBlocker);
      gTotalThreads++;
      lk.unlock();
      if (pthread_kill(thread, MYGC_STOP_SIGNAL) < 0) {
        perror("sigaction");
        exit(-1);
      } else {
//        GCLOG("send stop signal to tid: %u", thread);
      }
    }
  }
  {
    std::unique_lock<std::mutex> lk(sThreadBlocker);
    if (sAcknowledgeThreads != gTotalThreads) {
//      GCLOG("waiting for threads to stop");
      sAcknowledgeCondition.wait(lk);
    } else {
//      GCLOG("all thread is already stopped");
    }
//    GCLOG("stop successfully");
  }
}

it send for signals then wait at a conditional variable. Here's the handler:

void stop_handler(int signal) {
//  GCLOG("tid: %u, caught signal: %u", pthread_self(), signal);
  std::unique_lock<std::mutex> lk(sThreadBlocker);
  sAcknowledgeThreads++;
  if (sAcknowledgeThreads == gTotalThreads) {
    sAcknowledgeCondition.notify_all();
//    GCLOG("notify all thread is stopped");
  }
  sBlockerCondition.wait(lk);
//  GCLOG("handler resumed");
}

which will notify the conditional variable and let thread 3 to continue, in this case, the handler was not triggered.

here's my cmake flages:

set (CMAKE_CXX_FLAGS_DEBUG "${CMAKE_CXX_FLAGS_DEBUG} -fno-omit-frame-pointer -fsanitize=thread")
set (CMAKE_LINKER_FLAGS_DEBUG "${CMAKE_LINKER_FLAGS_DEBUG} -fno-omit-frame-pointer -fsanitize=thread")

again, with tsan disabled, the signal can arrive without problem. Hope this will help.

dvyukov commented 4 years ago

Thanks for the detailed info. This should probably help:

diff --git a/compiler-rt/lib/sanitizer_common/sanitizer_common_interceptors.inc b/compiler-rt/lib/sanitizer_common/sanitizer_common_interceptors.inc
index af301a1a689..b8437ecf27c 100644
--- a/compiler-rt/lib/sanitizer_common/sanitizer_common_interceptors.inc
+++ b/compiler-rt/lib/sanitizer_common/sanitizer_common_interceptors.inc
@@ -4147,7 +4147,7 @@ INTERCEPTOR(int, pthread_mutex_lock, void *m) {
   void *ctx;
   COMMON_INTERCEPTOR_ENTER(ctx, pthread_mutex_lock, m);
   COMMON_INTERCEPTOR_MUTEX_PRE_LOCK(ctx, m);
-  int res = REAL(pthread_mutex_lock)(m);
+  int res = COMMON_INTERCEPTOR_BLOCK_REAL(pthread_mutex_lock)(m);
   if (res == errno_EOWNERDEAD)
     COMMON_INTERCEPTOR_MUTEX_REPAIR(ctx, m);
   if (res == 0 || res == errno_EOWNERDEAD)

But if it's the root cause, then the test that reproduces this should really be as simple as trying to receive a signal inside of pthread_mutex_lock. So it's strange you couldn't reproduce it in a unit test.

We also need to double check that pthread_mutex_lock can't call back any user code (that's prohibited inside of blocking regions). Is it a cancellation point?

psionic12 commented 4 years ago

Hi @dvyukov, thanks for your advise, I'll change conditional variable to signal first, since compile llvm+clang is a little difficult (I used pre-built llvm-project libraries and compilers).

And about

Is it a cancellation point?

I'm not sure you are talking to me, as far as I know the only cancellation point which is related to posix thread is pthread_cond_wait() and this is a little out of my field.

dvyukov commented 4 years ago

I'll change conditional variable to signal first

Yes, tsan will check for pending singals on exit from pthread_mutex_lock, so it should help... I think... there may be some race conditions... not sure. But at the very least signal after sending the signal (not first).

I'm not sure you are talking to me

:) just in general, if we are to make this change we need to be sure that it does not call user code. If it's not a cancellation point, it makes things simpler.

vlovich commented 3 years ago

Hmm... I'm seeing this issue but we don't use pthread mutexes. In my test I register SIGRTMIN + 2 (SIG36) and then signal it via the SYS_rt_tgsigqueueinfo syscall (to transfer some contextual information). I'm seeing that the registered signal handler never runs even though I do see the signal is delivered when I run it under rr, which leads me to believe something in TSAN is swallowing the signal registration.

The SIG36 signal happens within the context of another signal handler which might be the problem, although the fact that SIG36 is delivered leads me to believe it's the sigaction that's somehow the problem.

dvyukov commented 3 years ago

Yes, tsan intercepts sigaction and substitutes the handler to it's own handler.

oushu1fengsihan1 commented 2 years ago

Yes, tsan intercepts sigaction and substitutes the handler to it's own handler.

I register a SIGQUIT handler to exit process immediately but it doesn't work even if the signal is actually passed to the process.

image

With tsan disabled, the handler will be called properly, so I guess this problem may be caused by tsan. But I didn't find anything about signal handler in tsan manual.

Is there any way to keep tsan from intercepting sigaction?

dvyukov commented 2 years ago

Is there any way to keep tsan from intercepting sigaction?

No, there is no. And if we disable sigaction, the program will fail with much worse failure modes as even function calls and memory accesses are not async-signal-safe under tsan. Tsan tries to deliver all signals eventually. But that mechanism may need fixing/improvement for particular programs.

oushu1fengsihan1 commented 2 years ago

Got it. Thx. @dvyukov