google / sanitizers

AddressSanitizer, ThreadSanitizer, MemorySanitizer
Other
11.55k stars 1.04k forks source link

called_from_lib could be more useful... #949

Open WallStProg opened 6 years ago

WallStProg commented 6 years ago

It appears that the called_from_lib suppression only triggers if the named library directly calls an intercepted function.

In some cases, it would be very helpful if it ignored certain intevening libs (e.g., system libs) in the stack trace.

For instance:

WARNING: ThreadSanitizer: data race (pid=4118)
  Read of size 1 at 0x7b0800006558 by thread T4:
    #0 memcmp /shared/buildtest/clang/trunk/llvm/projects/compiler-rt/lib/tsan/../sanitizer_common/sanitizer_common_interceptors.inc:822 (mx_snapshot+0x458cc2)
    #1 std::char_traits<char>::compare(char const*, char const*, unsigned long) /shared/buildtest/gcc/5.3.0/gcc-5.3.0/build/x86_64-unknown-linux-gnu/libstdc++-v3/include/bits/char_traits.h:262 (libstdc++.so.6+0xd0da9)
    #2 std::string::compare(char const*) const /shared/buildtest/gcc/5.3.0/gcc-5.3.0/build/x86_64-unknown-linux-gnu/libstdc++-v3/include/bits/basic_string.tcc:1400 (libstdc++.so.6+0xd0da9)
    #3 zmq::session_base_t::start_connecting(bool) /home/btorpey/work/libzmq/4.2.3/src/session_base.cpp:534:24 (libzmq.so.5.1.3+0xf29b4)
    #4 zmq::session_base_t::process_plug() /home/btorpey/work/libzmq/4.2.3/src/session_base.cpp:315:9 (libzmq.so.5.1.3+0xf283f)
    #5 zmq::object_t::process_command(zmq::command_t&) /home/btorpey/work/libzmq/4.2.3/src/object.cpp:90:9 (libzmq.so.5.1.3+0xc7e20)
    #6 zmq::io_thread_t::in_event() /home/btorpey/work/libzmq/4.2.3/src/io_thread.cpp:85:30 (libzmq.so.5.1.3+0xb21ce)
    #7 non-virtual thunk to zmq::io_thread_t::in_event() /home/btorpey/work/libzmq/4.2.3/src/io_thread.cpp (libzmq.so.5.1.3+0xb22cc)
    #8 zmq::epoll_t::loop() /home/btorpey/work/libzmq/4.2.3/src/epoll.cpp:188:29 (libzmq.so.5.1.3+0xae624)
    #9 zmq::epoll_t::worker_routine(void*) /home/btorpey/work/libzmq/4.2.3/src/epoll.cpp:203:24 (libzmq.so.5.1.3+0xae328)
    #10 thread_routine(void*) /home/btorpey/work/libzmq/4.2.3/src/thread.cpp:106:9 (libzmq.so.5.1.3+0x119993)

Although the "called_from_lib:libzmq.so" suppression filters direct calls to e.g. memcmp from libzmq.so, in this case libstdc++ is in the way, and so the suppression doesn't trigger.

dvyukov commented 6 years ago

Normal race suppressions can do this. called_from_lib is meant specifically as more limited, but much faster alternative to general suppressions for limited cases where it is applicable. called_from_lib doesn't unwind stack on every memcmp and does not know about upper frames, this accounts for both its speed and its limitations.

So I would suggest to add a normal suppression for remaining cases on top of the called_from_lib.

WallStProg commented 6 years ago

Thanks, Dmitry.

Well, the problem with using normal suppressions in this case is that I have a lot of them.

So, I tried setting ignore_noninstrumented_modules=1 which actually does the trick, but with one odd side-effect.

It correctly ignores calls to interceptors from uninstrumented code (e.g., libstdc++), but it also misses some races that it should pick up, where the call is coming directly from an instrumented module.

The one thing that all those missing reports have in common is a block similar to the following:

  As if synchronized via sleep:
    #0 nanosleep /shared/buildtest/clang/trunk/llvm/projects/compiler-rt/lib/tsan/rtl/tsan_interceptors.cc:355 (tpsdaemon+0x45658a)
    #1 Transact::MillisecondSleep(int) /home/btorpey/work/transact/4.0.0/src/api/src/MXMisc.cpp:101:12 (libTransact.so.4+0x22841)
    #2 waitForShutdown() /home/btorpey/work/transact/4.0.0/src/server/txnpd/src/txnd.cpp:271:7 (tpsdaemon+0x51f18a)
    #3 main /home/btorpey/work/transact/4.0.0/src/server/txnpd/src/txnd.cpp:107:4 (tpsdaemon+0x51e764)

OTOH, there are also plenty of reported races with similar "As if synchronized..." blocks, so the missing reports are not specific to that.

If you have any ideas what may be causing the missing reports, I'd be glad to know -- thanks!

WallStProg commented 6 years ago

Hmmm ... I think I need to take back the "does the trick" comment -- it looks like setting ignore_noninstrumented_modules=1 does considerably more than I was expecting/hoping.

With that setting, I'm missing quite a few legitimate warnings (in fact, I'm missing all of them in some cases). I've verified that the immediate parent library is built w/TSAN:

nm /path/to/libxyz.so | grep tsan

I've read https://reviews.llvm.org/D28264 and https://reviews.llvm.org/D29041, so I thought I understood it, but perhaps not? Or is there possibly a bug?

FWIW I'm running r331301 on CentOS 6.9, using gcc 5.3.0 to build clang. I've run a bunch of A/B tests changing only that setting and compared output.

dvyukov commented 6 years ago

I never used ignore_noninstrumented_modules, so +@kubamracek

With that setting, I'm missing quite a few legitimate warnings (in fact, I'm missing all of them in some cases).

Does these legitimate warnings involve non-instrumented libraries? Or you are not seeing any race reports at all?

dvyukov commented 6 years ago

Well, the problem with using normal suppressions in this case is that I have a lot of them.

Suppress race:zmq?

WallStProg commented 6 years ago

With that setting, I'm missing quite a few legitimate warnings (in fact, I'm missing all of them in some cases).

Does these legitimate warnings involve non-instrumented libraries? Or you are not seeing any race reports at all?

The missing reports are for instrumented libs. I did an A/B test changing only ignore_noninstrumented_modules (identical binaries). With ignore_noninstrumented_modules=1 I'm missing reports from non-instrumented code (which is intended/expected), but also many reports from instrumented libs which were reported with ignore_noninstrumented_modules=0.

Some examples of missing reports include:

WARNING: ThreadSanitizer: lock-order-inversion (potential deadlock) (pid=14344)
  Cycle in lock order graph: M273 (0x7b0c00006b40) => M274 (0x7b0c00006b70) => M273

  Mutex M274 acquired here while holding mutex M273 in thread T8:
    #0 pthread_mutex_lock /shared/buildtest/clang/trunk/llvm/projects/compiler-rt/lib/tsan/../sanitizer_common/sanitizer_common_interceptors.inc:4027 (tpsdaemon+0x444c77)
    #1 wlock_lock /shared/work/OpenMAMA/6.2.1/common/c_cpp/src/c/wlock.c:66:12 (libmama.so+0xe66c0)
    #2 mamaEnvSession_destroyTimer /home/btorpey/work/transact/4.0.0/src/common/Middleware/MamaAdapter/mme/mamaEnvSession.c:549:5 (libmme.so+0x4acc)
    #3 mamaEnv_destroyTimer /home/btorpey/work/transact/4.0.0/src/common/Middleware/MamaAdapter/mme/mamaEnvSession.c:295 (libmme.so+0x4acc)
    ...    

In above example I confirmed that libmama.so is instrumented:

$ nm /home/btorpey/install/OpenMAMA/6.2.1/tsan/lib/libmama.so | grep tsan
                 U __tsan_func_entry
                 U __tsan_func_exit
                 U __tsan_init
                 ...

Another example of a report that does not appear with ignore_noninstrumented_modules=1 is:

WARNING: ThreadSanitizer: data race (pid=14344)
  Read of size 4 at 0x7b3c0000034c by main thread:
    #0 Transact::StatusPubImpl::getState(Transact::SERVERSTATE&) const /home/btorpey/work/transact/4.0.0/src/common/Middleware/src/StatusPubImpl.cpp:387:12 (libTransact.so.4+0x4e0d9)
    #1 TransactX::TxnReceiver::updatePublishedStateIfChanged() /home/btorpey/work/transact/4.0.0/src/server/txnpd/src/TxnReceiver.cpp:5635:35 (tpsdaemon+0x503a53)
    #2 TransactX::TxnReceiver::init(char const*, char const*, TransactX::FAILOVERMODE, TransactX::DISPATCHMODE, int, char const*, char const*, bool, char const*, TransactX::STANDBYTYPE, int, bool, char const*, char const*, bool, int) /home/btorpey/work/transact/4.0.0/src/server/txnpd/src/TxnReceiver.cpp:828:10 (tpsdaemon+0x500831)
    #3 main /home/btorpey/work/transact/4.0.0/src/server/txnpd/src/txnd.cpp:89:19 (tpsdaemon+0x51e75b)

Similarly:

$ nm /home/btorpey/work/transact/4.0.0/lib/libTransact.so | grep tsan
                 U __tsan_atomic32_fetch_add
                 U __tsan_atomic8_load
                 U __tsan_func_entry
                 ...
kubamracek commented 6 years ago

With that setting, I'm missing quite a few legitimate warnings (in fact, I'm missing all of them in some cases).

Hm, it looks like that the detection for instrumented/non-instrumented modules is only implemented for macOS. Effectively, on Linux all modules (excepts the main executable) are treated as non-uninstrumented (which is a bug of course).

Side note: Watch out when using non-instrumented libstdc++, as you might miss important synchronization in std::shared_ptr and related classes. @dvyukov, is the recommendation to build your own instrumented version of libstdc++ to use TSan on C++ code?

WallStProg commented 6 years ago

Well, the problem with using normal suppressions in this case is that I have a lot of them.

Suppress race:zmq?

Unfortunately, that is too broad -- it risks suppressing real problems in my code just because zmq code is somewhere in the stack.

FWIW I actually have scripts to post-process TSAN output, which will (among other things) filter out reports that do not have "my" code in the stack. But even with that the number of spurious reports is impractical to deal with.

dvyukov commented 6 years ago

The missing reports are for instrumented libs.

Now looking at the code I think I see what happens: detection of instrumented libraries was just never implemented for Linux: https://reviews.llvm.org/D28263

Does adding all libraries explicitly to called_from_lib help? You can get full list from /proc/PID/maps.

dvyukov commented 6 years ago

Unfortunately, that is too broad -- it risks suppressing real problems in my code just because zmq code is somewhere in the stack.

Well, the goal was always to be good at detecting races, rather than at suppressing and ignoring them in complex ways. I am not sure it's even possible to fully achieve "we have tons of non-instrumented code and massive amounts of true races and we want to precisely spot-ignore all of that and get all races in the remaining code without any false positives nor false negatives"...

WallStProg commented 6 years ago

Side note: Watch out when using non-instrumented libstdc++, ...

I hear you, but at this point I have the opposite problem. First I need to reduce the number of spurious reports to a manageable level.

I do notice that the Darwin folks have the same concern.

Unfortunately libcxx is not a great solution for my case -- we have some 3rd-party libs (Oracle OCCI) that require e.g. old-style std::string, and for which we don't have source.

WallStProg commented 6 years ago

Well, the goal was always to be good at detecting races, rather than at suppressing and ignoring them in complex ways. I am not sure it's even possible to fully achieve "we have tons of non-instrumented code and massive amounts of true races and we want to precisely spot-ignore all of that and get all races in the remaining code without any false positives nor false negatives"...

Yup -- it's a problem, and one we've been working on for a while. Which is why we've developed scripts to post-process output from TSAN (also ASAN, LSAN and valgrind) to reduce spurious reports. It's not a perfect process -- we're just trying to get close enough to make it usable.

For now though, I'd be very happy if ignore_noninstrumented_modules=1 worked as intended. Perhaps it does on Darwin, but on Linux at least it appears to be broken.

WallStProg commented 6 years ago

Now looking at the code I think I see what happens: detection of instrumented libraries was just never implemented for Linux: https://reviews.llvm.org/D28263

Well, that explains things ;-)

Does adding all libraries explicitly to called_from_lib help?

I'll give that a try -- thanks!