Open tau0 opened 5 years ago
To me this looks very similar to https://github.com/google/sanitizers/issues/788, but I'm not sure. That is, we killed the binary while a data race report was in progress.
I think I see a potential problem in the stack trace.
In the fork interceptor tsan locks report_mtx:
TSAN_INTERCEPTOR(int, fork, int fake) {
  if (in_symbolizer())
    return REAL(fork)(fake);
  SCOPED_INTERCEPTOR_RAW(fork, fake);
  ForkBefore(thr, pc);
  ...
void ForkBefore(ThreadState *thr, uptr pc) {
  ctx->thread_registry->Lock();
  ctx->report_mtx.Lock();
}
It was assumed that fork does not call any instrumented user code.
But in this case fork calls folly::detail::(anonymous namespace)::AtForkList::child(), which is instrumented and triggers a race report. The report then tries to lock report_mtx again, which deadlocks.
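To make the suspected sequence concrete, here is a minimal single-threaded sketch of it. Everything in it is a hypothetical stand-in: `SimpleMutex`, `Fork()`, `ReportRace()` and `atfork_callbacks` are illustration names, not tsan's real classes or functions, and no real fork happens. It uses a try-lock so that the re-acquisition is recorded as a failure instead of actually hanging.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <vector>

namespace fork_deadlock_sketch {

// Hypothetical non-recursive lock; NOT tsan's real Mutex class.
class SimpleMutex {
 public:
  // Returns true if the lock was free and is now held by the caller.
  bool TryLock() { return !locked_.exchange(true, std::memory_order_acquire); }
  void Unlock() { locked_.store(false, std::memory_order_release); }

 private:
  std::atomic<bool> locked_{false};
};

SimpleMutex report_mtx;                               // stand-in for ctx->report_mtx
std::vector<std::function<void()>> atfork_callbacks;  // stand-in for user atfork handlers
bool report_would_block = false;                      // set when a report cannot take the lock

// Stand-in for the race report path: it needs report_mtx.
void ReportRace() {
  if (!report_mtx.TryLock()) {
    // A blocking Lock() here would never return: the same thread holds the lock.
    report_would_block = true;
    return;
  }
  report_mtx.Unlock();
}

// Stand-in for the fork interceptor: lock first, then run user callbacks.
void Fork() {
  bool locked = report_mtx.TryLock();  // plays the role of ForkBefore()
  assert(locked);                      // the sketch assumes the lock starts free
  (void)locked;
  for (auto &callback : atfork_callbacks) callback();  // instrumented user code runs here
  report_mtx.Unlock();
}

}  // namespace fork_deadlock_sketch
```

Registering a callback that calls `ReportRace()` and then calling `Fork()` leaves `report_would_block` set, which corresponds to the point where a blocking lock would hang.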
There are several potential solutions. We may try to intercept pthread_atfork and run the callbacks ourselves. Or we may register our own callback and try to make the tsan wrappers around fork the innermost ones. Or we may try to set after_multithreaded_fork and other ignores earlier, so that we don't try to report the race in the folly callbacks. I am not sure which one is better.
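The last option (setting ignores before the lock is taken) could look roughly like the sketch below. This is only an illustration under assumptions: `ignore_reports` is a hypothetical thread-local flag standing in for whatever ignore state the runtime would set, and `Fork()` and `ReportRace()` are made-up names, not tsan code.

```cpp
#include <atomic>
#include <cassert>

namespace fork_ignore_sketch {

std::atomic<bool> report_mtx_locked{false};  // stand-in for ctx->report_mtx
thread_local bool ignore_reports = false;    // hypothetical "ignore" flag
int reports_attempted = 0;                   // reports that would need the lock
int reports_suppressed = 0;                  // reports dropped because of the flag

// Stand-in for the race report path.
void ReportRace() {
  if (ignore_reports) {
    ++reports_suppressed;  // bail out before touching the report lock
    return;
  }
  ++reports_attempted;  // real code would lock the report mutex here
}

// Stand-in for the fork interceptor with the ignore set first.
void Fork(void (*user_callback)()) {
  ignore_reports = true;  // set BEFORE the lock is taken
  bool was_locked = report_mtx_locked.exchange(true);
  assert(!was_locked);  // the sketch assumes the lock starts free
  (void)was_locked;
  user_callback();  // instrumented atfork callback runs with the lock held
  report_mtx_locked.store(false);
  ignore_reports = false;
}

}  // namespace fork_ignore_sketch
```

With the flag set first, a callback that calls `ReportRace()` increments `reports_suppressed` and never reaches the lock. The trade-off in this sketch is that races inside fork callbacks are silently dropped.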
EDIT: Ignore this comment. Incorrectly read the code.
I am seeing this issue as well. @dvyukov do you have any ideas on how to proceed? Is there any data I could provide to help?
Maybe this https://github.com/llvm/llvm-project/commit/be41a98ac222f33ed5558d86e1cede67249e99b5 fixes it? Looks similar.
Some context: we have an alarm that kills the binary after 240s with std::exit(1), but the binary seems to get stuck during this kill.
It looks very suspicious that we were stuck trying to acquire a lock in ReportRace; maybe that is even the reason we hit the timeout. See the trace:
1) std::exit(1) after 240s
2) Wait for some time (~400s)
3) Get a stack trace and kill it with SIGKILL.
PS: Compiled with clang 8.0.