odyaka341 / thread-sanitizer

Automatically exported from code.google.com/p/thread-sanitizer
0 stars 0 forks source link

Live lock in __tsan_thread_start_func() #79

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
Application instrumented by TSAN hangs while starting. I have two threads (7220 
and 7234) looping in __tsan_thread_start_func(), precisely in
while ((tid = atomic_load(&p->tid, memory_order_acquire)) == 0)
  pthread_yield();
Seen by perf as
   23.47 :        3f610:       callq  355e0 <pthread_yield@plt>
   61.22 :        3f615:       mov    0x10(%r15),%rbx
   15.31 :        3f619:       test   %ebx,%ebx
    0.00 :        3f61b:       je     3f610 <__tsan_thread_start_func+0x40>

Another threads are sleeping, see attached stack traces.   

We use Linux 3.8.13 and clang 3.4.

Original issue reported on code.google.com by e...@tbricks.com on 1 Oct 2014 at 3:32

Attachments:

GoogleCodeExporter commented 9 years ago
The issue is in this thread:

 00007f1f230f0a34  __sanitizer::internal_read(int, void*, unsigned long) + 0x14
 00007f1f230ed234  __tsan::SymbolizeCodeAddr2Line(unsigned long) + 0xf4
 00007f1f230eb87f  __tsan::SymbolizeCode(unsigned long) + 0x32f
 00007f1f230e75ee  __tsan::SymbolizeStack(__tsan::StackTrace const&) + 0x4e
 00007f1f230e7e7a  __tsan::ScopedReport::AddThread(__tsan::ThreadContext const*) + 0x1ba
 00007f1f230e9503  __tsan::ReportRace(__tsan::ThreadState*) + 0x723
 00007f1f230ed67e  ?? (eiva_tip)
 00007f1f230e53ad  __tsan::MemoryAccessImpl(__tsan::ThreadState*, unsigned long, int, bool, bool, unsigned long long*, __tsan::Shadow) + 0x61d
 00007f1f230eacb1  __tsan::MemoryAccessRange(__tsan::ThreadState*, unsigned long, unsigned long, unsigned long, bool) + 0x1d1
 00007f1f230e0ff0  __tsan::MemoryRangeFreed(__tsan::ThreadState*, unsigned long, unsigned long, unsigned long) + 0x50
 00007f1f230db572  __tsan::user_free(__tsan::ThreadState*, unsigned long, void*) + 0x1c2
 00007f1f2308e647  __interceptor_free + 0xe7
 00007f1f1c72312a  witem_free + 0x8a
 00007f1f1c7257bf  worker_main + 0x12f
 00007f1f23092665  __tsan_thread_start_func + 0x95
 00007f1f23047127  tb_start_thread + 0x97
 00007f1f1c0ce9d1  mmap@@GLIBC_2.2.5 + 0x79d1
 00007f1f1b471b6d  __tls_get_addr@@GLIBC_2.3 + 0xe8b6d

It tries to report a data race, but hangs on read syscall.

What does it do? Does it just block in the read infinitely? Or loop? Why can't 
in read from addr2line?

Original comment by dvyu...@google.com on 1 Oct 2014 at 4:17

GoogleCodeExporter commented 9 years ago
The thread is in S state, so it is blocked. We've noticed, that libpthread
is not relocatable, but it has been loaded with different address, than
expected — it is under investigation of my colleague.

Original comment by e...@tbricks.com on 2 Oct 2014 at 1:08

GoogleCodeExporter commented 9 years ago
What happens if you run addr2line on the binary manually and symbolize some pc?

Original comment by dvyu...@google.com on 2 Oct 2014 at 1:15

GoogleCodeExporter commented 9 years ago
Reproduced it again with another binary. According to perf thread which
calls internal_read() is not executing, it is always in S state. It tries
to symbolize address 0x7fe6d1239d1c from libc (__libc_start_main+252),
which is:
   0x00007fe6d1239d19 <+249>:   callq  *0x18(%rsp)
   0x00007fe6d1239d1d <+253>:   mov    %eax,%edi

Which is strange. addr2line utility fails to recognize this address as
expected.

Original comment by e...@tbricks.com on 2 Oct 2014 at 3:35

GoogleCodeExporter commented 9 years ago
There seems to be something wrong with your addr2line or the binary.
Unfortunately we don't have a debugging mode that dumps all communication with 
addr2line.
The simplest option is probably to write an addr2line proxy program that dumps 
all communication, and add it to path. Then you will be able to see what tsan 
asks addr2line, what addr2line answers and when it hangs.

>addr2line utility fails to recognize this address as expected.
The binary is compiled with -fPIE, so it's loaded at high 0x7f addresses. 
0x00007fe6d1239d19 is not a real addresses in the binary, it's an address plus 
executable_load_address (which varies from run to run). If you find the same 
piece of code in objdump, then it will tell you the real address in binary.

Original comment by dvyu...@google.com on 2 Oct 2014 at 4:42

GoogleCodeExporter commented 9 years ago
Another option is to run the program with TSAN_OPTIONS="symbolize=0" 
environment variable. It should give you your race reports (w/o symbolization, 
though), and then you can try to manually invoke addr2line with the PC's from 
report (and see if it hangs or not).

Original comment by dvyu...@google.com on 3 Oct 2014 at 1:05

GoogleCodeExporter commented 9 years ago
symbolize=0 doesn't work (still symbolizing), also I don't see it in
http://code.google.com/p/thread-sanitizer/wiki/Flags
Instead I use my dummy addr2line, which only prints "ABRACADABRA\n":

#!/usr/bin/perl -w
use strict;
use IO::Handle;
open(my $fh, '>>', '/home/eiva/addr2line.log') or die;
$fh->autoflush(1);
while (<STDIN>) {
    my $input = $_;
    print $fh "request: $input";
    my $output = "ABRACADABRA\n";
    print "$output";
    print $fh "result: $output";
}
close($fh);

The output is:
request: 0x000000002129
result: ABRACADABRA

Now the program is hanging in internal_read() and addrstr from
SymbolizeCodeAddr2Line() is "0x000000002129\n".

Original comment by e...@tbricks.com on 3 Oct 2014 at 1:30

GoogleCodeExporter commented 9 years ago
You need to output 2 lines:
funcname
file:123
and maybe one more trailing \n, not sure

Original comment by dvyu...@google.com on 3 Oct 2014 at 1:48

GoogleCodeExporter commented 9 years ago
Doesn't help (as well as using proxy addr2line cmd):
request: 0x000000002129
result: witem_free
/tb/builds/thd/sbn/2.6/src/thirdparty/libpthread_workqueue/251/src/src/witem_cac
he.c:64
Again hanging on 0x000000002129.

I tried to check reads/writes with strace, but everything is working just
fine when started under strace. Either it is really some race or something
really weird.

Original comment by e...@tbricks.com on 3 Oct 2014 at 4:18

GoogleCodeExporter commented 9 years ago
Does AddressSanitizer work?
Can you try tip tsan?
Build instructions are here:
https://code.google.com/p/address-sanitizer/wiki/HowToBuild
or you can simply download a week-old clang from:
https://code.google.com/p/address-sanitizer/wiki/HowToBuild#Chrome

Original comment by dvyu...@google.com on 4 Oct 2014 at 6:42

GoogleCodeExporter commented 9 years ago
Works smoothly with clang 3.4.

What do you mean?

Original comment by e...@tbricks.com on 6 Oct 2014 at 6:53

GoogleCodeExporter commented 9 years ago
>What do you mean?
I mean that maybe tsan tip works.

Original comment by dvyu...@google.com on 6 Oct 2014 at 7:36

GoogleCodeExporter commented 9 years ago
If you run:
addr2line -f -i -e /your/binary 0x000000002129
what does it output?

Original comment by dvyu...@google.com on 6 Oct 2014 at 7:37

GoogleCodeExporter commented 9 years ago
I succeeded to use strace and I think I've identified the issue, though the
reason in still not clear.

Main thread 11169 which execs my binary creates a pipe for communication
between symbolizer and other threads:

1412592419.414587 pipe([3, 4])          = 0
1412592419.414660 pipe([5, 6])          = 0
1412592419.414784 clone(child_stack=0,
flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
child_tidptr=0x7f838a66a090) = 10183
1412592419.422791 close(5) = 0
1412592419.422859 close(4)              = 0
1412592419.422934 nanosleep({0, 10000000}, NULL) = 0
1412592419.433139 wait4(10183, 0x7fff69a3d22c, WNOHANG, NULL) = 0

10183 is symbolizer:

1412592419.422743 set_robust_list(0x7f838a66a0a0, 0x18)              = 0
1412592419.422880 close(1) = 0
1412592419.422962 close(0)              = 0
1412592419.423020 dup2(5, 0)            = 0
1412592419.423079 dup2(4, 1)            = 1
1412592419.423148 close(5)              = 0
1412592419.423215 close(6)              = 0
1412592419.423284 close(3)              = 0
1412592419.423352 close(4)              = 0
...
1412592420.635489 execve("/usr/bin/llvm-symbolizer",
["/usr/bin/llvm-symbolizer", "--default-arch=x86_64"], [/* 36 vars */])
    = 0
...
1412592424.704655 read(0,
"\"/home/eiva/git/tb/build.x86_64-unknown-linux/platform/bin64/eiva_se\"
0x268fc\n", 4096) = 78
1412592424.704986 write(1, "__interceptor_calloc\n??:0:0\n\n", 29) = 29
1412592424.705089 read(0, 0x7fbe8531f000, 4096) = ? ERESTARTSYS (To be
restarted)
1412592431.156982 --- SIGINT (Interrupt) @ 0 (0) ---

Now main thread clones() other threads which communicate with symbolizer
via pipe, e.g. the following corresponds to first read-write pair of
symbolizer above:

1412592424.704820 write(6,
"\"/home/eiva/git/tb/build.x86_64-unknown-linux/platform/bin64/eiva_se\"
0x268fc\n", 78) = 78
1412592424.704954 read(3, "__interceptor_calloc\n??:0:0\n\n", 16384) = 29
1412592424.705120 munmap(0x7f838a099000, 768)         = 0
1412592424.705200 gettimeofday({1412592424, 705215}, NULL) = 0
1412592424.705326 write(2, "==================\n", 19) = 19

One of the threads (10191) hangs with following:

1412592431.157482 write(6,
"\"/home/eiva/git/tb/build.x86_64-unknown-linux/platform/bin64/eiva_se\"
0x2ac94\n", 78) = 78
1412592431.157605 read(3, "", 16384) = 0

Interrupted, now it will try to clone another symbolizer, but it will be
never created.

1412592431.161087 getpid()              = 10182
1412592431.161157 write(2, "==10182==WARNING: Can't read from symbolizer at
fd 3\n", 53) = 53
1412592431.161245 close(3)              = 0
1412592431.161315 close(6)              = 0
1412592431.161377 stat("/usr/bin/llvm-symbolizer", {st_mode=S_IFREG|0755,
st_size=586264, ...}) = 0
1412592431.161489 pipe([3, 6])          = 0
1412592431.161557 pipe([54, 55])        = 0
1412592431.161633 clone(child_stack=0,
flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
child_tidptr=0x7f838a138090) = 10206
1412592431.278568 close(54)             = 0
1412592431.278631 close(6)              = 0
1412592431.278701 nanosleep({0, 10000000}, NULL) = 0
1412592431.288895 wait4(10206, 0x7f838a0ca12c, WNOHANG, NULL) = 0
1412592431.288970 write(55,
"\"/home/eiva/git/tb/build.x86_64-unknown-linux/platform/bin64/eiva_se\"
0x2ac94\n", 78) = 78
1412592431.289066 read(3,

New symbolizer thread hangs too:

1412592431.278454 set_robust_list(0x7f838a1380a0, 0x18) = 0
1412592431.278836 futex(0x7f8392e86f20, FUTEX_WAIT, 2, NULL

So the issue is in deadlock between symbolizer thread and its client:
symbolizer is blocked on reading from pipe() and client waits the data from
symbolizer on another pipe. We see that client has written data, but
nevertheless symbolizer is blocked. According to manual read() should read
from pipe min(nbyte, cur_available), so it looks strange and requires
further debugging.

It is:
??
??:0

Original comment by e...@tbricks.com on 6 Oct 2014 at 1:14

GoogleCodeExporter commented 9 years ago
+Alexey, had we made any fixes to llvm-symbolizer since 3.4?

Original comment by dvyu...@google.com on 6 Oct 2014 at 3:14

GoogleCodeExporter commented 9 years ago
I would suggest trying with tip clang first, there well can be some bugs fixed. 
Build instructions are in comment #10.

Original comment by dvyu...@google.com on 24 Oct 2014 at 12:28

GoogleCodeExporter commented 9 years ago
What is tip?
Currently I switched to llvm-3.5 and the issue seems to be fixed. I'll try to 
reproduce it in same environment soon.

Original comment by e...@tbricks.com on 24 Oct 2014 at 12:42

GoogleCodeExporter commented 9 years ago
By tip I mean HEAD or whatever you call the latest revision.
Glad to hear that it works in 3.5

Original comment by dvyu...@google.com on 24 Oct 2014 at 12:50

GoogleCodeExporter commented 9 years ago
Seems to work just fine in 3.5.0. Seems to be related to llvm-symbolizer: when 
I build with 3.5.0 and use symbolizer still have this issue.
Thanks a lot for the help!

Original comment by e...@tbricks.com on 4 Nov 2014 at 2:22