heapcheck crashes with illegal instruction (SIGILL)

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?

Unable to reproduce with a trivial test case, but reproduces on a large program 
of mine which is static-linked with tcmalloc trunk from SVN:

export HEAPCHECK=normal
while my-program ; do sleep 1 ; done
# wait a minute or two for it to run 5-10 times, will eventually fail.

I'm also static linking against boost threads, which may or may not be related.

What is the expected output? What do you see instead?

Heap check should run correctly. Instead, occasionally the program exits with 
SIGILL (Illegal Instruction) after my program finishes but before the heap dump 
reports.

What version of the product are you using? On what operating system?

svn trunk r190

Please provide any additional information below.

Some output from analyzing a core file:

[New LWP 3511]
warning: Can't read pathname for load map: Input/output error.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `./tablet-test'.
Program terminated with signal 4, Illegal instruction.
#0  _dl_x86_64_restore_sse () at ../sysdeps/x86_64/dl-trampoline.S:239
239     ../sysdeps/x86_64/dl-trampoline.S: No such file or directory.

(gdb) thread apply all bt

Thread 2 (LWP 3511):
#0  0x00000000005f3c0d in wait4 (r=0x0, o=1073741824, s=0x7fff5c0ac3dc, 
p=<optimized out>, this=<synthetic pointer>)
    at src/base/linux_syscall_support.h:1874
#1  waitpid (options=1073741824, status=0x7fff5c0ac3dc, pid=<optimized out>, 
this=<synthetic pointer>)
    at src/base/linux_syscall_support.h:1876
#2  ListAllProcessThreads (parameter=<optimized out>, callback=<optimized out>) 
at src/base/linuxthreads.cc:636
#3  0x00000000005e910d in HeapLeakChecker::IgnoreAllLiveObjectsLocked 
(self_stack_top=0x7fff5c0addd4) at src/heap-checker.cc:1303
#4  0x00000000005e94c3 in HeapLeakChecker::DoNoLeaks (this=0x1663840, 
should_symbolize=HeapLeakChecker::SYMBOLIZE)
    at src/heap-checker.cc:1754
#5  0x00000000005e9c14 in HeapLeakChecker::DoMainHeapCheck () at 
src/heap-checker.cc:2154
#6  0x00000000005e9d75 in HeapLeakChecker_AfterDestructors () at 
src/heap-checker.cc:2301
#7  0x00007f72f3996901 in __run_exit_handlers (status=0, listp=0x7f72f3d13688, 
run_list_atexit=true) at exit.c:78
#8  0x00007f72f3996985 in __GI_exit (status=<optimized out>) at exit.c:100
#9  0x00007f72f397c774 in __libc_start_main (main=0x4ad8c0 <main(int, char**)>, 
argc=1, ubp_av=0x7fff5c0adf88, init=<optimized out>, 
    fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fff5c0adf78) at libc-start.c:258
#10 0x00000000004afdf9 in _start ()

Thread 1 (LWP 3513):
#0  _dl_x86_64_restore_sse () at ../sysdeps/x86_64/dl-trampoline.S:239
#1  0x00007f72f47577e5 in _dl_fixup (l=<optimized out>, reloc_arg=<optimized 
out>) at ../elf/dl-runtime.c:127
#2  0x00007f72f475e235 in _dl_runtime_resolve () at 
../sysdeps/x86_64/dl-trampoline.S:42
#3  0x00000000005f3e69 in ListerThread (args=0x7fff5c0ac380) at 
src/base/linuxthreads.cc:263
#4  0x00000000005f37f2 in sys_clone (child_tidptr=0x0, newtls=0x0, 
parent_tidptr=0x0, 
    arg=<error reading variable: Cannot access memory at address 0xffffffffffffffe8>, flags=8390400, child_stack=0xffffffffffffefe8, 
    fn=0xffffffff) at ./src/base/linux_syscall_support.h:1150
#5  local_clone (fn=0xffffffff, fn@entry=<error reading variable: Cannot access 
memory at address 0x8>, 
    arg=<error reading variable: Cannot access memory at address 0xffffffffffffffe8>) at src/base/linuxthreads.cc:115

(gdb) disas
Dump of assembler code for function _dl_x86_64_restore_sse:
   0x00007f72f475e8c0 <+0>:     cmpl   $0x0,0x20e6f1(%rip)        # 0x7f72f496cfb8
   0x00007f72f475e8c7 <+7>:     js     0x7f72f475e91a <_dl_x86_64_restore_sse+90>
=> 0x00007f72f475e8c9 <+9>:     vmovdqa %fs:0x80,%ymm0
   0x00007f72f475e8d3 <+19>:    vmovdqa %fs:0xa0,%ymm1
   0x00007f72f475e8dd <+29>:    vmovdqa %fs:0xc0,%ymm2
   0x00007f72f475e8e7 <+39>:    vmovdqa %fs:0xe0,%ymm3
   0x00007f72f475e8f1 <+49>:    vmovdqa %fs:0x100,%ymm4
   0x00007f72f475e8fb <+59>:    vmovdqa %fs:0x120,%ymm5
   0x00007f72f475e905 <+69>:    vmovdqa %fs:0x140,%ymm6
   0x00007f72f475e90f <+79>:    vmovdqa %fs:0x160,%ymm7
   0x00007f72f475e919 <+89>:    retq   

Searching for this issue elsewhere led me to this RHEL bug: 
https://bugzilla.redhat.com/show_bug.cgi?id=542731 which says that when you use 
clone() you need to handle setting up thread local storage yourself - ie you 
must pass CLONE_SETTLS. Looking at the gperftools source, it doesn't do this.

I can provide the core or perhaps a small test program off-list to gperftools 
developers upon request.

Original issue reported on code.google.com by tlip...@gmail.com on 10 Feb 2013 at 6:28

GoogleCodeExporter commented 9 years ago

If you could pass along the test program that would be great 
(chappedm@gmail.com).

Original comment by chapp...@gmail.com on 10 Mar 2013 at 8:43

Changed state: Started

GoogleCodeExporter commented 9 years ago

I think I have an explanation of this issue, along with a workaround or two.

The issue is that we use sys_clone() to clone the ListerThread in the same 
address space as the triggering thread, and sharing the same thread-local 
storage. This usually works out OK, but there's a race condition here in the 
dynamic loader (dl_runtime_resolve in the stacks above). If the triggering 
thread is doing a resolution (eg of sys_waitpid) at the same time as the 
ListerThread is doing a resolution (eg of sem_wait), then the two can conflict 
and end up triggering this issue.

There's some more gory detail here: 
http://www.mail-archive.com/utrace-devel@redhat.com/msg01944.html

I have two workarounds that appear to work:
1) set LD_BIND_NOW=1 in the environment. This forces all dynamic loading to 
happen ahead of time, so you can't hit this race.
2) make some "junk calls" of the relevant library functions before doing the 
sys_clone:

Index: src/base/linuxthreads.cc
===================================================================
--- src/base/linuxthreads.cc    (revision 208)
+++ src/base/linuxthreads.cc    (working copy)
@@ -614,6 +614,16 @@
      * to ListerThread actually executing
      */
     if (sem_init(&lock, 0, 0) == 0) {
+      // Workaround for issue #497: call various functions before cloning.
+      // his ensures that they are already loaded, since calling the dynamic
+      // loader from our CLONE_VM thread can race against the dynamic loader
+      // in the enclosing thread.
+      sem_post(&lock);
+      sem_wait(&lock);
+      prctl(PR_GET_DUMPABLE, 0, 0, 0, 0);
+        // First do a junk waitpid to trigger dynamic loading of waitpid
+      int junk;
+      sys0_waitpid(getpid(), &junk, WNOHANG);

       int clone_errno;
       clone_pid = local_clone((int (*)(void *))ListerThread, &args);

You may not need all of the above junk calls -- perhaps just the junk 
sys0_waitpid, so that the waitpid from the triggerer doesn't cause any dynamic 
loading. But with the above in place, I can't seem to repro the issue.

Original comment by tlip...@gmail.com on 25 Mar 2013 at 9:39

GoogleCodeExporter commented 9 years ago

Also, it's likely this only affects some applications -- I think glibc will 
only do the SSE register saveoff/restore when it sees that there is at least 
one function using SSE registers as part of its calling convention (the 
application I'm testing does use SSE)

Original comment by tlip...@gmail.com on 25 Mar 2013 at 9:44

ssett / gperftools

heapcheck crashes with illegal instruction (SIGILL) #497