RHEL4. Random crashes in run time on symbol resolution.

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?

The problem is not stable, and I have no concrete steps to reproduce it. It
does, however, seem to happen when an application attempts to start a thread
via pthread_create. I've also seen other crashes with no threading involved
(at least, at the application level).

What is the expected output? What do you see instead?

The application should not crash

What version of the product are you using? On what operating system?

PerfTools 1.4, patched according to ticket #201. RHEL4.

uname -a
Linux ~~~~~~ 2.6.9-78.ELsmp #1 SMP Wed Jul 9 15:39:47 EDT 2008 i686 i686 
i386 GNU/Linux

rpm -qa | grep glibc
glibc-common-2.3.4-2.41
glibc-devel-2.3.4-2.41
glibc-2.3.4-2.41
glibc-headers-2.3.4-2.41
glibc-kernheaders-2.4-9.1.103.EL

/lib/libc.so.6
GNU C Library stable release version 2.3.4, by Roland McGrath et al.
Copyright (C) 2005 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.
Compiled by GNU CC version 3.4.6 20060404 (Red Hat 3.4.6-9).
Compiled on a Linux 2.4.20 system on 2008-04-15.
Available extensions:
        GNU libio by Per Bothner
        crypt add-on version 2.1 by Michael Glad and others
        linuxthreads-0.10 by Xavier Leroy
        The C stubs add-on version 2.1.2.
        BIND-8.2.3-T5B
        NIS(YP)/NIS+ NSS modules 0.19 by Thorsten Kukuk
        Glibc-2.0 compatibility add-on by Cristian Gafton 
        GNU Libidn by Simon Josefsson
        libthread_db work sponsored by Alpha Processor Inc
Thread-local storage support included.
For bug reporting instructions, please see:
<http://www.gnu.org/software/libc/bugs.html>.

Please provide any additional information below.

Our tests on RHEL4 are randomly crashing. The tests involve different code, 
some are multithreaded, some are not. All modules are linked with 
libtcmalloc_minimal.so.

I managed to recover a stack of one crash (which is not very easy, since 
it's on a remote server):

27/05/10 05:28:45 Using host libthread_db library "/lib/tls/libthread_db.so.1".
27/05/10 05:28:45 Core was generated by `/home/nb/KERN-4.1/4.1/ASAPkernel/dev/unix/dll/i686-pc-linux-gnu-gcc/Release/CBO'.
27/05/10 05:28:45 Program terminated with signal 11, Segmentation fault.
27/05/10 05:28:47 #0  0x002c199e in do_lookup_x () from /lib/ld-linux.so.2
27/05/10 05:28:47 
27/05/10 05:28:47 Thread 1 (process 23039):
27/05/10 05:28:47 #0  0x002c199e in do_lookup_x () from /lib/ld-linux.so.2
27/05/10 05:28:47 #1  0x002c1e22 in _dl_lookup_symbol_x () from /lib/ld-linux.so.2
27/05/10 05:28:47 #2  0x002c51d6 in fixup () from /lib/ld-linux.so.2
27/05/10 05:28:47 #3  0x002c5110 in _dl_runtime_resolve () from /lib/ld-linux.so.2
27/05/10 05:28:47 #4  0x08066419 in CrashHandler (sig=11) at ./src/SignalHandlerPosix.cpp:230
27/05/10 05:28:47 #5  <signal handler called>
27/05/10 05:28:47 #6  0x002c199e in do_lookup_x () from /lib/ld-linux.so.2
27/05/10 05:28:47 #7  0x002c1e22 in _dl_lookup_symbol_x () from /lib/ld-linux.so.2
27/05/10 05:28:47 #8  0x002c51d6 in fixup () from /lib/ld-linux.so.2
27/05/10 05:28:47 #9  0x002c5110 in _dl_runtime_resolve () from /lib/ld-linux.so.2
27/05/10 05:28:47 #10 0x003d4fb9 in tcmalloc::CentralFreeList::InsertRange ()
27/05/10 05:28:47    from /home/nb/KERN-4.1/4.1/ASAPkernel/SubstituteBin/i686-pc-linux-gnu-gcc/Release/libtcmalloc_minimal.so.0
27/05/10 05:28:47 #11 0x003d8683 in tcmalloc::ThreadCache::ReleaseToCentralCache ()
27/05/10 05:28:47    from /home/nb/KERN-4.1/4.1/ASAPkernel/SubstituteBin/i686-pc-linux-gnu-gcc/Release/libtcmalloc_minimal.so.0
27/05/10 05:28:47 #12 0x003d879a in tcmalloc::ThreadCache::Scavenge ()
27/05/10 05:28:47    from /home/nb/KERN-4.1/4.1/ASAPkernel/SubstituteBin/i686-pc-linux-gnu-gcc/Release/libtcmalloc_minimal.so.0
27/05/10 05:28:47 #13 0x003cfb67 in (anonymous namespace)::do_free_with_callback ()
27/05/10 05:28:47    from /home/nb/KERN-4.1/4.1/ASAPkernel/SubstituteBin/i686-pc-linux-gnu-gcc/Release/libtcmalloc_minimal.so.0
27/05/10 05:28:47 #14 0x003d1f6f in MallocBlock::ProcessFreeQueue ()
27/05/10 05:28:47    from /home/nb/KERN-4.1/4.1/ASAPkernel/SubstituteBin/i686-pc-linux-gnu-gcc/Release/libtcmalloc_minimal.so.0
27/05/10 05:28:47 #15 0x003ce407 in DebugDeallocate ()
27/05/10 05:28:47    from /home/nb/KERN-4.1/4.1/ASAPkernel/SubstituteBin/i686-pc-linux-gnu-gcc/Release/libtcmalloc_minimal.so.0
27/05/10 05:28:47 #16 0x003de860 in realloc ()
27/05/10 05:28:47    from /home/nb/KERN-4.1/4.1/ASAPkernel/SubstituteBin/i686-pc-linux-gnu-gcc/Release/libtcmalloc_minimal.so.0
27/05/10 05:28:47 #17 0x00525c44 in pthread_create@@GLIBC_2.1 () from /lib/tls/libpthread.so.0
27/05/10 05:28:47 #18 0x0806895f in Impl (this=0x942f990) at ./src/SignalHandlerPosix.cpp:328
27/05/10 05:28:47 #19 0x0806751e in SignalHandler (this=0x942f9b0)
27/05/10 05:28:47     at ./src/SignalHandlerPosix.cpp:344
27/05/10 05:28:47 #20 0x080598ca in StartUp (lg=@0xbffaff30, bUseConsoleHandler=true,
27/05/10 05:28:47     fileName=0x946b3e0 "localsettings.xml") at ./src/StartUp.cpp:30
27/05/10 05:28:47 #21 0x080559a6 in (anonymous namespace)::COMMain (params=@0xbffaff7c)
27/05/10 05:28:47     at ./src/CBOSSinMain.cpp:81
27/05/10 05:28:47 #22 0x08055c83 in main (argc=1, argv=0xbffb0074) at ./src/CBOSSinMain.cpp:130

Note that libtcmalloc_minimal.so.0 is actually a renamed
libtcmalloc_minimal_debug.so.0. I did so in an attempt to debug our
applications and this particular problem.

Here the application is at the very start of its run, attempting to create a
thread to wait for signals. Apparently, the crash occurs when the dynamic
linker attempts to resolve a symbol at run time. The crash handler installed by
the application fails to do anything useful because it also triggers symbol
resolution. In other crashes the application was also at a very early stage,
but I don't have stacks from those.
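
For illustration, here is a minimal sketch of a crash handler that does not depend on lazy symbol resolution at crash time. It is not the application's actual CrashHandler: the names FormatSignalMessage, MinimalCrashHandler and InstallMinimalCrashHandler are hypothetical, and the sketch assumes that the only external function the handler needs is write(), which is called once at install time so that its PLT slot is already bound.

```cpp
#include <cassert>
#include <cstddef>
#include <signal.h>
#include <unistd.h>

// Hypothetical helper: writes "fatal signal <N>\n" into buf using only local
// code, so no PLT call is needed. Returns the number of bytes written.
static std::size_t FormatSignalMessage(int sig, char* buf, std::size_t cap) {
  static const char kPrefix[] = "fatal signal ";
  std::size_t n = 0;
  for (std::size_t i = 0; kPrefix[i] != '\0' && n < cap; ++i) buf[n++] = kPrefix[i];
  char digits[12];
  std::size_t d = 0;
  unsigned v = sig < 0 ? 0u : static_cast<unsigned>(sig);
  do {  // collect decimal digits in reverse order
    digits[d++] = static_cast<char>('0' + v % 10);
    v /= 10;
  } while (v != 0 && d < sizeof(digits));
  while (d > 0 && n < cap) buf[n++] = digits[--d];
  if (n < cap) buf[n++] = '\n';
  return n;
}

// Hypothetical handler: reports the signal and returns. Because it is
// installed with SA_RESETHAND, the faulting instruction then runs again under
// the default disposition, which terminates the process (and dumps core).
extern "C" void MinimalCrashHandler(int sig) {
  char buf[64];
  const std::size_t len = FormatSignalMessage(sig, buf, sizeof(buf));
  const ssize_t ignored = write(STDERR_FILENO, buf, len);
  (void)ignored;
}

// Installs the handler for SIGSEGV. Returns true on success.
bool InstallMinimalCrashHandler() {
  // Assumption: with lazy binding, this zero-length write binds write()'s PLT
  // slot now, so the handler does not have to enter the resolver later.
  const ssize_t ignored = write(STDERR_FILENO, "", 0);
  (void)ignored;
  struct sigaction sa = {};
  sigemptyset(&sa.sa_mask);
  sa.sa_flags = SA_RESETHAND;
  sa.sa_handler = MinimalCrashHandler;
  return sigaction(SIGSEGV, &sa, nullptr) == 0;
}
```

A handler like this would not explain the original fault, but it should at least avoid the second trip through _dl_runtime_resolve seen in the stack above.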

The problem is very hard to reproduce: out of ~1800 tests we run each night,
about 5-10 fail with these symptoms, and almost every time they are different
ones.

I suspect that this is a bug in glibc, but I'm not sure. I'm not knowledgeable
enough about the inner workings of the dynamic linker to go digging into it.

I have a suggestion for a possible workaround in perftools, though. I tried
compiling it with an additional linker flag "-Wl,-z -Wl,now", which forces the
dynamic linker to resolve all symbols in the tcmalloc library immediately at
load time, instead of lazily, which is the default. I did one full test run
and it did not show any tests with the described crash. I'll keep monitoring,
though. In the meantime, I suggest adding the mentioned flag to the library
makefiles.
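
A related experiment that needs no rebuild is to request eager binding for the whole process through the LD_BIND_NOW environment variable, which ld.so honours when it is set to a non-empty string. The sketch below is a hypothetical test-harness launcher, not part of perftools; RequestEagerBinding and ExecWithEagerBinding are made-up names. Unlike the linker flag, it affects every library in the launched process, not only tcmalloc.

```cpp
#include <cassert>
#include <stdlib.h>
#include <unistd.h>

// Hypothetical helper: asks the dynamic linker to bind all symbols at startup
// for any program exec'ed from this process afterwards.
bool RequestEagerBinding() {
  // ld.so treats any non-empty LD_BIND_NOW value as "resolve everything now".
  return setenv("LD_BIND_NOW", "1", /*overwrite=*/1) == 0;
}

// Hypothetical launcher: replaces the current process with argv[0], with
// eager binding requested. Returns -1 only if something failed.
int ExecWithEagerBinding(char* const argv[]) {
  if (argv == nullptr || argv[0] == nullptr) return -1;
  if (!RequestEagerBinding()) return -1;
  return execv(argv[0], argv);
}
```

If the crashes also disappear under such a launcher, that would support the idea that the fault is tied to lazy resolution in general rather than to one particular symbol.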

Original issue reported on code.google.com by andrey.s...@gmail.com on 28 May 2010 at 5:22

GoogleCodeExporter commented 9 years ago

What a strange bug.  I'm trying to imagine what symbol the dynamic loader could
be trying to load.  InsertRange() doesn't make any exotic function calls or
variable accesses.

Looking around, I see a few other reports of segfaults in do_lookup_x.  One
reports it was trying to look up errno, which is maybe what's happening here.
There are a few guesses as to what might be going on, but nothing definitive.

I'd be more comfortable putting in documentation of your workaround, if I knew
the problem was still happening in perftools 1.5.  Is it practical for you to
update to the latest version?

Original comment by csilv...@gmail.com on 28 May 2010 at 3:46

Added labels: Priority-Medium, Type-Defect

GoogleCodeExporter commented 9 years ago

I'm afraid not. But in any case, is there that much difference between 1.4 and
1.5 in that part of the library? I don't have access to our code right now; I'll
have a look after the weekend.

Original comment by andrey.s...@gmail.com on 28 May 2010 at 4:23

GoogleCodeExporter commented 9 years ago

It's hard to say what differences might cause the problems you're seeing.  I
admit I don't understand it at all.  The theories I see on the web, as to what
causes this kind of crash, are libc incompatibility problems.  Did you compile
libtcmalloc on the same system you're running it on?

Here's another workaround you can try: modify your application to call some
function that sets errno (maybe do something like read(1, ...)).  It would be
interesting to see if either a) that causes a crash itself, or b) that fixes the
crashes you've been seeing.  This assumes the symbol being looked up at crash
time is errno.  If there's a way to verify that, it would be great, but I
understand that might be too difficult.

Original comment by csilv...@gmail.com on 28 May 2010 at 4:58

GoogleCodeExporter commented 9 years ago

> Did you compile libtcmalloc on the same system you're running it on?

No, the library was built on another machine. It runs RHEL4 of a slightly older 
version, but has the same compiler. But regardless, binary compatibility across 
different versions of RHEL is a requirement for us.

> Here's another workaround you can try: modify your application to call some 
function that sets errno...

That would require quite a lot of work and would cause breakage. I'm not sure
I'm allowed to do that. Besides, I wouldn't want to add dummy calls to our code,
since that does not guarantee that it won't break somewhere else at a library
update in the future.

Original comment by andrey.s...@gmail.com on 31 May 2010 at 4:43

GoogleCodeExporter commented 9 years ago

BTW, I don't think that errno is the only candidate for being the cause. For
instance, I see that on Linux SpinLock is implemented through a futex. If futex
is not available, it involves sched_yield and nanosleep. Technically, _any_
symbol, including those defined by tcmalloc itself, can trigger symbol
resolution. So I'd be better off resolving them all at an early stage than
trying to do it selectively, by hand.

Original comment by andrey.s...@gmail.com on 31 May 2010 at 4:57

GoogleCodeExporter commented 9 years ago

Just to be clear, I wasn't suggesting calling an errno function as a permanent
workaround to this problem, but merely as a test to see if I can figure out more
precisely what's going on here.  The right way to figure it out is to use a
debugger, of course, but you say that's really hard in your context, so we have
to be more creative.

} No, the library was built on another machine.

Does the new machine have the same libc as the old machine?

There are just a lot of variables here; right now my working hypothesis is the
problems you're seeing are due to something in your setup, not something in
tcmalloc.  If you can address all the variables that there currently are -- a
more modern perftools, a more homogenous execution environment -- then I'm happy
to take a look, but as things stand there's just too many variables to justify
taking any particular action.

Original comment by csilv...@gmail.com on 31 May 2010 at 3:50

GoogleCodeExporter commented 9 years ago

> Just to be clear, I wasn't suggesting calling an errno function as a permanent
> workaround to this problem, but merely as a test to see if I can figure out 
more
> precisely what's going on here.

I see. But in any case, as I understand, the symbol has to be referred from
libtcmalloc since each symbol is referred to by GOT, which is module-specific
[1]. I've inspected our code and it surely does call tcmalloc many times through
operator new and, perhaps, malloc (it fills STL containers and strings a lot)
before the crash occurs. I can't be sure these calls involve errno, though.

> Does the new machine have the same libc as the old machine?

uname -a
Linux ~~~~~~~ 2.6.9-5.ELsmp #1 SMP Wed Jan 5 19:30:39 EST 2005 i686 i686 i386 
GNU/Linux

rpm -qa | grep glibc
glibc-2.3.4-2
glibc-headers-2.3.4-2
glibc-common-2.3.4-2
glibc-kernheaders-2.4-9.1.87
glibc-devel-2.3.4-2

/lib/libc.so.6
GNU C Library stable release version 2.3.4, by Roland McGrath et al.
Copyright (C) 2004 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.
Compiled by GNU CC version 3.4.3 20041212 (Red Hat 3.4.3-9.EL4).
Compiled on a Linux 2.4.20 system on 2004-12-20.
Available extensions:
        GNU libio by Per Bothner
        crypt add-on version 2.1 by Michael Glad and others
        linuxthreads-0.10 by Xavier Leroy
        The C stubs add-on version 2.1.2.
        BIND-8.2.3-T5B
        NIS(YP)/NIS+ NSS modules 0.19 by Thorsten Kukuk
        Glibc-2.0 compatibility add-on by Cristian Gafton
        GNU Libidn by Simon Josefsson
        libthread_db work sponsored by Alpha Processor Inc
Thread-local storage support included.
For bug reporting instructions, please see:
<http://www.gnu.org/software/libc/bugs.html>.

[1] http://www.iecc.com/linker/linker10.html

Original comment by andrey.s...@gmail.com on 1 Jun 2010 at 5:19

GoogleCodeExporter commented 9 years ago

I will try to recompile tcmalloc without the suggested flag but with debugging
info. Perhaps it will show more clearly where in tcmalloc things go wrong. But
reproducing the crash will be tricky and will take time.

Original comment by andrey.s...@gmail.com on 1 Jun 2010 at 5:24

GoogleCodeExporter commented 9 years ago

} glibc-2.3.4-2

And the other glibc is, it looks like,

glibc-common-2.3.4-2.41

I don't know what the difference is between them, but I wonder if that might be 
responsible for the problems you're seeing.

} I see. But in any case, as I understand, the symbol has to be referred from 
} libtcmalloc

Ah, good point.  Your approach of recompiling with debug info may be the most 
productive, then.  Let me know when you manage to reproduce it again.

Original comment by csilv...@gmail.com on 1 Jun 2010 at 6:30

GoogleCodeExporter commented 9 years ago

A little update: the crash is also present with 1.5. I couldn't collect the
stack, but hopefully I'll be able to do that tomorrow.

Original comment by andrey.s...@gmail.com on 2 Jun 2010 at 4:06

GoogleCodeExporter commented 9 years ago

> I don't know what the difference is between them, but I wonder if that might 
be 
> responsible for the problems you're seeing.

Well, I did say that I suspect glibc. However, I don't think it's an ABI
problem, since the two versions only differ by patch level. But that's as far
as my speculation can go.

Original comment by andrey.s...@gmail.com on 2 Jun 2010 at 4:10

GoogleCodeExporter commented 9 years ago

I was luckier than I thought. Here's the stack:

#0  0x002c199e in do_lookup_x () from /lib/ld-linux.so.2
#1  0x002c1e22 in _dl_lookup_symbol_x () from /lib/ld-linux.so.2
#2  0x002c51d6 in fixup () from /lib/ld-linux.so.2
#3  0x002c5110 in _dl_runtime_resolve () from /lib/ld-linux.so.2
#4  0x0052c464 in fork () from /lib/tls/libpthread.so.0
#5  0x08066a16 in CrashHandler (sig=11) at ./src/SignalHandlerPosix.cpp:272
#6  <signal handler called>
#7  0x002c199e in do_lookup_x () from /lib/ld-linux.so.2
#8  0x002c1e22 in _dl_lookup_symbol_x () from /lib/ld-linux.so.2
#9  0x002c51d6 in fixup () from /lib/ld-linux.so.2
#10 0x002c5110 in _dl_runtime_resolve () from /lib/ld-linux.so.2
#11 0x00a70619 in tcmalloc::CentralFreeList::InsertRange (this=0xa8a240, start=0x8481900, end=0x847cc60, N=-842150451) at ../src/central_freelist.cc:183
#12 0x00a74bc3 in tcmalloc::ThreadCache::ReleaseToCentralCache (this=0x83b3000, src=0x83b30ec, cl=18, N=53) at ../src/thread_cache.cc:214
#13 0x00a74cda in tcmalloc::ThreadCache::Scavenge (this=0x83b3000) at ../src/thread_cache.cc:237
#14 0x00a6acf7 in do_free_with_callback (ptr=0x84808f0, invalid_free_fn=0xa668e4 <(anonymous namespace)::InvalidFree(void*)>) at ../src/thread_cache.h:361
#15 0x00a689b5 in DebugDeallocate (ptr=Variable "ptr" is not available.) at ../src/tcmalloc.cc:993
#16 0x00a7b6b7 in realloc (ptr=0x84b2d20, size=Variable "size" is not available.) at ../src/debugallocation.cc:1064
#17 0x00525c44 in pthread_create@@GLIBC_2.1 () from /lib/tls/libpthread.so.0
#18 0x0806895f in Impl (this=0x83d48b0) at ./src/SignalHandlerPosix.cpp:328
#19 0x0806751e in SignalHandler (this=0x83d48d0) at ./src/SignalHandlerPosix.cpp:344
#20 0x080598ca in StartUp (lg=@0xbff147e0, bUseConsoleHandler=true, fileName=0x841a3e0 "localsettings.xml") at ./src/StartUp.cpp:30
#21 0x080559a6 in (anonymous namespace)::COMMain (params=@0xbff1482c) at ./src/CBOSSinMain.cpp:81
#22 0x08055c83 in main (argc=1, argv=0xbff14924) at ./src/CBOSSinMain.cpp:130

As before, the test uses libtcmalloc_minimal_debug.so, only this time it's the
unpatched version 1.5 with debug info.

Original comment by andrey.s...@gmail.com on 2 Jun 2010 at 4:34

GoogleCodeExporter commented 9 years ago

OK, line 183 is this:

} if (N == Static::sizemap()->num_objects_to_move(size_class_) &&

The rest of the 'if' is

}    MakeCacheSpace()) {

which may also be what gdb is reporting.

When you say you compiled with debug info, what is the exact set of compiler
flags you used?  (What did you pass ./configure?)  Did you have optimization on
as well?  Being able to do this without optimization would probably be helpful
as well.

num_objects_to_move() is just an array reference.  MakeCacheSpace() ends up
being pretty big, hopefully too big to inline.  But it doesn't make any function
calls out of tcmalloc either.  I don't know why the dynamic linker would be
necessary in this code.

What does
  ldd <your app>
say?  What other libraries are being used besides tcmalloc_minimal_debug and
libc and libstdc++?

It looks like you're using pthreads.  What is your commandline for creating your
executable?  Is libtcmalloc_minimal last on the link line?  I'm wondering if
maybe -pthread comes after libtcmalloc, and tcmalloc is getting initialized with
libc malloc, and then trying to use tcmalloc later (for some realloc it's
doing).  That would cause a crash, which may evidence itself in what you're
seeing here.

Original comment by csilv...@gmail.com on 2 Jun 2010 at 6:58

GoogleCodeExporter commented 9 years ago

> When you say you compiled with debug info, what is the exact set of compiler 
flags 
> you used?  (What did you pass ./configure?)  Did you have optimization on as 
well?

Debug info was enabled with -g3. Yes, optimization is on, with -O3 and a few
other flags (I don't have access to the script right now).

Regarding your other questions, libtcmalloc is the first in the linker command
line. It is loaded before any other libraries, including pthread. Other
libraries include STLPort, libstdc++, ICU and quite a few others.

Original comment by andrey.s...@gmail.com on 2 Jun 2010 at 7:15

GoogleCodeExporter commented 9 years ago

I think that is the problem.  Try putting libtcmalloc last on the linkline,
after everything except for libc and libstdc++.  See if the problems go away
then.  I suspect the problem is an alloc/free (or in your case, alloc/realloc)
mismatch, due to when the libraries get loaded in.  The DL_NOW flag works around
this problem by changing the way the dynamic loader does symbol resolution.

Original comment by csilv...@gmail.com on 2 Jun 2010 at 7:22

GoogleCodeExporter commented 9 years ago

I don't understand. If tcmalloc is loaded first, other libraries should use
malloc/free and friends from tcmalloc, and not from any other library. And
that's exactly what we're trying to achieve by using tcmalloc.

Original comment by andrey.s...@gmail.com on 2 Jun 2010 at 7:30

GoogleCodeExporter commented 9 years ago

> The DL_NOW flag works around this problem by changing the way the dynamic 
loader
> does symbol resolution.

I may be missing something, but as I understand the relocation order is not
exactly related to run time symbol resolution. The linker builds the
process-wide table of symbols as it loads modules, in the order of loading them.
Lazy symbol resolution then fills entries in module-specific GOTs according to
this relocation table, and is not related to module loading order.

Original comment by andrey.s...@gmail.com on 2 Jun 2010 at 7:38

GoogleCodeExporter commented 9 years ago

} If tcmalloc is loaded first, other libraries should use
} malloc/free and friends from tcmalloc

Yes, but to load tcmalloc first, you need to list it last on the linkline.

One way to verify the load order is to run ldd on your binary.  As I understand
it, the dynamic loader reads these libraries from the bottom up.

} I may be missing something, but as I understand the relocation order is not
} exactly related to run time symbol resolution.

I'm probably the one missing something.  The issue is with weak symbols that are
defined in one .so and then redefined in another.  I believe it's possible to
get a different answer with DL_BIND_NOW than without, for libraries loaded
between the first definition of the weak symbol and the second.  But I could be
smoking crack.

Original comment by csilv...@gmail.com on 2 Jun 2010 at 8:32

GoogleCodeExporter commented 9 years ago

> One way to verify the load order is to run ldd on your binary.  As I 
understand it,
> the dynamic loader reads these libraries from the bottom up.

The output ldd produces is actually the result of the linker loading the
libraries. In fact, it is the linker that writes it. I can't imagine how it
could load libraries in reverse order.
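
The order can also be checked from inside the running process instead of relying on ldd output. The following is only a sketch, assuming glibc's dl_iterate_phdr() from <link.h>; CollectObjectName and LoadedObjects are hypothetical names. It returns the objects in the order the dynamic linker's own list holds them; glibc documents that the main program is visited first and reported with an empty name.

```cpp
#include <cassert>
#include <cstddef>
#include <link.h>
#include <string>
#include <vector>

// Callback for dl_iterate_phdr: appends each object's path to the vector
// passed through `data`.
static int CollectObjectName(struct dl_phdr_info* info, std::size_t /*size*/, void* data) {
  auto* names = static_cast<std::vector<std::string>*>(data);
  names->push_back(info->dlpi_name != nullptr ? info->dlpi_name : "");
  return 0;  // zero means "keep iterating"
}

// Returns the objects currently mapped into this process, in the order the
// dynamic linker's list holds them.
std::vector<std::string> LoadedObjects() {
  std::vector<std::string> names;
  dl_iterate_phdr(CollectObjectName, &names);
  return names;
}
```

Printing the result of LoadedObjects() early in main() would show whether libtcmalloc_minimal really precedes libc and libpthread in the process.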

> The issue is with weak symbols that are defined in one .so and then redefined 
in
> another.

Weak symbols are not redefined. If a symbol is present in two libs, the one that
is loaded first defines it for the application. References to the symbol from
the lib that is loaded second are relocated to point to the first lib. That way
pointers to symbols stay stable for the whole run time of the application.

Original comment by andrey.s...@gmail.com on 3 Jun 2010 at 2:15

GoogleCodeExporter commented 9 years ago

BTW, I am so sure about the ldd output and symbol relocation because I can see
that behavior confirmed in practice. For instance, in this very crash pthread
calls into tcmalloc because it's loaded prior to libc and defines its own
realloc.

Original comment by andrey.s...@gmail.com on 3 Jun 2010 at 2:21

GoogleCodeExporter commented 9 years ago

As long as tcmalloc gets loaded after libc (because it's prior to it on the link
line), everyone will use tcmalloc after it's loaded.  The question is what
happens between the time libc is loaded and tcmalloc is loaded.  That is where
the problem started for you, I'm guessing.

Anyway, the proof is in the pudding.  What happens when you move libtcmalloc to
be last on the link line (except for libc and libstdc++)?

Original comment by csilv...@gmail.com on 3 Jun 2010 at 2:35

GoogleCodeExporter commented 9 years ago

> When you say you compiled with debug info, what is the exact set of compiler 
flags 
> you used?  (What did you pass ./configure?)  Did you have optimization on as 
well?

Here's the (simplified) build script:

mkdir tmp
cd tmp

export CC="gcc410"
export CXX="g++410"
export CFLAGS="-I ${ROOT_DIR}/ThirdParty/STLport/stlport -march=pentium4 -mmmx \
-msse -msse2 -mfpmath=sse -minline-all-stringops -O3 -ftree-vectorize \
-fno-strict-aliasing -fvisibility-inlines-hidden -g3"
export CXXFLAGS="$CFLAGS"
export LDFLAGS="-L${ROOT_DIR}/ThirdParty/STLport/lib/i686-pc-linux-gnu-gcc"
export LIBS="-lstlport_gcc"

../configure --enable-shared --disable-static --enable-frame-pointers
make -j 2

Original comment by andrey.s...@gmail.com on 3 Jun 2010 at 6:06

GoogleCodeExporter commented 9 years ago

We specify neither libc nor libstdc++ on the linker command line. The linker
adds them implicitly, as if they were specified last. Here's the ldd output:

        libtcmalloc_minimal.so.0 => /home/asemashe/Bin/Substitute/libtcmalloc_minimal.so.0 (0x004fe000)
        libstlport_gcc.so.5.1 => /home/asemashe/Bin/Actual/libstlport_gcc.so.5.1 (0x00cf8000)
        librt.so.1 => /lib/tls/librt.so.1 (0x009c6000)
        libpthread.so.0 => /lib/tls/libpthread.so.0 (0x00557000)
        libdl.so.2 => /lib/libdl.so.2 (0x0043b000)
        libicuuc.so.34 => /home/asemashe/Bin/Actual/libicuuc.so.34 (0x00111000)
        libicudata.so.34 => /home/asemashe/Bin/Actual/libicudata.so.34 (0x00d83000)
        libicui18n.so.34 => /home/asemashe/Bin/Actual/libicui18n.so.34 (0x007f2000)
        libicule.so.34 => /home/asemashe/Bin/Actual/libicule.so.34 (0x0091c000)
        libiculx.so.34 => /home/asemashe/Bin/Actual/libiculx.so.34 (0x0021b000)
        libicutu.so.34 => /home/asemashe/Bin/Actual/libicutu.so.34 (0x00224000)
        libboost_regex.so.1.40.0 => /home/asemashe/Bin/Actual/libboost_regex.so.1.40.0 (0x00317000)
        libboost_thread.so.1.40.0 => /home/asemashe/Bin/Actual/libboost_thread.so.1.40.0 (0x00238000)
        libwin32.so => /home/asemashe/Bin/Substitute/libwin32.so (0x005f8000)
        libported_com.so => /home/asemashe/Bin/Actual/libported_com.so (0x0024a000)
        libported_ole.so => /home/asemashe/Bin/Actual/libported_ole.so (0x00ac2000)
        libvas_regapi.so => /home/asemashe/Bin/Actual/libvas_regapi.so (0x00254000)
        libGlobalObserver.so => /home/asemashe/Bin/Actual/libGlobalObserver.so (0x0058d000)
        libboost_program_options.so.1.40.0 => /home/asemashe/Bin/Actual/libboost_program_options.so.1.40.0 (0x00746000)
        libgcc_s.so.1 => /opt/lib/libgcc_s.so.1 (0x00285000)
        libc.so.6 => /lib/tls/libc.so.6 (0x00608000)
        libstdc++.so.6 => /opt/lib/libstdc++.so.6 (0x009da000)
        libm.so.6 => /lib/tls/libm.so.6 (0x00441000)
        /lib/ld-linux.so.2 (0x002f0000)
        libboost_filesystem.so.1.40.0 => /home/asemashe/Bin/Actual/libboost_filesystem.so.1.40.0 (0x0028f000)
        libboost_system.so.1.40.0 => /home/asemashe/Bin/Actual/libboost_system.so.1.40.0 (0x002a4000)

Actually, I think that libstdc++ and the other libs below it are brought in by
dependent libraries. STLPort depends on libstdc++ and libm, and some of our libs
depend on boost_system and boost_filesystem.

I still don't understand how tcmalloc could be loaded last and yet replace the
malloc functions from libc for other libraries. Even then, what will we discover
if tcmalloc is moved to the end of the linker line? Will it prevent other libs
from using tcmalloc? Could you explain your theory so I can justify the
experiment with our build?

Original comment by andrey.s...@gmail.com on 3 Jun 2010 at 6:34

GoogleCodeExporter commented 9 years ago

Hmm, I tried to reproduce the problem, but I can't.  Maybe the situation I was
thinking of can only occur with dlopen-ed libraries or something.  Ok, I'm
letting that theory go for the moment.

Unfortunately, I don't have a good one to replace it.  There's no credible need
for a dl lookup at the time of the crash, as far as I can see for the stack
trace.  I can't figure out from it, what's going on.

I think the next step would be to get a non-optimized stacktrace, with all code
(both tcmalloc and your application) compiled with "-O0 -g".

I don't know how much debugging you want to do for this.  You've found a
workaround that works for you, and I totally understand if you're happy to go
with that and just move on.  But if not, are you up for trying to get a
non-optimized stacktrace?

Original comment by csilv...@gmail.com on 3 Jun 2010 at 1:58

GoogleCodeExporter commented 9 years ago

> There's no credible need for a dl lookup at the time of the crash, as far as 
I can
> see for the stack trace.  I can't figure out from it, what's going on.

Well, there are functions called in that "if" statement. Unless inlined, any of
them can go through PLT/GOT and trigger symbol resolution. I tried to track down
which one it is, but did not succeed.

> But if not, are you up for trying to get a non-optimized stacktrace?

I can try with non-optimized tcmalloc next week. But the application will still
be optimized because disabling it is a too major change. Its optimization
doesn't do any harm in this case anyway.

Original comment by andrey.s...@gmail.com on 3 Jun 2010 at 2:43

GoogleCodeExporter commented 9 years ago

} I can try with non-optimized tcmalloc next week. But the application will
} still be optimized because disabling it is a too major change. Its
} optimization doesn't do any harm in this case anyway.

Sounds great.  You're right -- an unoptimized tcmalloc should be enough.

Original comment by csilv...@gmail.com on 3 Jun 2010 at 4:13

GoogleCodeExporter commented 9 years ago

Here's the stack with the unoptimized version:

#0  0x002c199e in do_lookup_x () from /lib/ld-linux.so.2
#1  0x002c1e22 in _dl_lookup_symbol_x () from /lib/ld-linux.so.2
#2  0x002c51d6 in fixup () from /lib/ld-linux.so.2
#3  0x002c5110 in _dl_runtime_resolve () from /lib/ld-linux.so.2
#4  0x0052c464 in fork () from /lib/tls/libpthread.so.0
#5  0x08066a16 in CrashHandler (sig=11) at ./src/SignalHandlerPosix.cpp:272
#6  <signal handler called>
#7  0x002c199e in do_lookup_x () from /lib/ld-linux.so.2
#8  0x002c1e22 in _dl_lookup_symbol_x () from /lib/ld-linux.so.2
#9  0x002c51d6 in fixup () from /lib/ld-linux.so.2
#10 0x002c5110 in _dl_runtime_resolve () from /lib/ld-linux.so.2
#11 0x001e7502 in tcmalloc::CentralFreeList::InsertRange (this=0x2023a0, start=0x942f900, end=0x942ac60, N=32) at ../src/central_freelist.cc:183
#12 0x001ec993 in tcmalloc::ThreadCache::ReleaseToCentralCache (this=0x9361000, src=0x93610ec, cl=18, N=53) at ../src/thread_cache.cc:214
#13 0x001ecaa4 in tcmalloc::ThreadCache::Scavenge (this=0x9361000) at ../src/thread_cache.cc:237
#14 0x001e2d90 in tcmalloc::ThreadCache::Deallocate (this=0x9361000, ptr=0x942e8f0, cl=12) at ../src/thread_cache.h:361
#15 0x001e2e68 in (anonymous namespace)::do_free_with_callback (ptr=0x942e8f0, invalid_free_fn=0x1de208 <(anonymous namespace)::InvalidFree(void*)>) at ../src/tcmalloc.cc:971
#16 0x001e2f7b in (anonymous namespace)::do_free (ptr=0x942e8f0) at ../src/tcmalloc.cc:993
#17 0x001e3825 in MallocBlock::ProcessFreeQueue (b=0x9460d10, size=172, max_free_queue_size=10485760) at ../src/debugallocation.cc:603
#18 0x001e39ca in MallocBlock::Deallocate (this=0x9460d10, type=-271733872) at ../src/debugallocation.cc:573
#19 0x001def0c in DebugDeallocate (ptr=0x9460d20, type=-271733872) at ../src/debugallocation.cc:974
#20 0x001f2bcf in realloc (ptr=0x9460d20, size=1024) at ../src/debugallocation.cc:1064
#21 0x00525c44 in pthread_create@@GLIBC_2.1 () from /lib/tls/libpthread.so.0
#22 0x0806895f in Impl (this=0x93828b0) at ./src/SignalHandlerPosix.cpp:328
#23 0x0806751e in SignalHandler (this=0x93828d0) at ./src/SignalHandlerPosix.cpp:344
#24 0x080598ca in StartUp (lg=@0xbff501b0, bUseConsoleHandler=true, fileName=0x93bf3e0 "localsettings.xml") at ./src/StartUp.cpp:30
#25 0x080559a6 in (anonymous namespace)::COMMain (params=@0xbff501fc) at ./src/CBOSSinMain.cpp:81
#26 0x08055c83 in main (argc=1, argv=0xbff502f4) at ./src/CBOSSinMain.cpp:130

Nothing new.

Original comment by andrey.s...@gmail.com on 8 Jun 2010 at 4:26

GoogleCodeExporter commented 9 years ago

I tried to analyze the disassembly to uncover which function call causes the
crash. It did not show it explicitly, but from the context it looks like it's
trying to call MakeCacheSpace() from the "if" statement. I don't know how to
find out for sure, since I don't know how to decode the PLT entry correctly.
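
For reference, here is a rough sketch of how such a PLT entry could be decoded, assuming the standard i386 System V layout: a lazy PLT stub pushes the byte offset of its own entry in .rel.plt before jumping to the resolver, and that relocation's r_info names the symbol in .dynsym. The function names and the idea of passing the three tables as already-loaded arrays are hypothetical; the types and macros come from <elf.h>.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <elf.h>

// The value pushed by an i386 PLT stub is a byte offset into .rel.plt;
// dividing by the entry size (8 bytes for Elf32_Rel) gives the entry index.
inline std::uint32_t PltRelocIndex(std::uint32_t pushed_offset) {
  return pushed_offset / static_cast<std::uint32_t>(sizeof(Elf32_Rel));
}

// Hypothetical helper: maps the pushed offset to the name of the symbol the
// stub resolves. Returns nullptr if the index is out of range or the entry is
// not a jump-slot relocation.
const char* PltTargetName(std::uint32_t pushed_offset,
                          const Elf32_Rel* rel_plt, std::size_t rel_count,
                          const Elf32_Sym* dynsym, const char* dynstr) {
  const std::uint32_t index = PltRelocIndex(pushed_offset);
  if (index >= rel_count) return nullptr;
  const Elf32_Rel& rel = rel_plt[index];
  if (ELF32_R_TYPE(rel.r_info) != R_386_JMP_SLOT) return nullptr;
  // The upper 24 bits of r_info select the .dynsym entry; st_name is an
  // offset into .dynstr.
  return dynstr + dynsym[ELF32_R_SYM(rel.r_info)].st_name;
}
```

The same lookup can be done by hand: divide the immediate of the stub's push instruction by 8 and count that many entries into the .rel.plt listing printed by readelf -r.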

I attached the disassembly of the InsertRange() function (execution leaves it
at address 0x001e74fd) and the PLT entry it calls. I also attached the readelf
output in case you can make better use of it than I did. I'll keep the core
file in case you need any other info out of it.

Original comment by andrey.s...@gmail.com on 8 Jun 2010 at 4:56

Attachments:

debug.zip

GoogleCodeExporter commented 9 years ago

This is out of my depth, I think, but we have some experts here who may be able
to make sense of what's going on.  I'll ping them.

Original comment by csilv...@gmail.com on 8 Jun 2010 at 6:52

GoogleCodeExporter commented 9 years ago

One more question: I see the function in frame 23 is called SignalHandler.  It 
doesn't look like it's actually in a signal handler at this time, but is that 
function also called via a signal handler?  I ask because calling 
pthread_create in a signal handler is definitely not kosher.  What signal 
handling does your application do?

Original comment by csilv...@gmail.com on 8 Jun 2010 at 7:40

GoogleCodeExporter commented 9 years ago

The SignalHandler is our function that sets up signal handling. It's not a 
handler and there is no signal at that point. As a part of its work, 
SignalHandler creates a dedicated thread that will wait for Ctrl+C in sigwait; 
that is the thread being spawned by pthread_create.
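
For readers unfamiliar with the pattern, here is a minimal sketch of a dedicated sigwait thread. It is not the application's actual code: SignalWaiter, WaiterMain, StartSignalWaiter and SendToProcess are hypothetical names made up for illustration, and the real SignalHandler/Impl constructors may well differ. It assumes the signal is blocked before pthread_create so that the new thread inherits the mask.

```cpp
#include <pthread.h>
#include <signal.h>
#include <unistd.h>
#include <cassert>

// Hypothetical helper type, not the application's real SignalHandler class.
struct SignalWaiter {
    sigset_t set;      // signals the dedicated thread waits for
    int received;      // signal number returned by sigwait, 0 until then
    pthread_t thread;  // the dedicated waiting thread
};

// Thread body: sleep in sigwait until one of the blocked signals is pending.
extern "C" void* WaiterMain(void* arg) {
    SignalWaiter* waiter = static_cast<SignalWaiter*>(arg);
    assert(waiter != nullptr);
    int signo = 0;
    if (sigwait(&waiter->set, &signo) == 0) {
        waiter->received = signo;
    }
    return nullptr;
}

// Block `signo` in the calling thread first (threads created afterwards
// inherit the mask), then spawn the waiter with pthread_create.
bool StartSignalWaiter(SignalWaiter* waiter, int signo) {
    waiter->received = 0;
    sigemptyset(&waiter->set);
    sigaddset(&waiter->set, signo);
    if (pthread_sigmask(SIG_BLOCK, &waiter->set, nullptr) != 0) {
        return false;
    }
    return pthread_create(&waiter->thread, nullptr, WaiterMain, waiter) == 0;
}

// Send a process-directed signal, roughly what Ctrl+C does with SIGINT.
bool SendToProcess(int signo) {
    return kill(getpid(), signo) == 0;
}
```

With SIGINT passed as `signo`, the waiter thread would wake up on Ctrl+C. No signal handler function is involved at any point, which matches the description above.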

Original comment by andrey.s...@gmail.com on 9 Jun 2010 at 3:30

GoogleCodeExporter commented 9 years ago

And BTW, if you wonder about the naming, SignalHandler and Impl are actually 
class constructors.

Original comment by andrey.s...@gmail.com on 9 Jun 2010 at 3:35

GoogleCodeExporter commented 9 years ago

OK, thanks for the info.  It was a nice theory while it lasted...  Just to 
confirm: at the time the crash has happened, no actual signal-handling had been 
done yet, right?  (I do see that the signal handler is being called in the 
stacktrace you give, but that's already after the program was crashing, if I 
understand it right.)

From what I've seen so far, I'm pretty confident the problem isn't in tcmalloc. 
I'm not sure exactly where it might be.  The next step, I think, would be to 
install libc-debug so we can get more insight into what is actually happening 
during the crash, and maybe also to look more into the assembly as you've 
already started to do.

I don't know if you want to spend the time to do this, especially since you 
have a functioning workaround.  It may not be worth the time it takes to figure 
this out.

Original comment by csilv...@gmail.com on 9 Jun 2010 at 6:25

GoogleCodeExporter commented 9 years ago

Hello, Andrey,

I've looked over this issue, and have a strong suspicion that the problem
has nothing to do with tcmalloc. A more likely explanation is that you do
something non-kosher in your signal handlers (it is notably hard to write
correct multithreaded programs which handle signals).

Heap corruption is another possible candidate. Is your program
Valgrind-clean?

A couple of things might help to analyze this further.

First please post the output from GDB "thread apply all where" for the
coredump you already have.

If you can install glibc-debuginfo package, GDB should be able to show the
glibc source for do_lookup_x(). In that case, please also do "info locals"
in the crashing do_lookup_x() frame.

If you can't install glibc-debuginfo, please do (in do_lookup_x() frame):

 info registers
 disas

Thanks,

Original comment by ppluzhni...@google.com on 9 Jun 2010 at 6:31

GoogleCodeExporter commented 9 years ago

> Just to confirm: at the time the crash has happened, no actual signal-handling
> had been done yet, right?

That's right. The sigwait has not been called and no signals have been raised 
yet that I'm aware of. The CrashHandler is installed to handle SIGSEGV and 
SIGBUS synchronously, before the thread for sigwait is spawned. That's why it 
is called when the app crashes. But it has nothing to do with the crashes 
themselves since I added it _after_ I started to observe the problem, in an 
attempt to debug it.

> The next step, I think, would be to install libc-debug...

Unfortunately, I don't have the power to alter the software on the machine, 
except for what I write.

> A more likely explanation is that you do something non-kosher in your signal
> handlers (it is notably hard to write correct multithreaded programs which
> handle signals).

Yes, I'm aware of the issues with writing signal handlers. I can assure you, 
there's nothing wrong with them. At least, it's not the signal handler that 
causes the crash since it hasn't been called yet.

> Is your program Valgrind-clean?

Yes, I checked that before creating the ticket. The tcmalloc_debug, which is 
actually run here, doesn't complain either.

> First please post the output from GDB "thread apply all where" for the
> coredump you already have.

It's the same as what is presented in this thread, since there is only one 
thread so far. pthread_create crashes before it spawns the second one.

> If you can't install glibc-debuginfo, please do (in do_lookup_x() frame)...

Ok, I'll do that tomorrow.

Original comment by andrey.s...@gmail.com on 9 Jun 2010 at 7:15

GoogleCodeExporter commented 9 years ago

> If you can't install glibc-debuginfo, please do (in do_lookup_x() frame)...

Here it is. I took registers of both calls to do_lookup_x.

Original comment by andrey.s...@gmail.com on 10 Jun 2010 at 4:01

Attachments:

do_lookup_x.txt

GoogleCodeExporter commented 9 years ago

At crash point: eax == 0xcdcdcdcd

Crashing instruction:
0x002c199e <do_lookup_x+94>:    mov    0x14(%eax),%esi

Since 0xcdcdcdcd is the deleted pattern, it is fairly safe to assume that 
something in do_lookup_x is accessing free()d memory.

The source reads:

25  do_lookup_x (const char *undef_name, unsigned long int hash,
26           const ElfW(Sym) *ref, struct sym_val *result,
27           struct r_scope_elem *scope, size_t i,
28           const struct r_found_version *const version, int flags,
29           struct link_map *skip, int type_class)
30  {
31    struct link_map **list = scope->r_list;
32    size_t n = scope->r_nlist;
33    struct link_map *map;
34  
35    do
36      {
37        const ElfW(Sym) *symtab;
38        const char *strtab;
39        const ElfW(Half) *verstab;
40        Elf_Symndx symidx;
41        const ElfW(Sym) *sym;
42        int num_versions = 0;
43        const ElfW(Sym) *versioned_sym = NULL;
44  
45        map = list[i]->l_real;
46  
47        /* Here come the extra test needed for `_dl_lookup_symbol_skip'.  */
48        if (skip != NULL && map == skip)
49      continue;
50  
51        /* Don't search the executable when resolving a copy reloc.  */
52        if ((type_class & ELF_RTYPE_CLASS_COPY) && map->l_type == lt_executable)
53      continue;

AFAICT, the crash is happening on line 45, and indeed offsetof(struct link_map, 
l_real) == 0x14.

So I think list[i] is dangling at that point.
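
The arithmetic behind that conclusion can be sketched as follows. This only simulates the situation and never touches freed memory: it fills a 4-byte slot with the 0xcd byte, reads it back as a 32-bit pointer value (the machine in question is i686), and adds the 0x14 offset of l_real. The function and constant names are hypothetical; the values 0xcd and 0x14 are the ones quoted in this thread.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Byte pattern observed in eax at the crash point (taken from this thread).
const unsigned char kDeletedFill = 0xCD;
// offsetof(struct link_map, l_real) according to the analysis above.
const uint32_t kLRealOffset = 0x14;

// Simulate reading a 32-bit pointer out of a slot that was filled with the
// deleted pattern, i.e. what loading list[i] from scrubbed memory yields.
uint32_t ReadScrubbedPointer() {
    unsigned char slot[sizeof(uint32_t)];
    std::memset(slot, kDeletedFill, sizeof(slot));
    uint32_t value = 0;
    std::memcpy(&value, slot, sizeof(value));
    return value;
}

// Address that "mov 0x14(%eax),%esi" would try to load from.
uint32_t FaultAddress(uint32_t base) {
    return base + kLRealOffset;
}

// Reproduces the numbers from the analysis above.
void CheckAgainstReport() {
    assert(ReadScrubbedPointer() == 0xcdcdcdcdu);      // value seen in eax
    assert(FaultAddress(0xcdcdcdcdu) == 0xcdcdcde1u);  // unmapped, so SIGSEGV
}
```

Without a fill pattern, the stale slot would most likely still hold the old, valid link_map pointer, which would be consistent with the crash not showing up under a regular allocator.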

This looks *very* similar to
  https://bugzilla.redhat.com/show_bug.cgi?id=210130
which was reported against glibc-2.3.4-2.25 and is alleged to have been fixed 
in 2.3.4-2.39

I believe you have an "old" machine at 2.3.4-2 and a "new" one at 2.3.4-2.41.

I am confused about which machine the crash actually happens on -- the old one 
or the new one?

If the former, you are likely hitting that RH/glibc bug 210130.

If the latter, I am not sure how to proceed; but I am 99% certain that this has 
nothing to do with tcmalloc itself.

Original comment by ppluzhni...@google.com on 10 Jun 2010 at 5:32

GoogleCodeExporter commented 9 years ago

Thanks for the detailed analysis.

> I am confused about which machine the crash actually happens on -- the old
> one or the new one?

The new one. I couldn't reproduce the problem on the old one, although I 
haven't run all our tests on it.

> but I am 99% certain that this has nothing to do with tcmalloc itself.

I agree. My initial point was that it may be worth adding the suggested 
workaround to either the makefiles or the docs of tcmalloc. It doesn't hurt 
anyway.

Original comment by andrey.s...@gmail.com on 10 Jun 2010 at 7:10

GoogleCodeExporter commented 9 years ago

Adding "-Wl,-z,now" (which BTW is more correct than "-Wl,-z -Wl,now") to 
tcmalloc Makefile is (IMO) just covering the problem over -- the problem is 
still there, but it may (or may not) show in some other library, so the other 
library's author gets to deal with it instead of us :-)

I don't think it's reasonable to do that by default in perftools Makefile, but 
it may be reasonable to mention this in a README somewhere.
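
For anyone who applies the workaround and wants to confirm it took effect, here is a rough glibc-specific sketch that checks at run time whether a loaded object carries a "bind now" marker in its dynamic section. It assumes the linker records the request as DT_BIND_NOW, DF_BIND_NOW in DT_FLAGS, or DF_1_NOW in DT_FLAGS_1; which of these is emitted depends on the linker version. The helper names and the BindNowQuery struct are hypothetical, and the "libtcmalloc" fragment is only an example input.

```cpp
#include <elf.h>
#include <link.h>
#include <cassert>
#include <cstring>

// Scan a DT_NULL-terminated dynamic array for any "bind now" marker.
bool DynamicSectionRequestsBindNow(const ElfW(Dyn)* dyn) {
    for (; dyn->d_tag != DT_NULL; ++dyn) {
        if (dyn->d_tag == DT_BIND_NOW) return true;
        if (dyn->d_tag == DT_FLAGS && (dyn->d_un.d_val & DF_BIND_NOW)) return true;
        if (dyn->d_tag == DT_FLAGS_1 && (dyn->d_un.d_val & DF_1_NOW)) return true;
    }
    return false;
}

// Hypothetical query record passed through dl_iterate_phdr.
struct BindNowQuery {
    const char* name_fragment;  // e.g. "libtcmalloc" (example input only)
    bool found;                 // an object with a matching name was loaded
    bool bind_now;              // that object requests immediate binding
};

// dl_iterate_phdr callback: inspect the first object whose name matches.
extern "C" int InspectObject(struct dl_phdr_info* info, size_t, void* data) {
    BindNowQuery* query = static_cast<BindNowQuery*>(data);
    if (info->dlpi_name == nullptr ||
        std::strstr(info->dlpi_name, query->name_fragment) == nullptr) {
        return 0;  // not the object we are looking for, keep iterating
    }
    query->found = true;
    for (int i = 0; i < info->dlpi_phnum; ++i) {
        const ElfW(Phdr)& phdr = info->dlpi_phdr[i];
        if (phdr.p_type == PT_DYNAMIC) {
            const ElfW(Dyn)* dyn = reinterpret_cast<const ElfW(Dyn)*>(
                info->dlpi_addr + phdr.p_vaddr);
            query->bind_now = DynamicSectionRequestsBindNow(dyn);
        }
    }
    return 1;  // nonzero stops the iteration
}

// Look up a loaded object by name fragment and report its binding mode.
BindNowQuery QueryLoadedObject(const char* name_fragment) {
    BindNowQuery query = {name_fragment, false, false};
    dl_iterate_phdr(InspectObject, &query);
    return query;
}
```

Calling `QueryLoadedObject("libtcmalloc")` from a process that has the library loaded would be expected to report `bind_now == true` once the library is linked with "-Wl,-z,now". As noted above, this only confirms the workaround is in place; it does nothing about the underlying problem.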

Original comment by ppluzhni...@google.com on 10 Jun 2010 at 4:44

GoogleCodeExporter commented 9 years ago

So far, I've only seen this once, so I think the right level of documentation 
is in this bug report. :-)  If we start seeing it more, I'll look to document 
it in the README or some such, though I'd be happier if I knew what was 
actually going on first.

It seems to me that the crashing is due to using tcmalloc's debugallocation: 
the dynamic loader is accessing freed memory, which just happens to work most 
of the time (since the memory isn't overwritten), but of course doesn't once 
debugallocation fills freed memory with the deleted pattern.  It's curious 
valgrind didn't complain though.  It's pretty clear we're in 'accessing freed 
memory'-land here, which is the kind of thing valgrind should find.

Thanks for all the effort you guys put into trying to track this down.  I'm 
going to close the bug as Invalid, since it doesn't seem to be perftools 
related, but now we have a good record in case someone else comes along with 
the same problem.

Original comment by csilv...@gmail.com on 10 Jun 2010 at 4:48

Changed state: Invalid
