Closed GoogleCodeExporter closed 9 years ago
What a strange bug. I'm trying to imagine what symbol the dynamic loader could
be trying to load. InsertRange() doesn't make any exotic function calls or
variable accesses.
Looking around, I see a few other reports of segfaults in do_lookup_x. One
reports it was trying to look up errno, which is maybe what's happening here.
There are a few guesses as to what might be going on, but nothing definitive.
I'd be more comfortable putting in documentation of your workaround, if I knew
the problem was still happening in perftools 1.5. Is it practical for you to
update to the latest version?
Original comment by csilv...@gmail.com
on 28 May 2010 at 3:46
I'm afraid not. But in any case, is there that much difference between 1.4 and
1.5 in that part of the library? I don't have access to our code right now;
I'll have a look after the weekend.
Original comment by andrey.s...@gmail.com
on 28 May 2010 at 4:23
It's hard to say what differences might cause the problems you're seeing. I
admit I don't understand it at all. The theories I see on the web, as to what
causes this kind of crash, are libc incompatibility problems. Did you compile
libtcmalloc on the same system you're running it on?
Here's another workaround you can try: modify your application to call some
function that sets errno (maybe do something like read(1, ...)). It would be
interesting to see if either a) that causes a crash itself, or b) that fixes
the crashes you've been seeing. This assumes the symbol being looked up at
crash time is errno. If there's a way to verify that, it would be great, but I
understand that might be too difficult.
Original comment by csilv...@gmail.com
on 28 May 2010 at 4:58
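A minimal sketch of the errno experiment suggested above. The helper name
touch_errno is hypothetical and is not part of tcmalloc or the application. It
reads from an invalid descriptor rather than using read(1, ...), which is an
assumption made here so that the call fails deterministically and cannot block.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: makes a libc call that is certain to fail so that
 * errno gets written. Returns the errno value seen after the call. */
int touch_errno(void)
{
    char byte;

    errno = 0;
    /* -1 is never a valid descriptor, so read() fails with EBADF
     * instead of possibly blocking on a terminal. */
    ssize_t n = read(-1, &byte, 1);
    return (n == -1) ? errno : 0;
}
```

Calling touch_errno() once near the top of main() would show whether an early
errno access crashes by itself or makes the later crash go away.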
> Did you compile libtcmalloc on the same system you're running it on?
No, the library was built on another machine. It runs a slightly older version
of RHEL4, but has the same compiler. But regardless, binary compatibility
across different versions of RHEL is a requirement for us.
> Here's another workaround you can try: modify your application to call some
> function that sets errno...
That would require quite a lot of work and would cause breakage. I'm not sure
I'm allowed to do that. Besides, I wouldn't want to add dummy calls to our
code, since that does not guarantee it won't break somewhere else at a future
library update.
Original comment by andrey.s...@gmail.com
on 31 May 2010 at 4:43
BTW, I don't think that errno is the only candidate for being the cause. For
instance, I see that on Linux SpinLock is implemented through a futex. If futex
is not available, it involves sched_yield and nanosleep. Technically, _any_
symbol, including ones defined by tcmalloc, can trigger symbol resolution. So
I'd be better off resolving them all at an early stage than trying to do it
selectively, by hand.
Original comment by andrey.s...@gmail.com
on 31 May 2010 at 4:57
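Resolving every symbol at an early stage is what eager binding does, either
through the -z now linker option discussed in this thread or through the
LD_BIND_NOW environment variable read by the glibc loader. As a rough sketch,
a process can at least report whether the environment asked for it. Both
helper names are hypothetical, and the check does not detect a library that
was linked with -z now.

```c
#include <stdlib.h>

/* Hypothetical helper: the glibc loader treats LD_BIND_NOW as enabled when
 * it is set to a non-empty string; this applies the same rule to a value. */
int bind_now_value_enables(const char *value)
{
    return value != NULL && value[0] != '\0';
}

/* Hypothetical helper: returns 1 if this process was started with
 * LD_BIND_NOW asking for eager symbol resolution, 0 otherwise. */
int eager_binding_requested(void)
{
    return bind_now_value_enables(getenv("LD_BIND_NOW"));
}
```

Logging eager_binding_requested() at startup would make it easier to tell
which test runs actually used the workaround.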
Just to be clear, I wasn't suggesting calling an errno function as a permanent
workaround to this problem, but merely as a test to see if I can figure out more
precisely what's going on here. The right way to figure it out is to use a
debugger, of course, but you say that's really hard in your context, so we have
to be more creative.
} No, the library was built on another machine.
Does the new machine have the same libc as the old machine?
There are just a lot of variables here; right now my working hypothesis is the
problems you're seeing are due to something in your setup, not something in
tcmalloc.
If you can address all the variables that there currently are -- a more modern
perftools, a more homogeneous execution environment -- then I'm happy to take a
look, but as things stand there are just too many variables to justify taking
any particular action.
Original comment by csilv...@gmail.com
on 31 May 2010 at 3:50
> Just to be clear, I wasn't suggesting calling an errno function as a permanent
> workaround to this problem, but merely as a test to see if I can figure out
> more precisely what's going on here.
I see. But in any case, as I understand it, the symbol has to be referenced
from libtcmalloc, since each symbol is reached through the GOT, which is
module-specific [1].
I've inspected our code and it surely does call tcmalloc many times through
operator new and, perhaps, malloc (it fills STL containers and strings a lot)
before the crash occurs. I can't be sure these calls involve errno, though.
> Does the new machine have the same libc as the old machine?
uname -a
Linux ~~~~~~~ 2.6.9-5.ELsmp #1 SMP Wed Jan 5 19:30:39 EST 2005 i686 i686 i386
GNU/Linux
rpm -qa | grep glibc
glibc-2.3.4-2
glibc-headers-2.3.4-2
glibc-common-2.3.4-2
glibc-kernheaders-2.4-9.1.87
glibc-devel-2.3.4-2
/lib/libc.so.6
GNU C Library stable release version 2.3.4, by Roland McGrath et al.
Copyright (C) 2004 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.
Compiled by GNU CC version 3.4.3 20041212 (Red Hat 3.4.3-9.EL4).
Compiled on a Linux 2.4.20 system on 2004-12-20.
Available extensions:
GNU libio by Per Bothner
crypt add-on version 2.1 by Michael Glad and others
linuxthreads-0.10 by Xavier Leroy
The C stubs add-on version 2.1.2.
BIND-8.2.3-T5B
NIS(YP)/NIS+ NSS modules 0.19 by Thorsten Kukuk
Glibc-2.0 compatibility add-on by Cristian Gafton
GNU Libidn by Simon Josefsson
libthread_db work sponsored by Alpha Processor Inc
Thread-local storage support included.
For bug reporting instructions, please see:
<http://www.gnu.org/software/libc/bugs.html>.
[1] http://www.iecc.com/linker/linker10.html
Original comment by andrey.s...@gmail.com
on 1 Jun 2010 at 5:19
I will try to recompile tcmalloc without the suggested flag but with debugging
info. Perhaps it will show more clearly where in tcmalloc things go wrong. But
reproducing the crash will be tricky and will take time.
Original comment by andrey.s...@gmail.com
on 1 Jun 2010 at 5:24
} glibc-2.3.4-2
And the other glibc is, it looks like,
glibc-common-2.3.4-2.41
I don't know what the difference is between them, but I wonder if that might be
responsible for the problems you're seeing.
} I see. But in any case, as I understand, the symbol has to be referred from
} libtcmalloc
Ah, good point. Your approach of recompiling with debug info may be the most
productive, then. Let me know when you manage to reproduce it again.
Original comment by csilv...@gmail.com
on 1 Jun 2010 at 6:30
A little update: the crash is also present with 1.5. I couldn't collect the
stack, but hopefully I'll be able to do that tomorrow.
Original comment by andrey.s...@gmail.com
on 2 Jun 2010 at 4:06
> I don't know what the difference is between them, but I wonder if that might
> be responsible for the problems you're seeing.
Well, I did say that I suspect glibc. However, I don't think it's an ABI
problem, since the versions differ only by patch level. But that's as far as my
speculation can go.
Original comment by andrey.s...@gmail.com
on 2 Jun 2010 at 4:10
I was luckier than I thought. Here's the stack:
#0 0x002c199e in do_lookup_x () from /lib/ld-linux.so.2
#1 0x002c1e22 in _dl_lookup_symbol_x () from /lib/ld-linux.so.2
#2 0x002c51d6 in fixup () from /lib/ld-linux.so.2
#3 0x002c5110 in _dl_runtime_resolve () from /lib/ld-linux.so.2
#4 0x0052c464 in fork () from /lib/tls/libpthread.so.0
#5 0x08066a16 in CrashHandler (sig=11) at ./src/SignalHandlerPosix.cpp:272
#6 <signal handler called>
#7 0x002c199e in do_lookup_x () from /lib/ld-linux.so.2
#8 0x002c1e22 in _dl_lookup_symbol_x () from /lib/ld-linux.so.2
#9 0x002c51d6 in fixup () from /lib/ld-linux.so.2
#10 0x002c5110 in _dl_runtime_resolve () from /lib/ld-linux.so.2
#11 0x00a70619 in tcmalloc::CentralFreeList::InsertRange (this=0xa8a240,
start=0x8481900, end=0x847cc60, N=-842150451) at ../src/central_freelist.cc:183
#12 0x00a74bc3 in tcmalloc::ThreadCache::ReleaseToCentralCache (this=0x83b3000,
src=0x83b30ec, cl=18, N=53) at ../src/thread_cache.cc:214
#13 0x00a74cda in tcmalloc::ThreadCache::Scavenge (this=0x83b3000) at
../src/thread_cache.cc:237
#14 0x00a6acf7 in do_free_with_callback (ptr=0x84808f0,
invalid_free_fn=0xa668e4
<(anonymous namespace)::InvalidFree(void*)>) at ../src/thread_cache.h:361
#15 0x00a689b5 in DebugDeallocate (ptr=Variable "ptr" is not available.
) at ../src/tcmalloc.cc:993
#16 0x00a7b6b7 in realloc (ptr=0x84b2d20, size=Variable "size" is not available.
) at ../src/debugallocation.cc:1064
#17 0x00525c44 in pthread_create@@GLIBC_2.1 () from /lib/tls/libpthread.so.0
#18 0x0806895f in Impl (this=0x83d48b0) at ./src/SignalHandlerPosix.cpp:328
#19 0x0806751e in SignalHandler (this=0x83d48d0) at
./src/SignalHandlerPosix.cpp:344
#20 0x080598ca in StartUp (lg=@0xbff147e0, bUseConsoleHandler=true,
fileName=0x841a3e0 "localsettings.xml") at ./src/StartUp.cpp:30
#21 0x080559a6 in (anonymous namespace)::COMMain (params=@0xbff1482c) at
./src/CBOSSinMain.cpp:81
#22 0x08055c83 in main (argc=1, argv=0xbff14924) at ./src/CBOSSinMain.cpp:130
As before, the test uses libtcmalloc_minimal_debug.so. Only this time it's the
unpatched version 1.5 with debug info.
Original comment by andrey.s...@gmail.com
on 2 Jun 2010 at 4:34
OK, line 183 is this:
} if (N == Static::sizemap()->num_objects_to_move(size_class_) &&
The rest of the 'if' is
} MakeCacheSpace()) {
which may also be what gdb is reporting.
When you say you compiled with debug info, what is the exact set of compiler
flags you used? (What did you pass ./configure?) Did you have optimization on
as well? Being able to do this without optimization would probably be helpful
as well.
num_objects_to_move() is just an array reference. MakeCacheSpace() ends up
being pretty big, hopefully too big to inline. But it doesn't make any function
calls out of tcmalloc either. I don't know why the dynamic linker would be
necessary in this code.
What does
ldd <your app>
say? What other libraries are being used besides tcmalloc_minimal_debug and
libc and libstdc++?
It looks like you're using pthreads. What is your commandline for creating your
executable? Is libtcmalloc_minimal last on the link line? I'm wondering if
maybe -pthread comes after libtcmalloc, and tcmalloc is getting initialized
with libc malloc, and then trying to use tcmalloc later (for some realloc it's
doing). That would cause a crash, which may manifest itself in what you're
seeing here.
Original comment by csilv...@gmail.com
on 2 Jun 2010 at 6:58
> When you say you compiled with debug info, what is the exact set of compiler
> flags you used? (What did you pass ./configure?) Did you have optimization on
> as well?
Debug info was enabled with -g3. Yes, optimization is on, with -O3 and a few
other flags (I don't have access to the script right now).
Regarding your other questions, libtcmalloc is the first in the linker command
line. It is loaded before any other libraries, including pthread. Other
libraries include STLPort, libstdc++, ICU and quite a few others.
Original comment by andrey.s...@gmail.com
on 2 Jun 2010 at 7:15
I think that is the problem. Try putting libtcmalloc last on the linkline,
after everything except for libc and libstdc++. See if the problems go away
then. I suspect the problem is an alloc/free (or in your case, alloc/realloc)
mismatch, due to when the libraries get loaded in. The DL_NOW flag works around
this problem by changing the way the dynamic loader does symbol resolution.
Original comment by csilv...@gmail.com
on 2 Jun 2010 at 7:22
I don't understand. If tcmalloc is loaded first, other libraries should use
malloc/free and friends from tcmalloc, and not from any other library. And
that's exactly what we're trying to achieve by using tcmalloc.
Original comment by andrey.s...@gmail.com
on 2 Jun 2010 at 7:30
> The DL_NOW flag works around this problem by changing the way the dynamic
> loader does symbol resolution.
I may be missing something, but as I understand it, the relocation order is not
exactly related to run time symbol resolution. The linker builds the
process-wide table of symbols as it loads modules, in the order of loading
them. Lazy symbol resolution then fills entries in module-specific GOTs
according to this relocation table, and is not related to module loading order.
Original comment by andrey.s...@gmail.com
on 2 Jun 2010 at 7:38
} If tcmalloc is loaded first, other libraries should use
} malloc/free and friends from tcmalloc
Yes, but to load tcmalloc first, you need to list it last on the linkline.
One way to verify the load order is to run ldd on your binary. As I understand
it, the dynamic loader reads these libraries from the bottom up.
} I may be missing something, but as I understand the relocation order is not
} exactly related to run time symbol resolution.
I'm probably the one missing something. The issue is with weak symbols that are
defined in one .so and then redefined in another. I believe it's possible to
get a different answer with DL_BIND_NOW than without, for libraries loaded
between the first definition of the weak symbol and the second. But I could be
smoking crack.
Original comment by csilv...@gmail.com
on 2 Jun 2010 at 8:32
> One way to verify the load order is to run ldd on your binary. As I
> understand it, the dynamic loader reads these libraries from the bottom up.
The output ldd produces is the result of the linker actually loading the
libraries; it is the linker that writes it. I can't imagine how it could load
libraries in reverse order.
> The issue is with weak symbols that are defined in one .so and then redefined
> in another.
Weak symbols are not redefined. If a symbol is present in two libs, the one
that is loaded first defines it for the application. References to the symbol
from the lib that is loaded second are relocated to point to the first lib.
That way pointers to symbols stay stable for the whole run time of the
application.
Original comment by andrey.s...@gmail.com
on 3 Jun 2010 at 2:15
BTW, I am so sure about the ldd output and symbol relocation because I can see
that behavior confirmed in practice. For instance, in this very crash pthread
calls into tcmalloc because it's loaded prior to libc and defines its realloc.
Original comment by andrey.s...@gmail.com
on 3 Jun 2010 at 2:21
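Besides ldd, the loaded objects can also be listed from inside the running
process. The sketch below uses glibc's dl_iterate_phdr(), which walks the
loader's list of objects starting with the main program. The helper names are
hypothetical, and reading the output as the symbol search order is an
assumption that is not verified for glibc 2.3.4.

```c
#define _GNU_SOURCE
#include <link.h>
#include <stdio.h>

/* Callback for dl_iterate_phdr(): prints one loaded object per call and
 * counts the objects visited. The main program comes first and is
 * reported with an empty name. */
static int print_object(struct dl_phdr_info *info, size_t size, void *data)
{
    int *count = data;
    const char *name = info->dlpi_name;

    (void)size;
    printf("%2d: %s\n", *count, (name && name[0]) ? name : "(main program)");
    ++*count;
    return 0; /* zero means: keep iterating */
}

/* Hypothetical helper: lists the loaded objects and returns their number. */
int list_loaded_objects(void)
{
    int count = 0;

    dl_iterate_phdr(print_object, &count);
    return count;
}
```

Comparing this listing with the ldd output for the same binary would show
where libtcmalloc_minimal sits relative to libpthread and libc at run time.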
As long as tcmalloc gets loaded after libc (because it's prior to it on the
link line), everyone will use tcmalloc after it's loaded. The question is what
happens between the time libc is loaded and tcmalloc is loaded. That is where
the problem started for you, I'm guessing.
Anyway, the proof is in the pudding. What happens when you move libtcmalloc to
be last on the link line (except for libc and libstdc++)?
Original comment by csilv...@gmail.com
on 3 Jun 2010 at 2:35
> When you say you compiled with debug info, what is the exact set of compiler
> flags you used? (What did you pass ./configure?) Did you have optimization on
> as well?
Here's the (simplified) build script:
mkdir tmp
cd tmp
export CC="gcc410"
export CXX="g++410"
export CFLAGS="-I ${ROOT_DIR}/ThirdParty/STLport/stlport -march=pentium4 -mmmx -msse -msse2 -mfpmath=sse -minline-all-stringops -O3 -ftree-vectorize -fno-strict-aliasing -fvisibility-inlines-hidden -g3"
export CXXFLAGS="$CFLAGS"
export LDFLAGS="-L${ROOT_DIR}/ThirdParty/STLport/lib/i686-pc-linux-gnu-gcc"
export LIBS="-lstlport_gcc"
../configure --enable-shared --disable-static --enable-frame-pointers
make -j 2
Original comment by andrey.s...@gmail.com
on 3 Jun 2010 at 6:06
We specify neither libc nor libstdc++ in the linker command line. The linker
adds them implicitly, as if they were specified last. Here's the ldd output:
libtcmalloc_minimal.so.0 =>
/home/asemashe/Bin/Substitute/libtcmalloc_minimal.so.0 (0x004fe000)
libstlport_gcc.so.5.1 => /home/asemashe/Bin/Actual/libstlport_gcc.so.5.1
(0x00cf8000)
librt.so.1 => /lib/tls/librt.so.1 (0x009c6000)
libpthread.so.0 => /lib/tls/libpthread.so.0 (0x00557000)
libdl.so.2 => /lib/libdl.so.2 (0x0043b000)
libicuuc.so.34 => /home/asemashe/Bin/Actual/libicuuc.so.34 (0x00111000)
libicudata.so.34 => /home/asemashe/Bin/Actual/libicudata.so.34 (0x00d83000)
libicui18n.so.34 => /home/asemashe/Bin/Actual/libicui18n.so.34 (0x007f2000)
libicule.so.34 => /home/asemashe/Bin/Actual/libicule.so.34 (0x0091c000)
libiculx.so.34 => /home/asemashe/Bin/Actual/libiculx.so.34 (0x0021b000)
libicutu.so.34 => /home/asemashe/Bin/Actual/libicutu.so.34 (0x00224000)
libboost_regex.so.1.40.0 =>
/home/asemashe/Bin/Actual/libboost_regex.so.1.40.0 (0x00317000)
libboost_thread.so.1.40.0 =>
/home/asemashe/Bin/Actual/libboost_thread.so.1.40.0 (0x00238000)
libwin32.so => /home/asemashe/Bin/Substitute/libwin32.so (0x005f8000)
libported_com.so => /home/asemashe/Bin/Actual/libported_com.so (0x0024a000)
libported_ole.so => /home/asemashe/Bin/Actual/libported_ole.so (0x00ac2000)
libvas_regapi.so => /home/asemashe/Bin/Actual/libvas_regapi.so (0x00254000)
libGlobalObserver.so => /home/asemashe/Bin/Actual/libGlobalObserver.so
(0x0058d000)
libboost_program_options.so.1.40.0 =>
/home/asemashe/Bin/Actual/libboost_program_options.so.1.40.0 (0x00746000)
libgcc_s.so.1 => /opt/lib/libgcc_s.so.1 (0x00285000)
libc.so.6 => /lib/tls/libc.so.6 (0x00608000)
libstdc++.so.6 => /opt/lib/libstdc++.so.6 (0x009da000)
libm.so.6 => /lib/tls/libm.so.6 (0x00441000)
/lib/ld-linux.so.2 (0x002f0000)
libboost_filesystem.so.1.40.0 =>
/home/asemashe/Bin/Actual/libboost_filesystem.so.1.40.0 (0x0028f000)
libboost_system.so.1.40.0 =>
/home/asemashe/Bin/Actual/libboost_system.so.1.40.0 (0x002a4000)
Actually, I think that libstdc++ and other libs below it are brought in by
dependent libraries. STLPort depends on libstdc++ and libm, and some of our
libs depend on boost_system and boost_filesystem.
I still don't understand how tcmalloc could be loaded last and yet replace
malloc functions from libc for other libraries. Even then, what will we
discover if tcmalloc is moved to the end of the linker line? Will it prevent
other libs from using tcmalloc? Could you explain your theory so I can justify
the experiment with our build?
Original comment by andrey.s...@gmail.com
on 3 Jun 2010 at 6:34
Hmm, I tried to reproduce the problem, but I can't. Maybe the situation I was
thinking of can only occur with dlopen-ed libraries or something. Ok, I'm
letting that theory go for the moment.
Unfortunately, I don't have a good one to replace it. There's no credible need
for a dl lookup at the time of the crash, as far as I can see from the stack
trace. I can't figure out from it what's going on.
I think the next step would be to get a non-optimized stacktrace, with all code
(both tcmalloc and your application) compiled with "-O0 -g".
I don't know how much debugging you want to do for this. You've found a
workaround that works for you, and I totally understand if you're happy to go
with that and just move on. But if not, are you up for trying to get a
non-optimized stacktrace?
Original comment by csilv...@gmail.com
on 3 Jun 2010 at 1:58
> There's no credible need for a dl lookup at the time of the crash, as far as
> I can see for the stack trace. I can't figure out from it, what's going on.
Well, there are functions called in that "if" statement. Unless inlined, any of
them can go through PLT/GOT and trigger symbol resolution. I tried to track
down which one it is, but did not succeed.
> But if not, are you up for trying to get a non-optimized stacktrace?
I can try with non-optimized tcmalloc next week. But the application will still
be optimized, because disabling optimization there is too major a change. Its
optimization doesn't do any harm in this case anyway.
Original comment by andrey.s...@gmail.com
on 3 Jun 2010 at 2:43
} I can try with non-optimized tcmalloc next week. But the application will
} still be optimized because disabling it is a too major change. Its
} optimization doesn't do any harm in this case anyway.
Sounds great. You're right -- an unoptimized tcmalloc should be enough.
Original comment by csilv...@gmail.com
on 3 Jun 2010 at 4:13
Here's the stack with the unoptimized version:
#0 0x002c199e in do_lookup_x () from /lib/ld-linux.so.2
#1 0x002c1e22 in _dl_lookup_symbol_x () from /lib/ld-linux.so.2
#2 0x002c51d6 in fixup () from /lib/ld-linux.so.2
#3 0x002c5110 in _dl_runtime_resolve () from /lib/ld-linux.so.2
#4 0x0052c464 in fork () from /lib/tls/libpthread.so.0
#5 0x08066a16 in CrashHandler (sig=11) at ./src/SignalHandlerPosix.cpp:272
#6 <signal handler called>
#7 0x002c199e in do_lookup_x () from /lib/ld-linux.so.2
#8 0x002c1e22 in _dl_lookup_symbol_x () from /lib/ld-linux.so.2
#9 0x002c51d6 in fixup () from /lib/ld-linux.so.2
#10 0x002c5110 in _dl_runtime_resolve () from /lib/ld-linux.so.2
#11 0x001e7502 in tcmalloc::CentralFreeList::InsertRange (this=0x2023a0,
start=0x942f900, end=0x942ac60, N=32) at ../src/central_freelist.cc:183
#12 0x001ec993 in tcmalloc::ThreadCache::ReleaseToCentralCache (this=0x9361000,
src=0x93610ec, cl=18, N=53) at ../src/thread_cache.cc:214
#13 0x001ecaa4 in tcmalloc::ThreadCache::Scavenge (this=0x9361000) at
../src/thread_cache.cc:237
#14 0x001e2d90 in tcmalloc::ThreadCache::Deallocate (this=0x9361000,
ptr=0x942e8f0, cl=12) at ../src/thread_cache.h:361
#15 0x001e2e68 in (anonymous namespace)::do_free_with_callback (ptr=0x942e8f0,
invalid_free_fn=0x1de208 <(anonymous namespace)::InvalidFree(void*)>)
at ../src/tcmalloc.cc:971
#16 0x001e2f7b in (anonymous namespace)::do_free (ptr=0x942e8f0) at
../src/tcmalloc.cc:993
#17 0x001e3825 in MallocBlock::ProcessFreeQueue (b=0x9460d10, size=172,
max_free_queue_size=10485760) at ../src/debugallocation.cc:603
#18 0x001e39ca in MallocBlock::Deallocate (this=0x9460d10, type=-271733872) at
../src/debugallocation.cc:573
#19 0x001def0c in DebugDeallocate (ptr=0x9460d20, type=-271733872) at
../src/debugallocation.cc:974
#20 0x001f2bcf in realloc (ptr=0x9460d20, size=1024) at
../src/debugallocation.cc:1064
#21 0x00525c44 in pthread_create@@GLIBC_2.1 () from /lib/tls/libpthread.so.0
#22 0x0806895f in Impl (this=0x93828b0) at ./src/SignalHandlerPosix.cpp:328
#23 0x0806751e in SignalHandler (this=0x93828d0) at
./src/SignalHandlerPosix.cpp:344
#24 0x080598ca in StartUp (lg=@0xbff501b0, bUseConsoleHandler=true,
fileName=0x93bf3e0 "localsettings.xml") at ./src/StartUp.cpp:30
#25 0x080559a6 in (anonymous namespace)::COMMain (params=@0xbff501fc) at
./src/CBOSSinMain.cpp:81
#26 0x08055c83 in main (argc=1, argv=0xbff502f4) at ./src/CBOSSinMain.cpp:130
Nothing new.
Original comment by andrey.s...@gmail.com
on 8 Jun 2010 at 4:26
I tried to analyze the disassembly to uncover which function call causes the
crash. It did not show it explicitly, but from the context it looks like it's
trying to call MakeCacheSpace() from the "if" statement. I don't know how to
find out for sure, since I don't know how to decode the PLT entry correctly.
I attached the disassembly of the InsertRange() function (the execution leaves
it at address 0x001e74fd) and the PLT entry it calls. Also, I attached the
readelf output in case you can make better use of it than I did. I'll keep
the core file in case you need any other info out of it.
Original comment by andrey.s...@gmail.com
on 8 Jun 2010 at 4:56
Attachments:
We're out of my depth, I think, but we have some experts here who may be able
to make sense of what's going on. I'll ping them.
Original comment by csilv...@gmail.com
on 8 Jun 2010 at 6:52
One more question: I see the function in frame 23 is called SignalHandler. It
doesn't look like it's actually in a signal handler at this time, but is that
function also called via a signal handler? I ask because calling
pthread_create in a signal handler is definitely not kosher. What signal
handling does your application do?
Original comment by csilv...@gmail.com
on 8 Jun 2010 at 7:40
The SignalHandler is our function that sets up signal handling. It's not a
handler and there is no signal at that point. As a part of its work,
SignalHandler creates a dedicated thread that will wait for Ctrl+C in sigwait;
that is the thread being spawned by pthread_create.
Original comment by andrey.s...@gmail.com
on 9 Jun 2010 at 3:30
And BTW, if you wonder about the naming, SignalHandler and Impl are actually
class constructors.
Original comment by andrey.s...@gmail.com
on 9 Jun 2010 at 3:35
OK, thanks for the info. It was a nice theory while it lasted... Just to
confirm: at the time the crash happened, no actual signal-handling had been
done yet, right? (I do see that the signal handler is being called in the
stacktrace you give, but that's already after the program was crashing, if I
understand it right.)
From what I've seen so far, I'm pretty confident the problem isn't in tcmalloc.
I'm not sure exactly where it might be. The next step, I think, would be to
install libc-debug so we can get more insight into what is actually happening
during the crash, and maybe also to look more into the assembly as you've
already started to do.
I don't know if you want to spend the time to do this, especially since you
have a functioning workaround. It may not be worth the time it takes to figure
this out.
Original comment by csilv...@gmail.com
on 9 Jun 2010 at 6:25
Hello, Andrey,
I've looked over this issue, and have a strong suspicion that the problem
has nothing to do with tcmalloc. A more likely explanation is that you do
something non-kosher in your signal handlers (it is notably hard to write
correct multithreaded programs which handle signals).
Heap corruption is another possible candidate. Is your program
Valgrind-clean?
A couple of things might help to analyze this further.
First please post the output from GDB "thread apply all where" for the
coredump you already have.
If you can install glibc-debuginfo package, GDB should be able to show the
glibc source for do_lookup_x(). In that case, please also do "info locals"
in the crashing do_lookup_x() frame.
If you can't install glibc-debuginfo, please do (in do_lookup_x() frame):
info regs
disas
Thanks,
Original comment by ppluzhni...@google.com
on 9 Jun 2010 at 6:31
> Just to confirm: at the time the crash has happened, no actual
> signal-handling had been done yet, right?
That's right. The sigwait has not been called and no signals have been raised
yet that I'm aware of. The CrashHandler is installed to handle SIGSEGV and
SIGBUS synchronously, before the thread for sigwait is spawned. That's why it
is called when the app crashes. But it has nothing to do with the crashes
themselves, since I added it _after_ I started to observe the problem, in an
attempt to debug it.
> The next step, I think, would be to install libc-debug...
Unfortunately, I don't have the power to alter the software on the machine,
except for what I write.
> A more likely explanation is that you do something non-kosher in your signal
> handlers (it is notably hard to write correct multithreaded programs which
> handle signals).
Yes, I'm aware of the issues with writing signal handlers. I can assure you,
there's nothing wrong with them. At least, it's not the signal handler that
causes the crash, since it hasn't been called yet.
> Is your program Valgrind-clean?
Yes, I checked that before creating the ticket. The tcmalloc_debug, which is
actually run here, doesn't complain either.
> First please post the output from GDB "thread apply all where" for the
> coredump you already have.
It's the same as what is presented in this thread, since there is only one
thread so far. pthread_create crashes before it spawns the second one.
> If you can't install glibc-debuginfo, please do (in do_lookup_x() frame)...
Ok, I'll do that tomorrow.
Original comment by andrey.s...@gmail.com
on 9 Jun 2010 at 7:15
> If you can't install glibc-debuginfo, please do (in do_lookup_x() frame)...
Here it is. I took registers of both calls to do_lookup_x.
Original comment by andrey.s...@gmail.com
on 10 Jun 2010 at 4:01
Attachments:
At crash point: eax == 0xcdcdcdcd
Crashing instruction:
0x002c199e <do_lookup_x+94>: mov 0x14(%eax),%esi
Since 0xcdcdcdcd is the deleted pattern, it is fairly safe to assume that
something in do_lookup_x is accessing free()d memory.
The source reads:
25  do_lookup_x (const char *undef_name, unsigned long int hash,
26               const ElfW(Sym) *ref, struct sym_val *result,
27               struct r_scope_elem *scope, size_t i,
28               const struct r_found_version *const version, int flags,
29               struct link_map *skip, int type_class)
30  {
31    struct link_map **list = scope->r_list;
32    size_t n = scope->r_nlist;
33    struct link_map *map;
34
35    do
36      {
37        const ElfW(Sym) *symtab;
38        const char *strtab;
39        const ElfW(Half) *verstab;
40        Elf_Symndx symidx;
41        const ElfW(Sym) *sym;
42        int num_versions = 0;
43        const ElfW(Sym) *versioned_sym = NULL;
44
45        map = list[i]->l_real;
46
47        /* Here come the extra test needed for `_dl_lookup_symbol_skip'. */
48        if (skip != NULL && map == skip)
49          continue;
50
51        /* Don't search the executable when resolving a copy reloc. */
52        if ((type_class & ELF_RTYPE_CLASS_COPY) && map->l_type == lt_executable)
53          continue;
AFAICT, the crash is happening on line 45, and indeed offsetof(struct link_map,
l_real) == 0x14.
So I think list[i] is dangling at that point.
This looks *very* similar to
https://bugzilla.redhat.com/show_bug.cgi?id=210130
which was reported against glibc-2.3.4-2.25 and is alleged to have been fixed
in 2.3.4-2.39.
I believe you have an "old" machine at 2.3.4-2 and a "new" one at 2.3.4-2.41.
I am confused about which machine the crash actually happens on -- the old one
or the new one?
If the former, you are likely hitting that RH/glibc bug 210130.
If the latter, I am not sure how to proceed; but I am 99% certain that this has
nothing to do with tcmalloc itself.
Original comment by ppluzhni...@google.com
on 10 Jun 2010 at 5:32
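The arithmetic behind that diagnosis can be sketched as follows. The struct is
a mock, not the real glibc definition: its field names and order are an
assumption chosen to reproduce the stated offset of 0x14, and every pointer is
modelled as a 4-byte integer so the layout is the same on any host.

```c
#include <stddef.h>
#include <stdint.h>

/* Mock of the leading fields of a 32-bit link_map (assumed layout,
 * for illustration only). */
struct mock_link_map32 {
    uint32_t l_addr;
    uint32_t l_name;
    uint32_t l_ld;
    uint32_t l_next;
    uint32_t l_prev;
    uint32_t l_real; /* lands at offset 0x14 */
};

/* Address touched by `mov 0x14(%eax),%esi` for a given %eax value. */
uint32_t faulting_address(uint32_t eax)
{
    return eax + (uint32_t)offsetof(struct mock_link_map32, l_real);
}
```

With eax == 0xcdcdcdcd the load targets 0xcdcdcde1, which is not a mapped
address in this process, so the read of list[i]->l_real faults.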
Thanks for the detailed analysis.
> I am confused about which machine the crash actually happens on -- the old
> one or the new one?
The new one. I couldn't reproduce the problem on the old one, although I
haven't run all our tests on it.
> but I am 99% certain that this has nothing to do with tcmalloc itself.
I agree. My initial point was that it may be worth adding the suggested
workaround to either the makefiles or the docs of tcmalloc. It doesn't hurt
anyway.
Original comment by andrey.s...@gmail.com
on 10 Jun 2010 at 7:10
Adding "-Wl,-z,now" (which BTW is more correct than "-Wl,-z -Wl,now") to
tcmalloc Makefile is (IMO) just covering the problem over -- the problem is
still there, but it may (or may not) show in some other library, so the other
library's author gets to deal with it instead of us :-)
I don't think it's reasonable to do that by default in perftools Makefile, but
it may be reasonable to mention this in a README somewhere.
Original comment by ppluzhni...@google.com
on 10 Jun 2010 at 4:44
So far, I've only seen this once, so I think the right level of documentation
is in this bug report. :-) If we start seeing it more, I'll look to document
it in the README or some such, though I'd be happier if I knew what was
actually going on first.
It seems to me that the crashing is due to using tcmalloc's debugallocation:
the dl is accessing freed memory, which just happens to work most of the time
(since the memory isn't overwritten), but of course doesn't with tcmalloc.
It's curious valgrind didn't complain though. It's pretty clear we're in
'accessing free memory'-land here, which is the kind of thing valgrind should
find.
Thanks for all the effort you guys put into trying to track this down. I'm
going to close the bug as Invalid, since it doesn't seem to be perftools
related, but now we have a good record in case someone else comes along with
the same problem.
Original comment by csilv...@gmail.com
on 10 Jun 2010 at 4:48
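To illustrate why a stale pointer that "just happens to work" stops working
under a debugging allocator, here is a small sketch of poisoning a block on
release. The fill byte 0xCD is an assumption chosen to match the 0xcdcdcdcd
value seen in the registers, and both helper names are hypothetical. This is
not tcmalloc's actual code.

```c
#include <stdint.h>
#include <string.h>

/* Assumed fill byte, matching the pattern observed in %eax. */
#define POISON_BYTE 0xCD

/* Hypothetical helper: overwrites a block the way a debugging allocator
 * might just before returning it to the free list. */
void poison_block(void *block, size_t size)
{
    memset(block, POISON_BYTE, size);
}

/* Reads a 32-bit word out of a block without alignment assumptions. */
uint32_t load_word(const void *block, size_t offset)
{
    uint32_t word;

    memcpy(&word, (const unsigned char *)block + offset, sizeof word);
    return word;
}
```

Any pointer stored in such a block reads back as 0xcdcdcdcd afterwards, so
code that keeps using the block dereferences an invalid address instead of
silently seeing the old contents.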
Original issue reported on code.google.com by
andrey.s...@gmail.com
on 28 May 2010 at 5:22