qchbai / gperftools

Automatically exported from code.google.com/p/gperftools
BSD 3-Clause "New" or "Revised" License
0 stars 0 forks source link

Large reallocs in dlopen on OS X #374

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1.  Build a project in 64-bit with tcmalloc_minimal.
2.  dlopen a framework with objective-c in it or trigger something that will 
(e.g., clicking on a menu in the menu bar).

What is the expected output? What do you see instead?
dlopen works fine, but in trying to remethodize any classes that have 
categories attached to them, it may try to realloc huge blocks because of the 
aggressive realloc growth added in response to Issue 52.  The built-in realloc 
does no such thing and hence performs fine.  See below for more information.

What version of the product are you using? On what operating system?
tcmalloc 1.8.3
OS X 10.6.8

Please provide any additional information below.
Things go horribly wrong in the attachMethodLists function, as reached in the 
following stack trace:
#0  0x00007fff8334427a in _realloc_internal ()
#1  0x00007fff833512ea in attachMethodLists ()
#2  0x00007fff83351716 in attachCategoryMethods ()
#3  0x00007fff833517c4 in remethodizeClass ()
#4  0x00007fff8334251b in _read_images ()
#5  0x00007fff83358721 in map_images_nolock ()
#6  0x00007fff83341761 in map_images ()
#7  0x00007fff5fc039b9 in 
__dyld__ZN4dyldL18notifyBatchPartialE17dyld_image_statesbPFPKcS0_jPK15dyld_image
_infoE ()
#8  0x00007fff5fc0c839 in 
__dyld__ZN11ImageLoader4linkERKNS_11LinkContextEbbRKNS_10RPathChainE ()
#9  0x00007fff5fc04d48 in 
__dyld__ZN4dyld4linkEP11ImageLoaderbRKNS0_10RPathChainE ()
#10 0x00007fff5fc08f68 in __dyld_dlopen ()
#11 0x00007fff8851de40 in dlopen ()

Now, this is the particular piece of problem code in attachMethodLists:
0x00007fff833512c5  <+0074>  lea    0x0(,%rax,8),%r13
0x00007fff833512cd  <+0082>  callq  0x7fff83359170 <dyld_stub_malloc_size>
0x00007fff833512d2  <+0087>  mov    %rax,%r12
0x00007fff833512d5  <+0090>  mov    -0x38(%rbp),%rax
0x00007fff833512d9  <+0094>  mov    0x20(%rax),%rbx
0x00007fff833512dd  <+0098>  lea    (%r12,%r13,1),%rsi
0x00007fff833512e1  <+0102>  mov    0x10(%rbx),%rdi
0x00007fff833512e5  <+0106>  callq  0x7fff8334426a <_realloc_internal>

Each time we call tc_realloc on an already seen chunk, tcmalloc tries to be 
smart and allocate 1.25x the amount requested.  In this case, we're adding 
category methods to NSObject which has a huge number of categories on it.  
Therefore, we call realloc on it a lot.  The lea instruction is used to simply 
add 8 bytes to the current block size.  After 10 times through this process, 
we've blown something up from a couple hundred KB to a couple hundred MB, when 
it really only wanted to increase it by 80 bytes.

For reference, the source for Apple's scalable zone allocator, the default OS X 
malloc, is available at 
http://www.opensource.apple.com/source/Libc/Libc-583/gen/scalable_malloc.c.  
They don't inflate the size of their realloc to try to be clever.  Admittedly, 
it's the code in attachMethodLists that's completely bogus and using 
malloc_size incorrectly, but there's nothing to be done about that except file 
a bug report (I filed rdar://10359348 with Apple: 
http://openradar.appspot.com/radar?id=1403415).

This bug makes tcmalloc essentially unusable on OS X in my application.  The 
only reason this became noticeable recently is because the switch to 
malloc_zones meant that all the loaded libraries were using tcmalloc as well.

This bug is in such a fundamental piece of code on OS X that it's hard to 
justify the current solution.  I think the easiest solution is to simply 
realloc as requested or by increments rather than aggressively, at least on OS 
X.  Alternatively, one could cheat the results of malloc_size and return the 
requested size instead of the memory block size.  This would mean that repeated 
reallocs would simply succeed.  This seems like a safe fix as well (though it 
would mess up anyone who was manually implementing their own realloc-like call 
by checking malloc_size to see if they should free and malloc a new block).

Original issue reported on code.google.com by mill...@gmail.com on 28 Oct 2011 at 1:00

GoogleCodeExporter commented 9 years ago
Wow, that's an interesting dissassembly output.  I'm surprised 
attachMethodLists is calling malloc_size().

The source for attachMethodLists seems to be available:
   http://www.opensource.apple.com/source/objc4/objc4-493.9/runtime/objc-runtime-new.mm?txt

I don't see any call to realloc in there, much less malloc_size().  Can you 
point out the source code that's being executed here?  I'd like to understand 
better what that code is doing.

Original comment by csilv...@gmail.com on 28 Oct 2011 at 1:11

GoogleCodeExporter commented 9 years ago
Phew--thought I was going crazy there for a second!  You linked to the latest 
version that ships with 10.7.2.  Here's the version for 10.6.8:
http://www.opensource.apple.com/source/objc4/objc4-437.3/runtime/objc-runtime-ne
w.m

And the happy block of code that's doing all the evil stuff:

static void 
attachMethodLists(class_t *cls, method_list_t **lists, int count, 
                  BOOL methodsFromBundle, BOOL *outVtablesAffected)
{
    rwlock_assert_writing(&runtimeLock);

    BOOL vtablesAffected = NO;
    size_t listsSize = count * sizeof(*lists);

    // Create or extend method list array
    // Leave `count` empty slots at the start of the array to be filled below.
    if (!cls->data->methods) {
        // no bonus method lists yet
        cls->data->methods = _calloc_internal(1 + count, sizeof(*lists));
    } else {
        size_t oldSize = malloc_size(cls->data->methods);
        cls->data->methods = 
            _realloc_internal(cls->data->methods, oldSize + listsSize);
        memmove(cls->data->methods + count, cls->data->methods, oldSize);
    }

Original comment by mill...@gmail.com on 28 Oct 2011 at 1:33

GoogleCodeExporter commented 9 years ago
Does this mean we can solve the problem by just telling people to upgrade? :-)

Yeah, this is just buggy code.  I'm glad the updated objc fixes it.

The question is what to do for this old code.  I don't want to revert the 
realloc behavior, because it really helps programs that realloc a lot, like 
python. (They're arguably still poorly written, but not as much as this.)  I 
don't want to have malloc_size() lie either -- that's just punishing apps that 
do the right thing.

I guess we'll be stuck with 10.6.x (and below) for a while, so it's worth 
coming up with a fix for this.  I don't want to punish python users using 
tcmalloc on windows, though, so I hate to change the realloc behavior, even 
with a typedef.  I guess we could lie about malloc_size() only for 10.6.x and 
below, though that makes me very sad.  Any other ideas?

Original comment by csilv...@gmail.com on 28 Oct 2011 at 1:54

GoogleCodeExporter commented 9 years ago
It's a tough question, I think.  I'm not sure the solution to Issue 52 was 
necessarily ideal because it seems like an optimization for some poorly written 
code in the Python implementation.  One thing might be to take a cue from 
Apple's szone.  If szone_realloc has to actually allocate a new block, it tries 
a deferred copy-on-read using vm_copy.  You'd have to take out the aggressive 
growth on realloc for that to be a fix for both issues, which would probably 
cause Python to slow down.  It might be slightly better than just doing the 
memcpy though.

You could put a note in the docs and add a configure option.  But that just 
pushes the burden onto the user in a bad way.

The compromise solution would be to allow it to grow aggressively at first but 
put a cap on it so it doesn't get unreasonable hugely.  But there might be 
legitimate use cases for very large sizes and I don't know what Python's 
realloc strategy is and how big those objects are getting.

I think I'm leaning towards lying in malloc_size if running on any system 
before 10.7.  I have trouble thinking of places where it would cause actual 
problems.  You could imagine someone implementing a custom realloc-type 
function as I mentioned that checks malloc_size to see if there's enough space. 
 For that use-case, there will be a performance hit because they'll be lied to 
about how much memory they actually had and have to malloc and free 
"erroneously".  But that seems like an artificial example from my perspective.  
Normally, those unassigned region at the end is pretty small, right?  In the 
case of tcmalloc's realloc, then it can get significant.  But if they're using 
realloc, why are they asking for malloc_size?  Besides profiling and debugging, 
what are they actually going to do with that information if not realloc?  I 
have trouble thinking up an answer to that... maybe you have some other 
hypothetical situations in which this would cause problems.  I agree it's 
inelegant and incorrect, but it seems fairly harmless.

For good measure, you could make sure that the tcmalloc versions of the call 
return the correct result, and just lie for the malloc_zone size function.

Original comment by mill...@gmail.com on 28 Oct 2011 at 2:42

GoogleCodeExporter commented 9 years ago
OK, let's go with the malloc_size hack.  I agree that we should only adjust the 
malloc-zone malloc_size, and not tc_malloc_size.

Would you like to draw up a patch?

Original comment by csilv...@gmail.com on 28 Oct 2011 at 6:20

GoogleCodeExporter commented 9 years ago
Yeah, I'm working on it right now.  As a heads up, I naively assumed there was 
already mechanism in place to get the requested size from a malloc'ed piece of 
memory.  I can't seem to find any such functionality.  I'm going to maintain 
that information by sticking a size_t at the back of the allocated block.  This 
change might cause some blocks to allocate slightly larger than before, since 
we need to have space for that size_t at the end.  Is that okay?  Can you think 
of a better way to do that?  This also means the code leaks out of 
libc_override_osx.h a bit into tcmalloc.cc:GetSizeWithCallback, which has to 
report (allocatedSize - sizeof(size_t)) so people don't inadvertently think 
they can write over that space.

Additionally, it seems this bug with the Objective-C runtime is even more 
confined--it's only on 64-bit applications under 10.5 and 10.6.  10.5 and 10.6 
only use the modern runtime (objc-runtime-new.m) for 64-bit applications, and 
use the legacy runtime (objc-runtime-old.m) in 32-bit.  See 
http://developer.apple.com/library/mac/#documentation/Cocoa/Conceptual/ObjCRunti
meGuide/Articles/ocrtVersionsPlatforms.html for more information.

Original comment by mill...@gmail.com on 28 Oct 2011 at 5:38

GoogleCodeExporter commented 9 years ago
No, I afraid it won't work to store the requested size in the allocated block.  
To work around a bug in a client library like this, we'd want to make a really 
localized change that doesn't not affect performance or behavior of anyone else.

} Additionally, it seems this bug with the Objective-C runtime is even more 
confined

I'm wondering if we should just document it and leave it be, then.

Original comment by csilv...@gmail.com on 28 Oct 2011 at 6:59

GoogleCodeExporter commented 9 years ago
That might be best, though I must say that I would vote for the default 
behavior on the affected platforms to be that realloc doesn't grow 
aggressively.  szone_realloc does nothing particularly fancy besides that 
vm_copy so I doubt you'd see a speed penalty in Python.  Has anyone measured 
the performance relative to szone on OS X?  It seemed like Issue 52 was a Linux 
thing.

The downside of prioritizing the Issue 52 fix first in this case is that anyone 
hoping to use tcmalloc in a Cocoa application would have to modify it before 
it's usable on the problem platforms.

Original comment by mill...@gmail.com on 28 Oct 2011 at 7:30

GoogleCodeExporter commented 9 years ago
Do you want to test it?  You could try running the python script in issue 52, 
with a DYLD_PRELOAD (or however one does it on os x) of tcmalloc, and compare 
the speed with the realloc the way it is now, and back to a 'non-agressive' 
version.  If the non-agressive version works fine on os x, we could maybe put 
an #ifdef in there.

Original comment by csilv...@gmail.com on 28 Oct 2011 at 10:02

GoogleCodeExporter commented 9 years ago
I guess the older, broken code is getting older all the time.  I'll leave this 
bug open but lower its priority.  I still don't know what the best fix is, but 
I'm handing over maintainership of perftools, so I'm happy to say it is not 
Someone Else's Problem. :-)

Original comment by csilv...@gmail.com on 25 Jan 2012 at 11:45

GoogleCodeExporter commented 9 years ago
Just checking if there has been any further progress here. I am actively doing 
patch work for the next few days and would like to clear up as many issues as 
possible.

Original comment by chapp...@gmail.com on 21 Apr 2012 at 6:16

GoogleCodeExporter commented 9 years ago
Any update here?

Original comment by chapp...@gmail.com on 11 Mar 2013 at 2:26

GoogleCodeExporter commented 9 years ago
Sorry for not responding earlier--I must have missed the email from last year. 
We ended up not using tcmalloc in our project so I completely forgot about it.

While we were using it, I just had a workaround to ease up on the reallocation 
strategy and that solved the problem. I never got around to testing it against 
Python, so I'm not sure it's fit for inclusion.

Original comment by mill...@gmail.com on 11 Mar 2013 at 12:28

GoogleCodeExporter commented 9 years ago
Closing as too old and irrelevant.

Original comment by alkondratenko on 30 Aug 2013 at 12:04