The program you ran has a memory leak. You need to be doing
delete[] data;
not
delete data;
I don't know why you didn't see increased memory use with the default memory
allocator. It should also have been eating up all your memory.
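For reference, a minimal sketch of the fix being suggested (the array size here is a placeholder, not the original program's code):

  // Memory obtained with new[] must be freed with delete[]; plain delete on
  // an array is undefined behavior and, with some allocators, leaks the block.
  char* data = new char[100 << 20];
  // ... use data ...
  delete[] data;   // correct
  // delete data;  // wrong: mismatched with new[]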
Original comment by csilv...@gmail.com
on 15 Sep 2011 at 6:48
No, it actually didn't matter. I corrected it, but still had the same issue.
Original comment by chiyoung...@gmail.com
on 15 Sep 2011 at 7:01
Ok, reopening the bug. Am going on vacation but will take a look when I have a
chance.
One thing you can try in the meantime: periodically (maybe when resetting the
size back to minimum), print out the MallocExtension::instance()->GetStats() to
see where tcmalloc thinks the memory is all going.
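A minimal sketch of such a periodic dump, assuming the gperftools/malloc_extension.h header (older installs use google/malloc_extension.h); the buffer size is an arbitrary choice:

  #include <cstdio>
  #include <gperftools/malloc_extension.h>

  // Dump tcmalloc's view of where the memory is (page heap, central cache,
  // thread caches, ...). Call this, say, whenever the size is reset to minimum.
  static void DumpTcmallocStats() {
    char buf[16384];  // GetStats truncates its report to fit the buffer
    MallocExtension::instance()->GetStats(buf, sizeof(buf));
    fprintf(stderr, "%s\n", buf);
  }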
Original comment by csilv...@gmail.com
on 15 Sep 2011 at 7:20
I looked into this issue; I was able to reproduce the problem. The big chunks of
memory come from the PageHeap, which allocates memory from the kernel. Up to
kMinSystemAlloc this works acceptably: blocks of size kMinSystemAlloc are
allocated from the kernel and re-used. Beyond kMinSystemAlloc, blocks of the
exact size requested are allocated from the kernel, so with this allocation
pattern the subsequent block can never reuse the previous allocation. (The
kernel returns non-contiguous blocks, so no help there!) Because of this, each
chunk requires a kernel allocation, and this memory cannot really be returned,
just unmapped.
It is possible to tweak GrowHeap to aggressively return unused memory from the
huge object list; the virtual memory utilization still grows dramatically, but
the resident memory size remains more acceptable (still not great, but a big
improvement).
A trivial patch that confirms this is the issue is to make GrowHeap always
request extra memory from the kernel:
  if (n > kMaxValidPages) return false;
  Length ask = (n>kMinSystemAlloc) ? n : static_cast<Length>(kMinSystemAlloc);
+ ask = ask * 2;
  size_t actual_size;
The program then runs fine on my machine; the virtual size goes to about 2.5GB
and the resident size is about 5MB (because we're not actually touching most of
the memory). Rounding the allocation up to the next power of two gives even
better behaviour (VSZ 1GB, resident about 4MB).
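For illustration, a minimal sketch of that rounding (this is not the exact patch that was tested):

  // Length is tcmalloc's page-count type (a uintptr_t). Round the kernel
  // request up to the next power of two, so a later, slightly larger request
  // can still be satisfied by a block of the same rounded size.
  static Length RoundUpToPowerOfTwo(Length n) {
    Length rounded = 1;
    while (rounded < n) rounded <<= 1;
    return rounded;
  }
  // ...then in GrowHeap: ask = RoundUpToPowerOfTwo(ask);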
So this particular allocation pattern (step-increasing "huge" allocations) is
pathological for the current allocation strategy. It would be fairly easy to
fix by allocating big memory chunks with padding, or by pre-allocating a huge
chunk of memory. I think the latter is probably preferable for most scenarios
(and can be done without changing the perftools library). I presume this came
up "in the real world"? Can you share some information on what the scenario
was?
Addendum: Even with the rounding-up "fix", the code still continues to
periodically (about once per iteration) call GrowHeap, and it therefore "leaks"
virtual memory, even though the total memory footprint isn't changing. This is
because of the periodic release strategy; running "TCMALLOC_RELEASE_RATE=0
./crash-tcmalloc" stops this behaviour.
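For reference, a minimal sketch of the kind of step-increasing allocation pattern under discussion; the exact sizes and step are illustrative assumptions, not the original crash-tcmalloc source:

  #include <cstddef>
  #include <cstring>

  int main() {
    const size_t kMin  = 100u << 20;   // start around 100MB
    const size_t kMax  = 500u << 20;   // grow to roughly 500MB
    const size_t kStep = 20u << 10;    // bump the request by ~20KB each time
    for (int pass = 0; pass < 3; ++pass) {
      for (size_t size = kMin; size <= kMax; size += kStep) {
        // Each request is slightly larger than the last, so the block just
        // freed can never satisfy the next request.
        char* data = new char[size];
        std::memset(data, 0, 4096);    // touch only a little, keeping RES small
        delete[] data;
      }
    }
    return 0;
  }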
Original comment by jus...@fathomdb.com
on 20 Sep 2011 at 1:46
Thanks very much. These details help me understand the issue more exactly. Yes,
this issue happened in one of our real applications, where we maintain a big
hash table in memory and perform a lot of append operations on the hash table's
keys, although we put a limit on a value's max size.
Original comment by chiyoung...@gmail.com
on 20 Sep 2011 at 6:28
Jus..., thanks for tracking this down! I had thought that the big allocations
would be reused by tcmalloc, because of the huge item freelist. But I don't
know very much about this part of the code; perhaps there isn't even a huge
item freelist. Such a list would definitely help in this case (assuming the
list was large enough), though I don't know how it would need to be tweaked for
other workloads.
In any case, it sounds like more aggressive reclaiming of freelists may help in
your situation; you can do this without recompiling via the envvar
TCMALLOC_RELEASE_RATE. You can also manually call
MallocExtension::instance()->ReleaseFreeMemory().
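A minimal sketch of the manual route (the function name and call site are just an illustration, assuming the gperftools/malloc_extension.h header):

  #include <gperftools/malloc_extension.h>

  // Hand any free spans back to the OS (tcmalloc uses madvise under the hood)
  // right after the large buffer has been released, instead of waiting for the
  // periodic release strategy to kick in.
  static void ReleaseAfterReset() {
    MallocExtension::instance()->ReleaseFreeMemory();
  }

The envvar route needs no code change, e.g. TCMALLOC_RELEASE_RATE=10 ./your-app; larger values release free memory back to the OS more aggressively, and 0 disables the periodic release entirely.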
} So this particular allocation pattern (step-increasing "huge" allocations) is
} pathological for the current allocation strategy.
I think that's right -- the allocator is definitely biased towards lots of
small allocations, and only occasional big ones. If performance is an issue,
it may be worthwhile to rewrite the app to have that property in any case,
since lots of large allocations probably also means lots of data copying,
which is slow.
I'm re-closing WontFix, since I think the allocator is working as intended
here. Hopefully some of the workarounds here will help you.
Original comment by csilv...@gmail.com
on 20 Sep 2011 at 4:00
I think WontFix is right. In answer to your questions: big allocations come
directly from the kernel; there is a freelist (managed by the PageHeap class),
but this allocation pattern ensures that allocations don't fit into existing
chunks until the cycle is repeated (by which time we've allocated ...100MB +
100.02MB + 100.04MB + ... + 500MB).
The one concern is that, because it's not possible to truly release kernel
memory (it can be madvise'd DONTNEED, but not released), if this step-wise
growth is actually a common allocation pattern then we should consider padding
big allocation blocks (perhaps controlled by an env flag). However, I don't
really understand the details of the use-case that chiyoung... described - it
sounds like a custom hashtable with some assumptions about the memory allocator
that work for the stock allocator but not for tcmalloc. In that case it's
probably better to tweak the hashtable and avoid the memory copying, as you say.
Original comment by jus...@fathomdb.com
on 20 Sep 2011 at 4:21
Jus... says "The kernel returns non-contiguous blocks" and this is very bad.
Maybe the kernel should be fixed?
We should be able to use sbrk as long as possible,
and even if we switch to mmap it is preferable to keep the heap contiguous, right?
Original comment by pafi...@gmail.com
on 9 Oct 2012 at 7:45
AFAIK glibc malloc will simply do mmap/munmap for such huge chunks. IMHO that's
a viable option. If someone allocates 1+ megs, the cost of filling that memory
with something useful may well be comparable to the overhead of mmap/munmap.
And it avoids the problem of having to reuse those pieces of memory, by just
always releasing them back to the OS.
Original comment by alkondratenko
on 6 Mar 2013 at 12:03
I'm reopening this because I believe most of this can be fixed with relatively
small changes.
I also believe that issue 443 is related, and one of my patches fixes tcmalloc
on at least the example program posted there.
The fixes can be seen here
https://github.com/alk/gperftools/commits/some-fragmentation-avoidance or in
the attachments below (in git format, but applicable via patch -p1).
Original comment by alkondratenko
on 9 Mar 2013 at 1:55
Attachments:
Hi, thanks for the patches. Indeed I checked the patches on a PPC64 and x86_64
and they seem to fix the issue.
Related to the 0001 patch, I just wonder what kind of performance issue might
arise from constantly trying to release all the pages whenever new page heap
growth is requested.
Related to 0002, although it does not fix 443 (I checked on a PPC64 and x86_64,
and to fix 443 I think my patch is still needed), it does help in reducing the
virtual memory usage. Just some questions: why did you pick 8MB? Do you think
it is worth adding an environment variable to tune it, or does 8MB already
suffice for most cases?
I was somewhat concerned by potential performance degradation too. But my
thinking is that growing the heap should be infrequent, so some additional work
before calling GrowHeap should be irrelevant.
Another thing to consider is the minor page faults we will get from those
just-unmapped spans. But there again my thinking is: unless we're badly
fragmented we should be doing GrowHeap with really few free spans, so normally
the count of released pages should be small. If we are fragmented, then
unmapping those spans may help us coalesce them with other free and unmapped
spans, and if it does not, then at the very least we will release actual memory
back to the OS, eating only virtual address space.
Regarding 8 megs: I've played with various sizes, observing the number of free
spans after the free phase is done. With 128 megs I was seeing about 10 large
spans after the first pass, which had grown to about 20 in the second pass. My
understanding is those were mostly caused by some caching of free lists. With 8
megs I was seeing a decent count of free spans (AFAIR about 32 on the first
pass), and I believe 8 is small enough to be fine for smaller environments such
as ARMs.
I've just come up with the idea of starting with small batches and, as metadata
usage grows (as part of general heap growth), switching to massive batches.
Let me also note that in the second patch I believe I'm missing some locking.
While malloc metadata appears to be added under its own higher-level locks,
I've found other users of the metadata allocation function in non-malloc code,
so I believe mutual exclusion is required.
Original comment by alkondratenko
on 10 Mar 2013 at 6:37
I agree with your comments; I'd just wonder if there is a performance testcase
we can use to evaluate whether this is a performance hit. However, the testcase
triggers a nasty issue and it should be fixed.
I checked here and, besides the page heap code itself, I noticed two other
places that use MetaDataAlloc:
1. Static::InitStaticVars, and there is no need for locking in this phase.
2. ThreadCache::CreateCacheIfNecessary, and its access is locked.
Did I overlook something here?
Original comment by zatr...@gmail.com
on 10 Mar 2013 at 2:24
It's used by PageHeapAllocator, which is used in a number of places. But you
may be right that they are all guarded already. On the other hand, some future
user of MetaDataAlloc may be unaware that it's not thread-safe, and given that
the current implementation grabs locks anyway (inside the kernel as part of
sbrk or mmap, plus tcmalloc's spinlock in system_alloc.cc), I believe some
extra guarding and safer semantics may still be a good idea. I'd like to think
about it a bit more.
Regarding a possible regression, I agree. Give me a bit of time to see what the
worst-case behavior is, to make sure we're not making any case worse. Or at the
very least we should understand how bad the possible downside is.
Original comment by alkondratenko
on 11 Mar 2013 at 7:36
Yeah, both references I cited use the PageHeapAllocator. And I don't feel
inclined to add more locking where there is no need: I believe it would be
better to add a comment saying that MetaDataAlloc should always be called with
locks held.
Original comment by zatr...@gmail.com
on 11 Mar 2013 at 7:56
I think the worst case is a page fault per page allocated/freed in
short-running programs that don't reach stable virtual address space usage. I'm
going to limit the badness by only releasing everything back if growing the
heap crosses a certain boundary, i.e. something like vm-size-before div 128M !=
vm-size-after div 128M.
Original comment by alkondratenko
on 15 Mar 2013 at 5:09
Here's the latest revision. Changes are:
* locking around the internals of MetaDataAlloc (see more below)
* more conservative logic for releasing memory back to the OS; see my comment
above, with full details in the code.
Regarding locking: even if all users of MetaDataAlloc were guarded by locks,
I'm not seeing any one lock that they all use. In particular, it's clear that
most users of MetaDataAlloc grab pageheap_lock, but I've found that
stacktrace_allocator is actually used without holding this lock.
Original comment by alkondratenko
on 1 Apr 2013 at 5:03
Attachments:
Hi, thanks for the patch. Sorry for the delay on the review; I have some
questions about 0001-issue-368-unmap-free-spans-and-retry-before-growing-.patch.
I'm somewhat confused by how you chose the conditions that gate the page
release:
- stats_.free_bytes + stats_.unmapped_bytes >= stats_.system_bytes / 4 -> Why
did you choose div 4 in this case? Could you elaborate?
- stats_.system_bytes / kForcedCoalesceInterval != (stats_.system_bytes + (n <<
kPageShift)) / kForcedCoalesceInterval -> since both terms are divided by
'kForcedCoalesceInterval', the division is not really needed.
- In ReleaseAtLeastNPages(static_cast<Length>(0x7fffffff)) I believe you meant
to release as many pages as possible, right? Since Length is a uintptr_t,
should it be 0xffffffff for 32 bits and 0xffffffffffffffff for 64 bits?
Original comment by zatr...@gmail.com
on 9 Apr 2013 at 4:44
Regarding the ReleaseAtLeastNPages call: recall that it's in pages, not bytes,
so a big 32-bit value should be big enough.
The division by kForcedCoalesceInterval is required; don't forget it's integer
division. This code checks whether we're crossing one of the
x % kForcedCoalesceInterval == 0 boundaries.
The division by 4 is part of a check to not bother (and risk quadratic
behavior) if the amount of free memory is relatively small.
Original comment by alkondratenko
on 9 Apr 2013 at 5:38
Understood; I was confused about the values for ReleaseAtLeastNPages. I was
just wondering if it's worth adding a configurable parameter for this.
I did some internal benchmarks and the results look OK. I'm fine with the patch.
Original comment by zatr...@gmail.com
on 11 Apr 2013 at 3:40
Thanks for the review. Somehow I didn't receive any notification of your
comment. Will commit soon. Thanks a lot again.
Original comment by alkondratenko
on 6 May 2013 at 7:36
Committed into svn
Original comment by alkondratenko
on 6 May 2013 at 7:51
I believe you forgot to add 'large_heap_fragmentation_unittest.cc'; the build
is failing because it cannot find this file.
Original comment by zatr...@gmail.com
on 7 May 2013 at 7:24
The handling of MetaDataAlloc for bytes >= kMetadataAllocChunkSize looks
incorrect (0002-issue-368-443-allocate-metadata-in-big-batches.patch):
kMetadataBigAllocThreshold bytes are allocated regardless of the 'bytes' value.
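For clarity, a minimal sketch of the batching idea with such a fix applied; the names follow the patch, but the logic is an illustration, the threshold values are assumptions, and the system-allocator call is a stand-in rather than tcmalloc's internal one:

  #include <cstddef>
  #include <cstdlib>

  // Stand-ins for the constants in the patch (8MB per the discussion above).
  static const size_t kMetadataAllocChunkSize    = 8 << 20;
  static const size_t kMetadataBigAllocThreshold = 8 << 20;

  // Stand-in for tcmalloc's system allocator, to keep the sketch self-contained.
  static void* SystemAlloc(size_t bytes) { return std::malloc(bytes); }

  static char*  metadata_chunk       = NULL;
  static size_t metadata_chunk_avail = 0;

  void* MetaDataAllocSketch(size_t bytes) {
    if (bytes >= kMetadataBigAllocThreshold) {
      // Huge requests bypass the batch. Allocate exactly 'bytes', not the
      // fixed threshold (the bug pointed out above).
      return SystemAlloc(bytes);
    }
    if (bytes > metadata_chunk_avail) {
      // Refill the batch; the remainder of the old chunk is simply abandoned,
      // since tcmalloc metadata is never freed anyway.
      metadata_chunk = static_cast<char*>(SystemAlloc(kMetadataAllocChunkSize));
      if (metadata_chunk == NULL) return NULL;
      metadata_chunk_avail = kMetadataAllocChunkSize;
    }
    // Carve small metadata allocations out of the batch without further
    // system calls.
    void* result = metadata_chunk;
    metadata_chunk       += bytes;
    metadata_chunk_avail -= bytes;
    return result;
  }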
Original comment by pafi...@gmail.com
on 7 May 2013 at 8:53
Indeed I messed up putting this into svn. large_heap_fragmentation_unittest.cc
is now added. Thanks for pointing this out
Original comment by alkondratenko
on 7 May 2013 at 10:25
Thanks, pafinde... indeed you're right. Fixed. Thanks.
Original comment by alkondratenko
on 7 May 2013 at 10:29
On CentOS 5.2 x64, I tried running crash-tcmalloc with the latest version of
tcmalloc (rev 218); I still see the virt size growing to 1.2G and staying
there. OTOH, when I run it with glibc malloc, the virt size grows to 528MB and
after that drops to 50k. How does glibc do this? Is there a way tcmalloc can
also do the same?
The RES size is of course small for both (< 4 kbytes).
The reason I'm concerned with virt size is that we have an application whose
virt size grows to > 16GB on some workloads where the RES size is ~9GB. Any
time the application crashes, the core size is the virt size. It becomes
unmanageable when this happens in customer deployments.
Original comment by shash...@gmail.com
on 19 May 2013 at 3:47
Tcmalloc's approach is not always the most fragmentation-friendly. I'd like to
know a bit more about what exactly you're trying to run. Also it would be nice
if you could compare with 2.0.
Original comment by alkondratenko
on 19 May 2013 at 7:39
Issue 545 has been merged into this issue.
Original comment by alkondratenko
on 13 Jul 2013 at 10:39
Issue 544 has been merged into this issue.
Original comment by alkondratenko
on 13 Jul 2013 at 10:40
Original issue reported on code.google.com by
chiyoung...@gmail.com
on 15 Sep 2011 at 6:25