Memory bloating issues on Linux (CentOS)

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1. I wrote the simple application that creates a char array with the max size 
500MB and frees it repeatedly:

#include <pthread.h>
#include <iostream>
#include <string>

#define NUM_THREADS 1
#define MAX_SIZE 500000000

static void *launch_crash_thread(void *arg) {
    char *data = NULL;
    size_t size = 20000;
    while (true) {
        std::cout << "size: " << size << std::endl;
        delete data;
        data = new char[size];
        size += 20000;
        size = size > MAX_SIZE ? 20000 : size;
        usleep(rand() % 1000);
    }
}

int main() {
    pthread_t threads[NUM_THREADS];
    for (int i = 0; i < NUM_THREADS; ++i) {
        int rc = pthread_create(&threads[i], NULL, launch_crash_thread, NULL);
    }

    for (int i = 0; i < NUM_THREADS; ++i) {
        int rc = pthread_join(threads[i], NULL);
    }

    return 0;
}

2. Then, compile it by linking to tcmalloc, "g++ -ltcmalloc -lpthread 
crash-tcmalloc.cc -o crash-tcmalloc"

3. Run the executable

What is the expected output? What do you see instead?

After 15 - 20 sec, the resident and virtual memory usage of my application 
reached to 7.5GB and 4320GB, respectively. Actually, the memory usage grew 
continuously:

top - 23:01:28 up 160 days,  8:03,  2 users,  load average: 0.61, 0.15, 0.05
Tasks: 127 total,   2 running, 125 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.6%us,  8.7%sy,  0.0%ni, 62.8%id, 27.6%wa,  0.0%hi,  0.3%si,  0.0%st
Mem:   8140108k total,  8093084k used,    47024k free,      256k buffers
Swap: 10241396k total,  1080076k used,  9161320k free,     5504k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                        
15261 root      17   0 4320g 7.5g  316 S  7.9 96.4   0:01.51 crash-tcmalloc

In addition, there was a heavy swap in the system:

[root@localhost ~]# vmstat 1 100
procs -----------memory---------- ---swap-- -----io---- --system-- 
-----cpu------
 r  b   swpd          free      buff  cache   si   so            bi    bo          in      cs   us sy id wa st
 2  1 1876020  47076    200   5832    2    1             10    85         0        0     1  1 98  0  0
 1  0 1946428  45364    156   6168  512 70540   512 70540 1415  958  1 11 59 29  0
 1  0 2023476  43520    144   5720  588 77024   688 77024 1454 1015  1 10 60 29  0
 2  0 2087988  45532    148   5244  588 64544   616 64592 1400  919  1 11 57 31  0
 1  1 2158644  43172    144   5368  544 70676   544 70676 1420  983  1 12 56 31  0
 1  1 2237492  44008    140   5072  608 78892   608 78892 1431 1061  1 13 58 29  0
 1  1 2318388  43812    132   4960  512 80896   512 80896 1399 1060  1 12 53 34  0
 0  2 2382900  44872    132   5312  512 64512   512 64512 1411  913  1 10 61 28  0
 0  2 2473012  44860    132   5336  672 90112   672 90112 1500 1199  1 15 57 27  0
...

I also compiled it without linking to tcmalloc, "g++ -lpthread test.cc -o 
crash-tcmalloc", but didn't see any high memory or swap usage.

What version of the product are you using? On what operating system?

google-perftools: 1.8.2
OS: CentOS 5.4

Please provide any additional information below.

I don't think this issue is specific to CentOS, and saw the same issues on 
other linux distributions.

Original issue reported on code.google.com by chiyoung...@gmail.com on 15 Sep 2011 at 6:25

GoogleCodeExporter commented 9 years ago

The program you ran has a memory leak.  You need to be doing
   delete[] data;
not
   delete data;

I don't know why you didn't see increased memory use with the default memory 
allocator.  It should also have been eating up all your memory.

Original comment by csilv...@gmail.com on 15 Sep 2011 at 6:48

Changed state: Invalid
Added labels: Priority-Medium, Type-Defect

GoogleCodeExporter commented 9 years ago

No, it actually didn't matter. I corrected it, but still had the same issue.

Original comment by chiyoung...@gmail.com on 15 Sep 2011 at 7:01

GoogleCodeExporter commented 9 years ago

Ok, reopening the bug.  Am going on vacation but will take a look when I have a 
chance.

One thing you can try in the meantime: periodically (maybe when resetting the 
size back to minimum), print out the MallocExtension::instance()->GetStats() to 
see where tcmalloc thinks the memory is all going.

Original comment by csilv...@gmail.com on 15 Sep 2011 at 7:20

Changed state: New

GoogleCodeExporter commented 9 years ago

I looked into this issue; I was able to reproduce the problem.  Big chunks of 
memory come from the PageHeap; it allocates memory from the kernel.  Up to 
kMinSystemAlloc this works acceptably; blocks of size kMinSystemAlloc are 
allocated from the kernel  and re-used.  After kMinSystemAlloc, blocks of the 
exact size allocated are requested from the kernel, so the subsequent block 
cannot reuse the previous allocation in this allocation pattern. (The kernel 
returns non-contiguous blocks, so no help there!)  Because of this, each chunk 
requires kernel allocation, and this memory cannot really be returned, just 
unmapped.

It is possible to tweak GrowHeap to aggressively return unused memory from the 
huge object list, then the virtual memory utilization grows dramatically though 
the resident memory size remains more acceptable (still not great, but a big 
improvement)

A trivial patch that verifies this is the issue is to do the below in GrowHeap, 
to always request extra memory from the kernel:

   if (n > kMaxValidPages) return false;
   Length ask = (n>kMinSystemAlloc) ? n : static_cast<Length>(kMinSystemAlloc);
+  ask = ask * 2;
   size_t actual_size;

The program then runs fine on my machine; virtual size goes to about 2.5GB and 
resident size is about 5MB (because we're not actually touching most of the 
memory).  Rounding to the allocation up to the next power of two gives even 
better behaviour (VSZ 1GB, resident about 4MB). 

So this particular allocation pattern (step-increasing "huge" allocations) is 
pathological for the current allocation strategy.  It would be fairly easy to 
fix by allocating big memory chunks with padding, or by pre-allocating a huge 
chunk of memory.  I think the latter is probably preferable for most scenarios 
(and can be done without changing the perftools library).  I presume this came 
up "in the real world"?  Can you share some information on what the scenario 
was?

Addendum: Even with the rounding-up "fix", the code still continues to 
periodically (about once per iteration) call GrowHeap, and it therefore "leaks" 
virtual memory, even though the total memory footprint isn't changing.  This is 
because of the periodic release strategy; running "TCMALLOC_RELEASE_RATE=0 
./crash-tcmalloc" stops this behaviour.

Original comment by jus...@fathomdb.com on 20 Sep 2011 at 1:46

GoogleCodeExporter commented 9 years ago

Thanks very much. These details help me understand the issue more exactly. Yes, 
this issue happened in one of our real applications where we maintain the big 
hash table in memory and perform a lot of append operations on the hash table's 
keys although we put the limit on a value's max size.

Original comment by chiyoung...@gmail.com on 20 Sep 2011 at 6:28

GoogleCodeExporter commented 9 years ago

Jus..., thanks for tracking this down!  I had thought that the big allocations 
would be reused by tcmalloc, because of the huge item freelist.  But I don't 
know very much about this part of the code; perhaps there isn't even a huge 
item freelist.  Such a list would definitely help in this case (assuming the 
list was large enough), though I don't know how it would need to be tweaked for 
other workloads.

In any case, it sounds like more aggressive reclaiming of freelists may help in 
your situation; you can do this without recompiling via the envvar 
TCMALLOC_RELEASE_RATE.  You can also manually call 
MallocExtension::instance()->ReleaseFreeMemory().

} So this particular allocation pattern (step-increasing "huge" allocations) is
} pathological for the current allocation strategy.

I think that's right -- the allocator is definitely biased towards lots of 
small allocations, and only occassional big ones.  If performance is an issue, 
it may be worthwhile to try to rewrite the app to have that property in any 
case, since lots of large allocations probably also means lots of data copying, 
which is slow.

I'm re-closing WontFix, since I think the allocator is working as intended 
here.  Hopefully some of the workarounds here will help you.

Original comment by csilv...@gmail.com on 20 Sep 2011 at 4:00

Changed state: WontFix

GoogleCodeExporter commented 9 years ago

I think WontFix is right.  In answer to your questions: Big allocations come 
direct from the kernel;  there is a freelist (managed by the PageHeap class); 
but this allocation pattern ensures that allocations don't fit into existing 
chunks until the cycle is repeated (by which time we've allocated ...100MB + 
100.02MB + 100.04MB + ... + 500MB)

The one concern is that because it's not possible to truly release kernel 
memory (it can be madvise DONTNEED, but not released), then if this step-wise 
growth is actually a common allocation pattern then we should consider padding 
big allocation blocks (perhaps controlled by an env flag).  However, I don't 
really understand the details of the use-case that chiyoung... described - it 
sounds like a custom hashtable with some assumptions about the memory allocator 
that work for the stock allocator but not for tcmalloc.  In that case it's 
probably better to tweak the hashtable and avoid the memory copying, as you say.

Original comment by jus...@fathomdb.com on 20 Sep 2011 at 4:21

GoogleCodeExporter commented 9 years ago

Jus... says "The kernel returns non-contiguous blocks" and this is very bad.
Maybe kernel should be fixed?
We should be able to use sbrk as long as possible,
and even if we switch to mmap it is preferable to keep heap contiguous, right?

Original comment by pafi...@gmail.com on 9 Oct 2012 at 7:45

GoogleCodeExporter commented 9 years ago

AFAIK glibc malloc will simply do mmap/munmap for such huge chunks. IMHO that's 
viable option. If someone allocates 1+ megs, cost of filling that memory with 
something useful may be well comparable to overhead of mmap/munmap. And it'll 
avoid problem of having to reuse those pieces of memory but just always 
releasing them back to OS.

Original comment by alkondratenko on 6 Mar 2013 at 12:03

GoogleCodeExporter commented 9 years ago

I'm reopening this because I believe most of this can be fixed with relatively 
small changes.

I also believe that issue 443 is related and one of my patches fixes tcmalloc 
on at least example program posted there.

Fixes can be seen here 
https://github.com/alk/gperftools/commits/some-fragmentation-avoidance or in 
attachments below (in git format but applicable via patch -p1).

Original comment by alkondratenko on 9 Mar 2013 at 1:55

Changed state: Accepted

Attachments:

GoogleCodeExporter commented 9 years ago

Hi, thanks for the patches. Indeed I checked the patches on a PPC64 and x86_64 
and they seems to fix the issue.

Related to 0001 patch I just what kind of performance issue might arise from 
constantly trying to release all the pages when a new pagehead is requested.

Related to 0002, although it does not fix 443 (I checked on a PPC64 and x86_64 
and to fix 443 I think my patch is still needed) it does help on reducing the 
virtual memory usage. Just some question: why did you pick 8MB? Do you think it 
is worth to an and environment variable to try tune it of 8MB is already 
suffice for most cases?

Original comment by zatr...@gmail.com on 10 Mar 2013 at 4:03

GoogleCodeExporter commented 9 years ago

I was somewhat concerned by potential performance degradation too. But my 
thinking is that grow heap should be infrequent, so so some additional work 
before calling grow heap should be irrelevant.

Another thing to consider is minor page fault that we will get from those just 
unmapped spans. But there again my thinking is: unless we're badly fragmented 
we should be doing GrowHeap with really few free spans, so normally count of 
released pages should be small. If we are fragmented then unmapping those spans 
may help us coalesce them with other free and unmapped spans, and if it does 
not then at very least we will release actual memory back to OS, eating just 
virtual address space.

Regarding 8 megs. I've played with various sizes and I was observing number of 
free spans after free phase is done. With 128 megs I was seeing about 10 large 
spans after first pass, which have grown to about 20 in second pass. My 
understanding is those were mostly caused by some caching of free lists. With 8 
megs I was seeing decent count of free spans (AFAIR about 32 on first pass) and 
I believe 8 is small enough to be fine for smaller environments such as ARMs.

I've just came up with idea of starting with small batches and as metadata 
usage grows (as part of general heap grows) we can start using massive batches.

Let me also note that in second patch I believe I'm missing some locking. While 
malloc metadata appears to be added under it's own higher-level locks, I've 
found other users of metadata allocation function in non-malloc code so I 
believe mutual exclusion is required.

Original comment by alkondratenko on 10 Mar 2013 at 6:37

GoogleCodeExporter commented 9 years ago

I agree with your comments, I'd just wonder if there is a performance testcase 
we can use to evaluate if this is a hit in performance. However the testcase 
trigger a nasty issue and it should fixed. 

I checked here and besides the page heap code itself, I noticed two other place 
that use MetaDataAlloc utilization:

1. Static::InitStaticVars and there is no need to locking on this phase.
2. ThreadCache::CreateCacheIfNecessary and its access is locked.

Did i overlooking something here?

Original comment by zatr...@gmail.com on 10 Mar 2013 at 2:24

GoogleCodeExporter commented 9 years ago

It's used by PageHeapAllocator which is used in number of places. But you may 
be right that they are all guarded already. On the other hand some future user 
of MetaDataAlloc may be unaware that it's not thread safe and given current 
implementation grabs locks anyways (inside kernel as part of sbrk or mmap plus 
tcmalloc's spinlock in system_alloc.cc) I believe some extra guarding and safer 
semantics may be still good idea. I'd like to think about it a bit more.

Regarding possible regression I agree. Give me a bit of time to see what's 
worst case behavior is to make sure we're not making any case worse. Or at the 
very least we should understand how bad possible downside is.

Original comment by alkondratenko on 11 Mar 2013 at 7:36

GoogleCodeExporter commented 9 years ago

Yeah, both references I cited use the PageHeapAllocator. And I don't feel 
inclined to add more locking where there is no need: I believe it will better 
to add a comment saying the MetaDataAlloc should always be used guarded by 
locks.

Original comment by zatr...@gmail.com on 11 Mar 2013 at 7:56

GoogleCodeExporter commented 9 years ago

I think worst case is page fault per page allocated/freed in short running 
programs that don't reach stable virtual address space usage. I'm going to 
limit badness by only releasing everything back if grow heap crosses certain 
bound. I.e. something like vm-size-before div 128M != vm-size-after div 128M.

Original comment by alkondratenko on 15 Mar 2013 at 5:09

GoogleCodeExporter commented 9 years ago

Here's latest revision. Changes are:

* locking around internals of MetaDataAlloc (see more below)

* more conservative logic of releasing memory back to OS. See my comment above 
and full details are in code.

Regarding locking. Even if all users of MetaDataAlloc were guarded by locks I'm 
not seeing any one lock that they are all using. Particularly it's clear that 
most users of MetaDataAlloc grab pageheap_lock, but I've found that 
stacktrace_allocator is actually used without holding this lock.

Original comment by alkondratenko on 1 Apr 2013 at 5:03

Attachments:

GoogleCodeExporter commented 9 years ago

Hi, thanks for the patch. Sorry for the delay on revision, and I have some 
questions:

0001-issue-368-unmap-free-spans-and-retry-before-growing-.patch : I kind 
confused how
did you chose the conditions to test for pages release:

- stats_.free_bytes + stats_.unmapped_bytes >= stats_.system_bytes / 4 -> Why 
did you
  chose div 4 in these case? Could you elaborate?
- stats_.system_bytes / kForcedCoalesceInterval != (stats_.system_bytes + (n << 
kPageShift)) / kForcedCoalesceInterval) -> since both term are divided by
'kForcedCoalesceInterval', the division is not really needed.

- On ReleaseAtLeastNPages(static_cast<Length>(0x7fffffff)) I believe you meant 
to
  release as much pages as possible, right? Since Lenght is a uintptr_t, should it
  be 0xffffffff for 32 bits and 0xffffffffffffffff for 64 bits?

Original comment by zatr...@gmail.com on 9 Apr 2013 at 4:44

GoogleCodeExporter commented 9 years ago

On ReleaseAtLeastNPages call. Recall that it's not in bytes but in pages. So 
big 32-bit value should be big enough.

Division by kForcedCoalesceInterval is required. Don't forget it's integer 
division. This code checks if we're crossing one of x % kForcedCoalesceInterval 
== 0 boundaries.

Division by 4 is part of check to not bother (and risk quadratic behavior) if 
amount of free memory is relatively small.

Original comment by alkondratenko on 9 Apr 2013 at 5:38

GoogleCodeExporter commented 9 years ago

Understood, I did some confusion about the values ReleaseAtLeastNPages. I was 
just wondering if it's worth on adding a configurable parameter for this.

I did some internal benchmarks and results looks ok. I'm fine with the patch.

Original comment by zatr...@gmail.com on 11 Apr 2013 at 3:40

GoogleCodeExporter commented 9 years ago

Thanks for review. Somehow I didn't receive any notification from your comment. 
Will commit soon. Thanks a lot again.

Original comment by alkondratenko on 6 May 2013 at 7:36

GoogleCodeExporter commented 9 years ago

Committed into svn

Original comment by alkondratenko on 6 May 2013 at 7:51

Changed state: Fixed

GoogleCodeExporter commented 9 years ago

I believe you forgot to add 'large_heap_fragmentation_unittest.cc', the make is 
failing by not finding this file.

Original comment by zatr...@gmail.com on 7 May 2013 at 7:24

GoogleCodeExporter commented 9 years ago

The handling of MetaDataAlloc for bytes >= kMetadataAllocChunkSize looks 
incorrect (0002-issue-368-443-allocate-metadata-in-big-batches.patch)
- kMetadataBigAllocThreshold bytes is allocated, regardless of 'bytes' value.

Original comment by pafi...@gmail.com on 7 May 2013 at 8:53

GoogleCodeExporter commented 9 years ago

Indeed I messed up putting this into svn. large_heap_fragmentation_unittest.cc 
is now added. Thanks for pointing this out

Original comment by alkondratenko on 7 May 2013 at 10:25

GoogleCodeExporter commented 9 years ago

Thanks, pafinde... indeed you're right. Fixed. Thanks.

Original comment by alkondratenko on 7 May 2013 at 10:29

GoogleCodeExporter commented 9 years ago

On a CentOS 5.2 x64, I tried running crash-tcmalloc with the latest version of 
tcmalloc (rev 218), I still see virt size growing to 1.2G and stays there. 
OTOH, when I run it with glibc malloc, the virt size grows to 528MB and after 
that virt size drops to 50k. How does glibc do this? Is there a way tcmalloc 
can also do the same?

RES size is of course small for both (< 4kbytes)

The reason I'm concerned with virt size is that we have an application whose 
virt size grows to > 16GB on some workloads where the RES size is ~ 9GB. Any 
time the application crashes, the core size is the virt size. It becomes 
unmanageable when this happens in customer deployments.

Original comment by shash...@gmail.com on 19 May 2013 at 3:47

GoogleCodeExporter commented 9 years ago

Tcmalloc approach is not always most fragmentation friendly. I'd like to know a 
bit more about what exactly you're trying to run. Also it would be nice if you 
could compare with 2.0.

Original comment by alkondratenko on 19 May 2013 at 7:39

GoogleCodeExporter commented 9 years ago

Issue 545 has been merged into this issue.

Original comment by alkondratenko on 13 Jul 2013 at 10:39

GoogleCodeExporter commented 9 years ago

Issue 544 has been merged into this issue.

Original comment by alkondratenko on 13 Jul 2013 at 10:40

qchbai / gperftools

Memory bloating issues on Linux (CentOS) #368