nodejs / node-v0.x-archive

Moved to https://github.com/nodejs/node

Use mmap for buffer allocations (was: Strange problem with Buffers and RSS memory) #4283

Closed · tereska closed this issue 11 years ago

tereska commented 11 years ago

We have a simple http server that creates Buffer objects (mostly with size > 8 kB). RSS memory on our server machines only keeps growing, and at some point the node process uses up the whole server's memory.

I've written a simple script that exposes this behaviour:

  1. Make 10k Buffer objects
  2. Delete the variable holding the buffers
  3. Make another 10k Buffer objects
  4. Delete the variable holding the buffers
  5. Use a timer with some task to give the GC time to run

Unfortunately, RSS memory does not shrink back to the starting usage. When I use small buffer sizes (< 8 kB), the memory is released.

Why does it behave like this? We don't have any memory leaks or anything like that.

I know that memory is not always given back to the system, but with node Buffer allocations it never happens, and the node process ends up taking the whole server's memory.

Please help! Ubuntu 11.10 and 12.04, both 32-bit (node 0.4.12 and 0.8.latest).

var data = new Array(20000).join('0');
var h = null;

// start memory usage
console.log(process.memoryUsage().rss/1024/1024); // < 10MB

h = {};
for (var i = 0; i < 10000; i++) {
  h['key_' + i] = new Buffer(data);
}
console.log(process.memoryUsage().rss / 1024 / 1024); // > 200MB
h = null;

h = {};
for (var i = 0; i < 10000; i++) {
  h['key_' + i] = new Buffer(data);
}
console.log(process.memoryUsage().rss / 1024 / 1024); // > 200MB
h = null;

setInterval(function () {
  console.log(process.memoryUsage().rss / 1024 / 1024); // stays at > 200MB forever
  h = null;
  delete h; // no-op: delete has no effect on a variable declared with var
  h = {};
  for (var i = 0; i < 100; i++) {
    h['key_' + i] = i;
  }
}, 1000);
piscisaureus commented 11 years ago

I see this behaviour on linux, but not on windows. It is probably an artefact of how glibc handles large heap (de)allocations. @bnoordhuis, thoughts?

@tereska I doubt that this is actually the cause of your memory blowing up. Even if freed memory isn't returned to the OS, it can be re-used for later buffer allocations. So if node ends up using all your server's memory, you probably really did have that many buffers allocated at the same time at some point.

PS: it's generally considered helpful to post a test case (check), the product version (?) and your operating system (?) to bug reports. Now we're left guessing.

tereska commented 11 years ago

Ubuntu 11.10 and 12.04, both 32-bit (node 0.4.12 and 0.8.latest).

fastman commented 11 years ago

Hi, in this example, when I increase Buffer.poolSize (https://github.com/joyent/node/blob/master/lib/buffer.js#L300) to 800 kB, so that it becomes:

Buffer.poolSize = 8 * 1024 * 100;

Memory is properly released to the system. What about this?

Tested on node v0.8.14, debian 6.0.6, 2.6.32-5-xen-amd64
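
For illustration, a minimal sketch of this workaround (the 800 kB pool size is the value used above; the buffer size, count and timing are made up). The important detail is that Buffer.poolSize has to be raised before the buffers are allocated:

// Sketch: raise the pool size before allocating, so that the underlying
// SlowBuffer allocations are large enough for glibc to satisfy them with mmap().
Buffer.poolSize = 8 * 1024 * 100; // ~800 kB

var before = process.memoryUsage().rss;

var bufs = [];
for (var i = 0; i < 10000; i++) {
  bufs.push(new Buffer(20000)); // 0.8-era API
}
bufs = null; // drop the references so the buffers can be collected

setTimeout(function () {
  var delta = (process.memoryUsage().rss - before) / 1024 / 1024;
  console.log('RSS delta after the GC had a chance to run: ' + delta + ' MB');
}, 5000);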

bnoordhuis commented 11 years ago

Your test case is valid - for certain values of 'valid' - but the behavior you're seeing is not a bug, it's an artifact of how the malloc() implementation in glibc works. What is happening is that the allocated memory is not returned to the system. The 'why?' needs some explaining...

There are two ways for a program to request memory from the operating system, mmap() and brk().

mmap() claims a slab of memory somewhere in the address space of the process that can later be released to the operating system again.

brk() is different. It sets the "break", the address where the process "ends"; initially the break sits right after the process image (code + data + bss). If you want to allocate 256 bytes, you call brk(current_brk + 256) and claim the memory between the old and the new break as your own.

glibc sometimes uses mmap(), sometimes brk() - it depends on a number of things. Allocation size is one of them.

The issue with brk() is that while in theory it's possible to release memory again, in practice that never happens and, AFAIK, glibc doesn't even try. If you trace your test case with strace -ce brk,mmap node test.js, you'll see numbers comparable to this:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
  -nan    0.000000           0        83           mmap
  -nan    0.000000           0      1487           brk
------ ----------- ----------- --------- --------- ----------------
100.00    0.000000                  1570           total

In other words, nearly all allocations are done with brk() and they don't get returned to the OS.

When you increased Buffer.poolSize, you made glibc use mmap() instead.

I'm not going to change the default pool size in node.js though, because that strategy is unreliable at best. :-) For example, it doesn't affect the brk/mmap ratio with 64-bit builds.

Hope that clears it up for you.

piscisaureus commented 11 years ago

@bnoordhuis

Although definitely a defensible stance, we could consider actually calling mmap() (maybe optionally) when allocating a buffer. Lately we have seen a lot of bug reports about this problem.

IMO this is a shortcoming of glibc - if your program allocates a lot of memory at some point in its lifetime, it will never be reclaimed; the unused memory may get paged out at some point, but that is essentially a waste of swap space and (hard drive) time.

Also note that this problem does not exist on windows or darwin. On Darwin brk() is only available for backward compatibility reasons, but libc never uses it. It is not even possible to move the break beyond 4MB. On Windows, brk() doesn't exist, and (by default) the heap is essentially managed by the OS.

bnoordhuis commented 11 years ago

Although definitely a defensible stance, we could consider actually calling mmap() (maybe optionally) when allocating a buffer. Lately we have seen a lot of bug reports about this problem.

I guess that's an option. I'll reopen the issue for now.

bnoordhuis commented 11 years ago

we could consider actually calling mmap() (maybe optionally) when allocating a buffer

Tentative patch here: bnoordhuis/node@82bc42a

Performance-wise it's mostly a wash. new/delete is 1-2% faster with 8k buffers, mmap is 0.5-1.0% faster with 32k buffers. That's with the micro-benchmarks in benchmark/fast_buffer* that allocate myriads of buffers in a tight loop.

Note that you need to run node with --expose-gc and patch the test case to call gc() from time to time to see the effects. On my system, RSS drops from 220M to 16-20M after a sweep.
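
For reference, a sketch of what such a modification to the test case could look like (illustrative; run as node --expose-gc test.js):

setInterval(function () {
  if (typeof gc === 'function') gc(); // global gc() is only available with --expose-gc
  console.log(process.memoryUsage().rss / 1024 / 1024); // should drop after a sweep
}, 1000);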

kuebk commented 11 years ago

@bnoordhuis maybe we could add a command-line argument for the node binary, or an environment variable, that lets us decide at startup how big a FastBuffer we want?

bnoordhuis commented 11 years ago

how big FastBuffer we want at startup?

It's not about FastBuffers, it's about the size of the SlowBuffer that regular buffers are split off from.

Frankly, the initial SlowBuffer could be 1M. Who cares? Who'd notice?
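
For context, the pooling works roughly like the following simplified sketch (illustrative only, not the actual lib/buffer.js source): small buffers are sliced out of a shared SlowBuffer pool, large ones get their own SlowBuffer.

var SlowBuffer = require('buffer').SlowBuffer;

Buffer.poolSize = 8 * 1024; // default pool size
var pool; // the current SlowBuffer that small allocations are sliced from

function allocPool() {
  pool = new SlowBuffer(Buffer.poolSize);
  pool.used = 0;
}

function allocate(length) {
  if (length > Buffer.poolSize) {
    // Large request: gets its own SlowBuffer, i.e. its own malloc()/brk()/mmap().
    return { parent: new SlowBuffer(length), offset: 0, length: length };
  }
  // Small request: carve a slice out of the shared pool, renewing it when full.
  if (!pool || pool.length - pool.used < length) allocPool();
  var slice = { parent: pool, offset: pool.used, length: length };
  pool.used += length;
  return slice;
}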

youurayy commented 11 years ago

I tried both playing with Buffer.poolSize and applying the patch Ben posted here, but neither seems to work for me. I still need to triple-check my heap dumps to see if there could be any retention related to the RSS (if that's possible at all), but the heap size seems to be very stable while the RSS keeps growing, until the process is killed by the system when virtual memory eventually runs out. (Side note: the heap snapshot comparison in Chrome's dev console stopped working for me recently; is there any other way to compare snapshots?)

I'm opening and closing a lot of HTTPS sessions, plus I'm using gzip encoding (via Node's zlib), so those two come to mind as possible candidates if it's not the Buffer.

@bnoordhuis, would it make sense to try to force mmap for all allocations (just to determine if it's really the issue we think it is), and if yes, where would be a good place to temporarily put the mallopt(M_MMAP_THRESHOLD, 0); call?

Also, is there any way to examine what is actually causing the RSS to "leak"? I find it hard to imagine that so much data would not be reclaimable for further allocation because of fragmentation; the malloc sizes must repeat fairly often in an RSS of 3 GB and more, so even if it's just brk(), the RSS should not grow indefinitely. The heap sizes I am dealing with are just 30 to 60 MB.

bnoordhuis commented 11 years ago

I'm opening and closing a lot of HTTPS sessions, plus I'm using gzip encoding (via Node's zlib), so those two come to mind as possible candidates if it's not the Buffer.

We fixed a memory leak in tls/https recently, see commit 51d5655. If that's not it, apply the principle of exclusion: test https without zlib and vice versa, zlib without https (i.e. over http).
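
For example, a zlib-only test over plain http could look something like this (a sketch; the port and payload are arbitrary):

var http = require('http');
var zlib = require('zlib');

var payload = new Buffer(new Array(20000).join('0'));

// Gzip every response over plain http, to exercise zlib in isolation from tls/https.
http.createServer(function (req, res) {
  zlib.gzip(payload, function (err, compressed) {
    if (err) { res.statusCode = 500; return res.end(); }
    res.setHeader('Content-Encoding', 'gzip');
    res.end(compressed);
  });
}).listen(8000);

// Watch RSS while load-testing; if it still grows without bound,
// the leak is not specific to tls/https.
setInterval(function () {
  console.log(process.memoryUsage().rss / 1024 / 1024);
}, 5000);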

Side note: the heap snapshot comparison in Chrome's dev console stopped to work for me recently

Yes, it's frequently broken. I usually downgrade when that happens. Using the stable channel instead of canary builds helps as well.

is there any other way to compare snapshots?

Not yet. I intend to write a CLI tool but I haven't had time for that yet.

youurayy commented 11 years ago

It was totally the tls/https memleak, thanks a million for suggesting it. Runs like a clock now. I will later try to run a Node with the tls/https memleak fix applied, but without the RSS/buffer workaround applied, to see if I got anything relevant to add to this particular issue. Thanks!

bnoordhuis commented 11 years ago

Fixed in 2433ec8.

isaacs commented 11 years ago

The fix for this causes a somewhat ridiculous performance regression on OS X and SmartOS. It causes a less pronounced, but still unacceptable performance regression on Linux.

If a better solution cannot be found in the short term, we'll have to revert 2433ec8, and leave this open for a while longer. The leaked memory is less problematic than the reduced performance.

EDIT: The performance hit is most visible in benchmark/net-pipe.js. It should print numbers about half those of benchmark/throughput.js, but with this patch it drops to about 1/3 of the expected value on SmartOS, and around 1/2 on Darwin.

isaacs commented 11 years ago

I think the issue here is that we're mistaking intended system behavior for a "leak", because RSS is a very poor indicator of program memory usage. Perl has had this in their FAQ for many years: http://learn.perl.org/faq/perlfaq3.html#How-can-I-free-an-array-or-hash-so-my-program-shrinks-

There are occasionally actual C/C++ memory leaks (and JS reference leaks) in Node. But increasing RSS due to brk() calls is not a problem we ought to solve. We trade too much performance for not very much memory usage gain. Users who know that they will be creating a lot of small buffers can increase the Buffer.poolSize, or we can probably come up with a heuristic to dynamically enlarge it if it would be beneficial. But forcing mmap for every allocation is far too costly.
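
Such a heuristic might look roughly like the following userland sketch (purely hypothetical; the threshold and cap are made up and nothing like this exists in node):

var allocsSinceCheck = 0;
var MAX_POOL_SIZE = 1024 * 1024; // hypothetical 1 MB cap

// Hypothetical wrapper an application could route its buffer allocations through.
function trackedBuffer(size) {
  allocsSinceCheck++;
  return new Buffer(size);
}

setInterval(function () {
  // If the app churns through many buffers, grow the pool so that the underlying
  // SlowBuffer allocations cross glibc's mmap threshold and can be unmapped later.
  if (allocsSinceCheck > 10000 && Buffer.poolSize < MAX_POOL_SIZE) {
    Buffer.poolSize *= 2;
  }
  allocsSinceCheck = 0;
}, 1000);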

isaacs commented 11 years ago

Another strategy would be to re-use the memory allocated for SlowBuffer objects when they are reclaimed, rather than letting the OS take care of it. Of course, that introduces some added complexity, and the potential for actual memory leaks.

What would be ideal is if the OS could make this all Just Work, by letting us re-use memory that was ostensibly freed for subsequent allocs. (I'm talking with the SmartOS people at Joyent about some other solutions for this.)

bnoordhuis commented 11 years ago

Reverted for now in 6c5356b, re-roll in 8dc9dca.

It's interesting that using mmap() with the StreamWrap slab allocator causes a performance drop. It makes sense in a way; allocating large chunks is slower than allocating small chunks.

grzegorzlyczba commented 11 years ago

@bnoordhuis have you considered using an alternative to the glibc allocator? Maybe jemalloc (http://oldblog.antirez.com/post/everything-about-redis-24)?

bnoordhuis commented 11 years ago

@soymo I've experimented with jemalloc in the past but it was consistently slower than glibc's malloc by about 2% in our benchmarks. As to why, I don't know - I didn't dive in too deeply.

bnoordhuis commented 11 years ago

I've decided not to land the mmap patch.

I ran extensive benchmarks on both quiescent and loaded systems. Using mmap(), even when finely tuned, is as often harmful as it's beneficial. The interaction with the regular slab allocator is also quite intricate to say the least.

The one place where it really seems to help is on systems with intense VM pressure; carefully pre-faulting the pages reduces the number of page faults by often ridiculous numbers (from e.g. 400,000/s to 2,000/s).

But we optimize for the common case and it's decidedly unclear if this patch is always a win. Ergo, WONTFIX.

isaacs commented 11 years ago

@bnoordhuis Thanks for looking into this so thoroughly. I guess we can just keep pointing people at the perl faq, or suggesting that they bump up the Buffer.poolSize. http://xkcd.com/386/

nicokaiser commented 11 years ago

Ok, increasing Buffer.poolSize as proposed in https://github.com/joyent/node/issues/4283#issuecomment-10441587 did not help at all (far from it):

[graph: RSS memory of three WebSocket servers]

(This is the RSS memory of three WebSocket servers. The yellow one is the server with Buffer.poolSize = 8 * 1024 * 100; it was restarted while the others were running, hence the increase until 23:00. At about 7:00 the client count increased.)