@rvagg also, considering that 0.10 has climbed to the levels of iojs, and 0.12 is below - it might be that PayPal's proxy was encountering uneven load (cc @jasisk). It never appears to have been the case that 0.10 was higher than 0.12 in previous graphs.
Nah, ignore the last sentence, but anyway - it could just be uneven load.
Uneven load sounds right. A ton of automated builds kicking off every hour is what drives those spikes, which are a lot less prominent during the week. Plus, I'm disproportionately driving traffic to those machines. They're in a round-robin but still. I'll pull some stats to confirm the theory but I'll be putting everything back in the pool and restarting tonight so we should have better numbers by tomorrow afternoon.
All of the commits have landed in v1.x at this point.
A small update after a day. Looks like it's still leaking a bit.
red: 0.12, green: 0.10, purple: iojs.
Noooo!!!! Wtf!!! :) Will look into it tomorrow. Anyway, it seems to be much better than it was, and at least we identified the major source of it.
@jasisk please continue your observations, I still have high hopes that it'll stabilize at some point :)
This might be of interest as well: @wraithan and I tracked down a state where `SSL_free` wasn't getting called on v0.10 streams.
@chrisdickinson this is very unlikely to be hit in io.js
@jasisk could you please take a look at ? It seems that there are some retained sockets in kappa's code. I remember you telling me about the leaks you had; I wonder if this is expected or if it is caused by some behavior change in io.js.
Intended to do some profiling today (particularly after @chrisdickinson's comment), will let you know.
For clarity (since I mentioned this to Fedor over IRC): we've had a slow leak in kappa for years, on the order of tens of MB after months of uptime. For our purposes, it's proven largely inconsequential, so it's dropped off our list time and again. Time to look into it. :)
Heh, thanks @jasisk! I can confirm some sort of leak on my side; going to let RSS grow to 1GB and then try to figure something out from the heap snapshot locally.
At least, valgrind seems to be happy at this point.
Nah, it stabilized at 450MB RSS. Going to take a look at the heap snapshot anyway.
@jasisk I was unable to identify any relevant information in the heap snapshot. Do you have any further graphs of RSS on your instance of kappa? Did it stabilize?
Same from my end.
Fresh start this morning (deployed 1a3ca82) and things haven't completely leveled off yet. That said, iojs is right in the mix which—previously—wasn't the case at this point in the day. Much more confident in saying you've caught it!
0.10 in purple, 0.12 in yellow, iojs in red. [image: RSS] https://cloud.githubusercontent.com/assets/62923/6563251/25edf830-c676-11e4-8d83-233d82af7e66.png
I guess we could assume it is resolved now? In any case - feel free to reopen this ;)
:+1: thanks again!
Thank you!
Following up now that a few days have elapsed. While things are looking significantly better, there still appears to be a leak.
purple: iojs ~v1.5.1 (hash mentioned above), blue: 0.12, red: 0.10
Thanks @jasisk. Going to reopen.
WHY?! :disappointed:
@jasisk Do you have graph data over more time? The line looks like it leveled off at 1.1GB. Not saying that's acceptable, but it's curious why near-linear growth would have done that.
Was looking at that myself. A tiny bit more data—and we just spiked across all instances:
At 1.2GB, I have no reason to kick any of these over. I'll keep things running and check back in when it's clear a trend has emerged.
@jasisk this is very frustrating, I had high hopes for the latest patch. Since I'm unable to reproduce this leak locally, may I ask you to install the `heapdump` module, `require('heapdump')` somewhere in the code, and send `SIGUSR2` a couple of times while the leak is showing (ideally once it's significantly higher than v0.12 and v0.10)? I wonder if it'll fall back to the v0.10/v0.12 levels after `SIGUSR2`; `heapdump` starts a GC internally, so it might force-collect some stuff. Also, it would be great to take a look at the heap snapshot itself. Please contact me privately if you wish to do this, since the snapshot might contain sensitive data.
Thank you for the follow-up!
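For reference, a minimal sketch of the heapdump setup being requested here; the programmatic snapshot path is an illustrative assumption, not something from kappa:

```js
// Minimal sketch, not kappa's actual code: requiring heapdump early installs a
// SIGUSR2 handler, so `kill -USR2 <pid>` makes the process write a
// heapdump-*.heapsnapshot file into its current working directory.
var heapdump = require('heapdump');

// Snapshots can also be taken programmatically instead of via the signal:
heapdump.writeSnapshot('/tmp/kappa-' + Date.now() + '.heapsnapshot', function (err) {
  if (err) console.error('heap snapshot failed:', err);
});
```

The resulting .heapsnapshot files load directly into the Chrome DevTools heap profiler for comparison between runs.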
@indutny Yup. I was convinced of the fix 3 days ago, so I never did any profiling locally. After letting these processes go for a few more hours, I'll pull them down and instrument them. I'll take a first pass at what I find and, if nothing obvious jumps out, I'll follow up with you elsewhere.
@jasisk thank you! This sounds awesome.
Pulling down a couple of machines to instrument them but figured I'd give one more update before I do (iojs vs 0.12).
Still trending upward, unfortunately. Will update if / when the heap snapshots give me any insight.
@jasisk Have you changed any flags like `--max-old-space-size`? If not, the process should either be butting up against the GC (if there's a problem in the V8 heap), or there's an external allocation run amok.
@trevnorris nope. No flags.
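For completeness, a hedged way to double-check that from inside the running process (nothing kappa-specific assumed):

```js
// Hedged sketch: process.execArgv holds the node/iojs flags the process was
// launched with (script arguments live in process.argv), so an empty array
// here confirms no --max-old-space-size or other V8 flags were passed.
console.error('exec flags:', JSON.stringify(process.execArgv));
```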
Heap dumps aren't showing anything out of the ordinary.
Here's the used heap graph for the iojs instance (yellow is averaging over 2 hours):
@indutny seems like external allocations are leaking somewhere. thoughts?
@trevnorris don't have any thoughts yet, fighting with https://github.com/iojs/io.js/pull/1140 at the moment.
Will take a look later today.
@jasisk not sure how I missed it last time. May I ask you to give this patch a try?
```diff
diff --git a/src/tls_wrap.cc b/src/tls_wrap.cc
index 49523bc..495729c 100644
--- a/src/tls_wrap.cc
+++ b/src/tls_wrap.cc
@@ -310,8 +310,10 @@ void TLSWrap::EncOut() {
   write_req->Dispatched();

   // Ignore errors, this should be already handled in js
-  if (!r)
+  if (!r) {
     NODE_COUNT_NET_BYTES_SENT(write_size_);
+    write_req->Dispose();
+  }
 }
```
@jasisk updated the diff, there was a typo!
Though, this might not be enough by itself. @jasisk: what if I give you a patch that logs some metrics to stderr? Would it be possible to obtain that output from your testing environment?
Awesome! Will apply it shortly and report back tomorrow. Thanks again!
See https://github.com/iojs/io.js/issues/1151#issuecomment-79555929 from @brycebaril. Not sure if this would accrue over runtime but perhaps it's getting caught up in stats collection @ https://github.com/krakenjs/kappa/blob/ff14ef91ba39472ca07d9f8f8c009eab9875071f/lib/stats.js#L57-L66
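To make that hypothesis concrete, a purely illustrative sketch (not kappa's actual stats.js) of how a stats collector can accrue memory over runtime if samples are kept in an unbounded buffer:

```js
// Hypothetical illustration only: keeping every sample grows memory roughly
// linearly with uptime, which would surface as slow RSS growth.
var samples = [];
setInterval(function () {
  samples.push({ at: Date.now(), mem: process.memoryUsage() });
  // A bounded collector would drop old entries instead, e.g.:
  // if (samples.length > 1000) samples.shift();
}, 1000);
```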
I would expect to see a more periodic growth in that case. It's pretty weak but, if I squint, I see the growth correlating with spikes in requests.
@jasisk put the same patch in a PR: https://github.com/iojs/io.js/pull/1154
@jasisk here is the patch for logging: https://github.com/indutny/io.js/commit/01170cbe71edc38710a20173f3e67a0e036c1fcf , it will print the following every 5 seconds:
>>>> "43462684600589" counter "TLSWrap count", value "126"
>>>> "43462694565717" counter "WriteWrap count", value "1"
It should be enough to get these values when the RSS spikes, but it would be helpful to observe them over time too, if you have the resources for this ;)
Thank you!
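As an aside, a minimal sketch of how those stderr counter lines could be folded into per-counter series for graphing; the counters.log file name is an assumption:

```js
// Minimal sketch, assuming the ">>>> ..." counter lines above were redirected
// from stderr into counters.log; groups values per counter name so they can be
// lined up against the RSS graphs.
var fs = require('fs');

var series = {};
fs.readFileSync('counters.log', 'utf8').split('\n').forEach(function (line) {
  var m = /^>>>> "(\d+)" counter "([^"]+)", value "(\d+)"/.exec(line);
  if (m === null) return;
  (series[m[2]] = series[m[2]] || []).push({ time: m[1], value: Number(m[3]) });
});

console.log(JSON.stringify(series, null, 2));
```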
@jasisk gosh, the patch in #1154 was a bit incorrect :) Here is the proper patch that landed: https://github.com/iojs/io.js/commit/e90ed790c340af01b0b9daa7dec6ff52caccde77
@jasisk May I ask you to try these two patches together: https://github.com/indutny/io.js/commit/01170cbe71edc38710a20173f3e67a0e036c1fcf and https://github.com/indutny/io.js/commit/9f233aa4050808af4876ee5920df3436bfbe3cd1 ? They should log the number of live smalloc buffers. I wonder if we are unintentionally retaining any.
According to @jasisk there are no leaks of TLSWrap/WriteWrap instances. Hopefully there are some smalloc leaks and we'll be able to get somewhere from there ;)
Updated graph of e90ed79 + indutny@01170cb (yellow is io.js, fairly light traffic until this morning):
Logged data showed TLSWrap / WriteWrap getting collected.
Building and deploying with the smalloc logging now.
This might be an issue with the vm module or something related. From a conversation with @indutny:
Mar 05 16:38:34 <indutny> I'm getting tons of ContextifyScript WeakCallback invocations
Mar 05 16:38:37 <indutny> I wonder what this could mean
Mar 05 16:44:59 <indutny> it does leak even without networking at all
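For illustration only, a hedged minimal shape that churns ContextifyScript instances through the vm module; this is not known to be what kappa actually does:

```js
// Illustrative only: each call compiles a fresh script (a ContextifyScript
// internally) in a new context; if those are reclaimed slowly, RSS can climb
// even with no networking involved.
var vm = require('vm');

setInterval(function () {
  vm.runInNewContext('1 + 1', {});
}, 10);
```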
@jasisk Do you have any data for V8's heapUsed/heapTotal over the same time period (process.memoryUsage())?
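For context, a minimal sketch of the kind of sampling that would produce that data (the interval and use of stderr are assumptions): if heapUsed/heapTotal stay flat while RSS keeps climbing, the growth is in external allocations rather than the V8 heap.

```js
// Minimal sketch, not kappa code: sample process.memoryUsage() periodically so
// heapUsed/heapTotal can be graphed alongside RSS.
setInterval(function () {
  var m = process.memoryUsage();
  console.error('memory rss=%d heapTotal=%d heapUsed=%d', m.rss, m.heapTotal, m.heapUsed);
}, 5000).unref(); // unref so the sampler alone doesn't keep the process alive
```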
@trevnorris oh right! I forgot about it
@trevnorris doesn't look to be the case now, though.
@Olegas yup. Don't have much data to go off of in the latest deploy—and we're not yet seeing the sharp increase—but here's my heap data.
blue: io.js, red: 0.12.0; center line is the mean over 30m.
@jasisk is there any new data on your side? That would help a lot, thank you!
I'm diagnosing a problem running PayPal's npm proxy https://github.com/krakenjs/kappa on iojs. After 300 requests for a 50KB tarball, allocated memory grows from a 250MB base to 450MB.
I've run it under valgrind, and there are some things that look a bit suspicious around TLSWrap, smalloc, and some other crypto-related bits. The interesting parts of the valgrind report are:
Any help or suggestions for what to look at to track this down would be wonderful.
Node 0.12 and 0.10 don't show this behavior with the same code (we're stress-testing io.js in parallel to shake stuff like this loose).
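For anyone wanting to approximate the reproduction, a hedged sketch of the kind of load described above; the local URL, port, and tarball path are assumptions, not taken from this issue:

```js
// Hedged sketch: issue 300 sequential GETs for the same tarball against a
// locally running kappa instance, then print memory usage for comparison
// against the pre-load baseline.
var http = require('http');

var remaining = 300;
(function next() {
  if (remaining-- === 0) {
    console.log(process.memoryUsage());
    return;
  }
  http.get('http://localhost:8000/example-pkg/-/example-pkg-1.0.0.tgz', function (res) {
    res.resume();           // drain the body so the socket can be released
    res.on('end', next);
  });
})();
```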