@rvagg also, considering that 0.10 has climbed to the levels of iojs, and 0.12 is below - it might be that PayPal's proxy was encountering uneven load (cc @jasisk). It never appears to have been the case that 0.10 was higher than 0.12 in previous graphs.
Nah, ignore the last sentence, but anyway - it could just be uneven load.
Uneven load sounds right. A ton of automated builds kicking off every hour is what drives those spikes, which are a lot less prominent during the week. Plus, I'm disproportionately driving traffic to those machines. They're in a round-robin but still. I'll pull some stats to confirm the theory but I'll be putting everything back in the pool and restarting tonight so we should have better numbers by tomorrow afternoon.
All of the commits have landed in v1.x at this point.
A small update after a day. Looks like it's still leaking a bit.
red: 0.12, green: 0.10, purple: iojs.
Noooo!!!! Wtf!!! :) Will look into it tomorrow. Anyway, it seems to be much better than it was, and at least we identified the major source of it.
@jasisk please continue your observations, I still have high hopes that it'll stabilize at some point :)
This might be of interest as well: @wraithan and I tracked down a state where `SSL_free` wasn't getting called on v0.10 streams.
@chrisdickinson this is very unlikely to be hit in io.js
@jasisk could you please take a look at ? It seems that there are some retained sockets in kappa's code. I remember you telling me about the leaks you had; I wonder if this is expected or if it is caused by some behavior change in io.js.
Intended to do some profiling today (particularly after @chrisdickinson's comment), will let you know.
For clarity (since I mentioned this to Fedor over IRC): we've had a slow leak in kappa for years, on the order of tens of MB after months of uptime. For our purposes, it's proven largely inconsequential, so it's dropped off our list time and again. Time to look into it. :)
Heh, thanks @jasisk! I can confirm some sort of leak on my side; going to let RSS grow to 1GB and then try to figure something out from the heap snapshot locally.
At least, valgrind seems to be happy at this point.
Nah, it stabilized at 450MB RSS. Going to take a look at the heap snapshot anyway.
@jasisk I was unable to identify any relevant information in the heap snapshot. Do you have any further graphs of RSS on your instance of kappa? Did it stabilize?
Same from my end.
Fresh start this morning (deployed 1a3ca82) and things haven't completely leveled off yet. That said, iojs is right in the mix which—previously—wasn't the case at this point in the day. Much more confident in saying you've caught it!
0.10 in purple, 0.12 in yellow, iojs in red. [image: RSS] https://cloud.githubusercontent.com/assets/62923/6563251/25edf830-c676-11e4-8d83-233d82af7e66.png
I guess we could assume it is resolved now? In any case - feel free to reopen this ;)
:+1: thanks again!
Thank you!
Following up now that a few days have elapsed. While things are looking significantly better, there still appears to be a leak.
purple: iojs ~v1.5.1 (hash mentioned above), blue: 0.12, red: 0.10
Thanks @jasisk. Going to reopen.
WHY?! :disappointed:
@jasisk Do you have graph data over more time? The line looks like it leveled off at 1.1GB. Not saying that's acceptable, but it's curious why near-linear growth would have done that.
Was looking at that myself. A tiny bit more data—and we just spiked across all instances:
At 1.2GB, I have no reason to kick any of these over. I'll keep things running and check back in when it's clear a trend has emerged.
@jasisk this is very frustrating, I had high hopes for the latest patch. Since I'm unable to reproduce this leak locally, may I ask you to install the `heapdump` module, `require('heapdump')` somewhere in the code, and send `SIGUSR2` a couple of times while the leak is showing (ideally once it's significantly higher than v0.12 and v0.10)? I wonder if it'll fall back to the v0.10/v0.12 levels after `SIGUSR2`; `heapdump` starts a GC internally, so it might force-collect some stuff. Also, it would be great to take a look at the heap snapshot itself. Please contact me privately if you wish to do this, since the snapshot might contain sensitive data.
Thank you for the follow-up!
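For reference, a minimal sketch of the heapdump setup being requested here; the programmatic snapshot path is an illustrative assumption, not something from kappa:

```js
// Minimal sketch, not kappa's actual code: requiring heapdump early installs a
// SIGUSR2 handler, so `kill -USR2 <pid>` makes the process write a
// heapdump-*.heapsnapshot file into its current working directory.
var heapdump = require('heapdump');

// Snapshots can also be taken programmatically instead of via the signal:
heapdump.writeSnapshot('/tmp/kappa-' + Date.now() + '.heapsnapshot', function (err) {
  if (err) console.error('heap snapshot failed:', err);
});
```

The resulting .heapsnapshot files load directly into the Chrome DevTools heap profiler for comparison between runs.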
@indutny Yup. I was convinced of the fix 3 days ago, so I never did any profiling locally. After letting these processes go for a few more hours, I'll pull them down and instrument them. I'll take a first pass at what I find and, if nothing obvious jumps out, I'll follow up with you elsewhere.
@jasisk thank you! This sounds awesome.
Pulling down a couple of machines to instrument them but figured I'd give one more update before I do (iojs vs 0.12).
Still trending upward, unfortunately. Will update if / when the heap snapshots give me any insight.
@jasisk Have you changed any flags like `--max-old-space-size`? If not, the process should either be butting up against the GC (if there's a problem in the V8 heap), or there's an external allocation run amok.
@trevnorris nope. No flags.
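For completeness, a hedged way to double-check that from inside the running process (nothing kappa-specific assumed):

```js
// Hedged sketch: process.execArgv holds the node/iojs flags the process was
// launched with (script arguments live in process.argv), so an empty array
// here confirms no --max-old-space-size or other V8 flags were passed.
console.error('exec flags:', JSON.stringify(process.execArgv));
```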
Heap dumps aren't showing anything out of the ordinary.
Here's the used heap graph for the iojs instance (yellow is averaging over 2 hours):
@indutny seems like external allocations are leaking somewhere. thoughts?
@trevnorris don't have any thoughts yet, fighting with https://github.com/iojs/io.js/pull/1140 at the moment.
Will take a look later today.
@jasisk not sure how I missed it last time. May I ask you to give this patch a try?
```diff
diff --git a/src/tls_wrap.cc b/src/tls_wrap.cc
index 49523bc..495729c 100644
--- a/src/tls_wrap.cc
+++ b/src/tls_wrap.cc
@@ -310,8 +310,10 @@ void TLSWrap::EncOut() {
   write_req->Dispatched();

   // Ignore errors, this should be already handled in js
-  if (!r)
+  if (!r) {
     NODE_COUNT_NET_BYTES_SENT(write_size_);
+    write_req->Dispose();
+  }
 }
```
@jasisk updated the diff, there was a typo!
Though, this might not be enough by itself. @jasisk: what if I give you a patch that logs some metrics to stderr? Would it be possible to obtain that output from your testing environment?
Awesome! Will apply it shortly and report back tomorrow. Thanks again!
See https://github.com/iojs/io.js/issues/1151#issuecomment-79555929 from @brycebaril. Not sure if this would accrue over runtime but perhaps it's getting caught up in stats collection @ https://github.com/krakenjs/kappa/blob/ff14ef91ba39472ca07d9f8f8c009eab9875071f/lib/stats.js#L57-L66
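To make that hypothesis concrete, a purely illustrative sketch (not kappa's actual stats.js) of how a stats collector can accrue memory over runtime if samples are kept in an unbounded buffer:

```js
// Hypothetical illustration only: keeping every sample grows memory roughly
// linearly with uptime, which would surface as slow RSS growth.
var samples = [];
setInterval(function () {
  samples.push({ at: Date.now(), mem: process.memoryUsage() });
  // A bounded collector would drop old entries instead, e.g.:
  // if (samples.length > 1000) samples.shift();
}, 1000);
```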
I would expect to see a more periodic growth in that case. It's pretty weak but, if I squint, I see the growth correlating with spikes in requests.
@jasisk put the same patch in a PR: https://github.com/iojs/io.js/pull/1154
@jasisk here is the patch for logging: https://github.com/indutny/io.js/commit/01170cbe71edc38710a20173f3e67a0e036c1fcf , it will print the following every 5 seconds:
>>>> "43462684600589" counter "TLSWrap count", value "126"
>>>> "43462694565717" counter "WriteWrap count", value "1"
It should be enough to get these values when the RSS spikes, but it would be helpful to observe them over time too, if you have the resources for this ;)
Thank you!
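As an aside, a minimal sketch of how those stderr counter lines could be folded into per-counter series for graphing; the counters.log file name is an assumption:

```js
// Minimal sketch, assuming the ">>>> ..." counter lines above were redirected
// from stderr into counters.log; groups values per counter name so they can be
// lined up against the RSS graphs.
var fs = require('fs');

var series = {};
fs.readFileSync('counters.log', 'utf8').split('\n').forEach(function (line) {
  var m = /^>>>> "(\d+)" counter "([^"]+)", value "(\d+)"/.exec(line);
  if (m === null) return;
  (series[m[2]] = series[m[2]] || []).push({ time: m[1], value: Number(m[3]) });
});

console.log(JSON.stringify(series, null, 2));
```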
@jasisk gosh, the patch in #1154 was a bit incorrect :) Here is the proper patch that landed: https://github.com/iojs/io.js/commit/e90ed790c340af01b0b9daa7dec6ff52caccde77
@jasisk May I ask you to try these two patches together: https://github.com/indutny/io.js/commit/01170cbe71edc38710a20173f3e67a0e036c1fcf and https://github.com/indutny/io.js/commit/9f233aa4050808af4876ee5920df3436bfbe3cd1 ? They should log the number of live smalloc buffers. I wonder if we are unintentionally retaining any.
According to @jasisk there are no leaks of TLSWrap/WriteWrap instances. Hopefully there are some smalloc leaks and we'll be able to get somewhere from there ;)
Updated graph of e90ed79 + indutny@01170cb (yellow is io.js, fairly light traffic until this morning):
Logged data showed TLSWrap / WriteWrap getting collected.
Building and deploying with the smalloc logging now.
This might be an issue with the vm module or something related. From a conversation with @indutny:
Mar 05 16:38:34 <indutny> I'm getting tons of ContextifyScript WeakCallback invocations
Mar 05 16:38:37 <indutny> I wonder what this could mean
Mar 05 16:44:59 <indutny> it does leak even without networking at all
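For illustration only, a hedged minimal shape that churns ContextifyScript instances through the vm module; this is not known to be what kappa actually does:

```js
// Illustrative only: each call compiles a fresh script (a ContextifyScript
// internally) in a new context; if those are reclaimed slowly, RSS can climb
// even with no networking involved.
var vm = require('vm');

setInterval(function () {
  vm.runInNewContext('1 + 1', {});
}, 10);
```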
@jasisk Do you have any data for V8's heapUsed/heapTotal over the same time period (process.memoryUsage())?
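For context, a minimal sketch of the kind of sampling that would produce that data (the interval and use of stderr are assumptions): if heapUsed/heapTotal stay flat while RSS keeps climbing, the growth is in external allocations rather than the V8 heap.

```js
// Minimal sketch, not kappa code: sample process.memoryUsage() periodically so
// heapUsed/heapTotal can be graphed alongside RSS.
setInterval(function () {
  var m = process.memoryUsage();
  console.error('memory rss=%d heapTotal=%d heapUsed=%d', m.rss, m.heapTotal, m.heapUsed);
}, 5000).unref(); // unref so the sampler alone doesn't keep the process alive
```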
@trevnorris oh right! I forgot about it
@trevnorris doesn't look to be the case now, though.
@Olegas yup. Don't have much data to go off of in the latest deploy—and we're not yet seeing the sharp increase—but here's my heap data.
blue: io.js, red: 0.12.0; center line is the mean over 30m.
@jasisk is there any new data on your side? That would help a lot, thank you!
I'm diagnosing a problem running PayPal's npm proxy https://github.com/krakenjs/kappa on iojs. After 300 requests for a 50KB tarball, allocated memory grows from a 250MB base to 450MB.
I've run it under valgrind, and there are some things that look a bit suspicious around TLSWrap, smalloc, and some other crypto-related bits. The interesting parts of the valgrind report are:
Any help or suggestions for what to look at to track this down would be wonderful.
Node 0.12 and 0.10 don't show this behavior with the same code (we're stress-testing io.js in parallel to shake stuff like this loose).
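For anyone wanting to approximate the reproduction, a hedged sketch of the kind of load described above; the local URL, port, and tarball path are assumptions, not taken from this issue:

```js
// Hedged sketch: issue 300 sequential GETs for the same tarball against a
// locally running kappa instance, then print memory usage for comparison
// against the pre-load baseline.
var http = require('http');

var remaining = 300;
(function next() {
  if (remaining-- === 0) {
    console.log(process.memoryUsage());
    return;
  }
  http.get('http://localhost:8000/example-pkg/-/example-pkg-1.0.0.tgz', function (res) {
    res.resume();           // drain the body so the socket can be released
    res.on('end', next);
  });
})();
```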