Memory leak - Githubissues

icook commented 10 years ago

Long-running, high-load instances eat more and more RAM as they run, reaching as much as 4GB after several weeks at a few gigahash. This is possibly caused by:

Jobs not getting properly GCed. Likely a reference loop that Python cannot break.
Long connected peers keeping huge job mapping lists. Less likely.

icook commented 10 years ago

As an update on this, I've added a few debugging tools to the latest version and have been running it on our lowest use port. I tried using the tools while not running live, but it was difficult to tell what was "leaking" without some bigger numbers/longer runtimes.

The tool I've installed and used thus far dumps a list of how many instances of each type of object are held in memory. Then when you run it again, it shows how many more there are now vs last time you ran it.
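For readers who want to reproduce this kind of check, here is a minimal sketch of the idea using only the standard library. The thread does not name the tool that was actually used, so `type_counts` and `growth` are hypothetical helpers, not the tool's API.

```python
import gc
from collections import Counter


def type_counts():
    """Count live, GC-tracked objects by type name."""
    return Counter(type(obj).__name__ for obj in gc.get_objects())


def growth(previous, current, limit=10):
    """Return (type name, current count, increase) for types that grew."""
    deltas = [
        (name, count, count - previous.get(name, 0))
        for name, count in current.items()
        if count > previous.get(name, 0)
    ]
    # Largest increase first, so the fastest-growing types lead the report.
    deltas.sort(key=lambda row: row[2], reverse=True)
    return deltas[:limit]


if __name__ == "__main__":
    before = type_counts()
    # Hold some new objects so that something visibly grows between snapshots.
    held = [(object(), []) for _ in range(1000)]
    after = type_counts()
    for name, count, delta in growth(before, after):
        print(f"{name:<30} {count:>10} +{delta}")
```

Taking one snapshot, letting the server run for a while, and then taking a second snapshot gives the "how many more than last time" view described above. Note that `gc.get_objects()` only reports objects tracked by the garbage collector, so the counts are a lower bound.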

Results are roughly:

We seem to be very slowly leaking Transaction objects. I doubt this is the cause of the big leak. My guess is that something in the reporting engine holds a reference to the BlockTemplate long term, although I'm not sure.
Weakrefs are growing, which is no big surprise since that's what the job mapper dictionary holds. We knew this was growing infinitely. Still unlikely to account for gigabytes of RAM, unless I misunderstand how weakrefs are operating...
We are primarily growing tuples. Like, a lot of tuples. I think the growth was something like 1000x the number of weakrefs we'd grown.

I've added some code to dump more information about the tuples, and will wait and see the results.
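The actual debugging code is not shown in this thread. As a rough sketch of what "more information about the tuples" could look like, the hypothetical helper below groups live tuples by the types of their elements, which is usually enough to tell which code path is creating them.

```python
import gc
from collections import Counter


def tuple_shapes(limit=10):
    """Group live, GC-tracked tuples by the type names of their elements.

    Returns a list of (shape, count) pairs, most common first. Passing
    limit=None returns every shape.
    """
    shapes = Counter(
        tuple(type(item).__name__ for item in obj)
        for obj in gc.get_objects()
        if type(obj) is tuple
    )
    return shapes.most_common(limit)


if __name__ == "__main__":
    for shape, count in tuple_shapes():
        print(count, shape)
```

A shape that dominates the output and keeps growing between runs is the one worth tracing back to its owner.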

icook commented 9 years ago

It appears that it may actually be just the job mapper, which would make me feel silly. A new entry goes in the job mapper (as a 2-tuple) every job push or flush. We're doing about 370 push/flush actions per hour on litecoin servers, which with 383 workers (on the biggest litecoin port) is 3.4 million tuples a day, assuming no one disconnects.

I ran a quick test and it showed that 10 million of these tuples (including the weakref, etc.) take ~1 GB of RAM. So if this is the culprit we should see something like a 200-300 MB a day increase in usage on the largest server. I'm proposing the next step is to resolve this problem and then re-evaluate.
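The arithmetic behind that estimate can be checked with a few lines. The inputs are the figures quoted in this comment; the variable names are only illustrative.

```python
# Figures quoted in the comment above.
PUSHES_PER_HOUR = 370        # push/flush actions per hour
WORKERS = 383                # workers on the biggest litecoin port
HOURS_PER_DAY = 24
TEST_TUPLES = 10_000_000     # tuples created in the quick test
TEST_BYTES = 1_000_000_000   # ~1 GB observed for those tuples

# One entry per worker per push/flush, assuming nobody disconnects.
tuples_per_day = PUSHES_PER_HOUR * WORKERS * HOURS_PER_DAY
bytes_per_tuple = TEST_BYTES / TEST_TUPLES
mb_per_day = tuples_per_day * bytes_per_tuple / 1_000_000

print(f"{tuples_per_day:,} tuples/day")
print(f"{bytes_per_tuple:.0f} bytes/tuple")
print(f"{mb_per_day:.0f} MB/day")
```

With every worker connected all day this works out to roughly 340 MB a day, so it is an upper bound. The 200-300 MB figure is in the same range and presumably allows for workers disconnecting, which frees their entries.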

sbwdlihao commented 9 years ago

yes, we should remove old jobs from job mapper
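One way to do that is to cap the number of entries kept per connection and evict the oldest ones. The sketch below is only an illustration: `BoundedJobMapper` and `Job` are hypothetical names, the (weakref, difficulty) layout of the 2-tuple is an assumption, and this is not necessarily how #98 fixed it.

```python
import weakref
from collections import OrderedDict


class Job:
    """Stand-in for a real job object (hypothetical)."""

    def __init__(self, job_id):
        self.job_id = job_id


class BoundedJobMapper:
    """Per-client job map that keeps only the newest max_entries entries."""

    def __init__(self, max_entries=10):
        self.max_entries = max_entries
        self._entries = OrderedDict()

    def add(self, local_id, job, difficulty):
        # Assumed layout: a 2-tuple of (weak reference to the job, difficulty).
        self._entries[local_id] = (weakref.ref(job), difficulty)
        self._entries.move_to_end(local_id)
        # Evict the oldest entries once the cap is exceeded.
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)

    def flush(self):
        """Drop every entry, e.g. when old jobs become invalid."""
        self._entries.clear()

    def lookup(self, local_id):
        entry = self._entries.get(local_id)
        if entry is None:
            return None
        job_ref, difficulty = entry
        job = job_ref()
        if job is None:
            # The job was already garbage collected; drop the stale entry.
            del self._entries[local_id]
            return None
        return job, difficulty

    def __len__(self):
        return len(self._entries)
```

With a cap in place, memory per connection stays constant no matter how long a worker remains connected, at the cost of rejecting shares submitted against jobs older than the cap allows.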

ericecook commented 9 years ago

confirmed fixed by #98?

icook commented 9 years ago

Are we running this in prod anywhere?

ericecook commented 9 years ago

Pretty sure most of the vanilla coin stratums are running 6.0, so yes

ericecook commented 9 years ago

Oh I just realized you might have meant the patched PP. The PS ports are running 0.6.1 - not sure if it includes this patch or not

icook commented 9 years ago

Right, sorry I wasn't very clear on my question. I don't think we're running this anywhere, and until we are we won't be able to easily confirm it.

eightsixeight commented 9 years ago

still on master... it grows and i think it's the scheduler...

icook commented 9 years ago

@Fcases since master changes frequently could you provide the version number and more specific details?

eightsixeight commented 9 years ago

no idea, i git pulled today and restarted pp, as seen on #irc

PID USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
 2428 root       20   0  473M  190M  4644 S  1.3  4.8  0:18.33 powerpool_0
 2428 root       20   0  554M  271M  4644 S  0.7  6.9  0:26.85 powerpool_0
 2428 root       20   0 1545M 1262M  4644 S  1.3 31.9  2:50.15 powerpool_0

and going, im on master as of right now...

one thing is i think you added gevents module in latest which i didnt have.. i redid requirements and testing...

all i know is a git log shows..

commit dd5f139626098c830f399db224948376aac286be
Author: Isaac Cook isaac@simpload.com
Date: Tue Dec 2 14:39:54 2014 -0600

as latest entr

icook commented 9 years ago

> one thing is i think you added gevents module in latest which i didnt have

Gevent has been a requirement since the first commit of powerpool.

> no idea, i git pulled today and restarted pp, as seen on #irc

Odd, what kind of traffic is that server seeing? Most of our instances don't bloat nearly that quickly, regardless of being on v0.6.3 or v0.6.2.

eightsixeight commented 9 years ago

nothing much 2 miners 50mhs, anything you can think of i can do to figure out ?

tried valgrind but i don't know much about that

icook commented 9 years ago

@ericecook Is this still occurring? I don't believe we've had issues with this anymore.

ericecook commented 9 years ago

@icook I'm unsure. We haven't had memory issues since moving the scheduler out to cron jobs

icook commented 9 years ago

Ah, right. That should probably be documented in simplecoin_multi now that I think about it...

I'll go ahead and close this since it seems to be resolved by #98.

simplecrypto / powerpool

Memory leak #93