isaacs / node-lru-cache

A fast cache that automatically deletes the least recently used items
http://isaacs.github.io/node-lru-cache/
ISC License

Performance and Memory Issues (Worse now with 3.2.0) #63

Closed: tolmasky closed this issue 8 years ago

tolmasky commented 8 years ago

Hi guys,

We've been using lru-cache in our product, tonicdev, for a couple of weeks now and have noticed some performance and memory issues. We use it all the time and love the work you guys are doing. For some context, we set the size of the cache to 64 * 1000 * 1000 (~64MB). Our cache stores strings for both keys and values (path1 -> path2), so the length calculation is just key.length + value.length (this isn't super accurate since it's not one byte per char, but that's inconsequential for what we're experiencing). So the cache can hold roughly 800,000 items.
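For concreteness, the setup looks roughly like this (a simplified sketch, not our exact code):

```js
var LRU = require("lru-cache");

// Simplified sketch of the configuration described above (not our exact
// code). max is a total "length" budget; length() says how much each
// entry counts toward it.
var cache = LRU({
  max: 64 * 1000 * 1000, // ~64MB, assuming ~1 byte per character
  length: function (value, key) {
    // some versions pass only the value here, so guard the key
    return value.length + (key ? key.length : 0);
  }
});

cache.set("/some/source/path", "/some/resolved/path");
cache.get("/some/source/path"); // "/some/resolved/path"
```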

Basically, we are seeing an explosion of memory use and, separately, severe performance degradation once we hit the size limitation of the cache (the work of a cache miss far outweighs the benefits of cache hits). We are currently running 2.7.0, but this has gotten significantly worse in 3.2.0. @pouwerkerk has collected the results of a reduction we wrote (which measures worst-case performance: constantly missing the cache, so every insertion triggers a removal):

[screenshot: insertion timing chart for lru-cache 2.7.0]

Here you can see that once we hit the magic cutoff where the cache reaches its size limit, performance degrades to tens of milliseconds per retrieval (whereas before that point it had sub-millisecond performance).

Now this is the performance of 3.2.0:

[screenshot: insertion timing chart for lru-cache 3.2.0]

We believe the culprit is the new reverseKeys function, which seems to create an array the size of the entire cache on every call.
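The problematic pattern looks something like this (an illustrative sketch, not the actual lru-cache source):

```js
// Illustrative sketch, not the actual lru-cache 3.x source: building a
// newest-first key array from the underlying map on every eviction pass.
function reverseKeys (map) {
  var keys = new Array(map.size);
  var i = map.size - 1;
  map.forEach(function (value, key) {
    keys[i--] = key; // maps iterate oldest-first, so fill backwards
  });
  return keys;
}

// With ~800,000 entries, every call allocates and fills an 800,000-element
// array, so each cache miss past the size limit pays O(n) time and garbage.
```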

Either way, we have taken the time to write an alternate implementation that uses a doubly linked list (based on the ideas here: http://www.geeksforgeeks.org/implement-lru-cache/ ), which actually slightly improves performance once the max cache size is hit, since no new objects ever need to be created (graph below; 3.2.0 left out):

[screenshot: insertion timing chart comparing fast-lru and lru-cache 2.7.0]

We've put the code up here: https://github.com/playgroundtheory/fast-lru . It also contains the test cases in the test/ directory; you can npm install each one and just run node main.js. We've also put our data here: https://docs.google.com/spreadsheets/d/1fzdFYhtd2_-BLimCPGAW_qJhKsDwigKBvWxx_n_gkDs/edit#gid=0
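To illustrate the core idea, here is a simplified sketch of the map + doubly-linked-list approach (not the actual fast-lru source; for brevity it counts items rather than summing string lengths):

```js
// Simplified sketch of the map + doubly-linked-list approach (not the
// actual fast-lru source). Lookup is O(1) via the map; recency ordering
// is O(1) pointer surgery on the list; eviction reuses the tail node.
function Cache (max) {
  this.max = max;
  this.size = 0;
  this.map = Object.create(null); // key -> list node
  this.head = null;               // most recently used
  this.tail = null;               // least recently used
}

Cache.prototype.get = function (key) {
  var node = this.map[key];
  if (!node) return undefined;
  this.promote(node);
  return node.value;
};

Cache.prototype.set = function (key, value) {
  var node = this.map[key];
  if (node) {                     // existing key: update in place
    node.value = value;
    this.promote(node);
    return;
  }
  if (this.size >= this.max) {    // full: recycle the LRU node, zero garbage
    node = this.tail;
    delete this.map[node.key];
    node.key = key;
    node.value = value;
    this.promote(node);
  } else {                        // still filling: allocate a fresh node
    node = { key: key, value: value, prev: null, next: null };
    this.size++;
    if (this.head) {
      node.next = this.head;
      this.head.prev = node;
      this.head = node;
    } else {
      this.head = this.tail = node;
    }
  }
  this.map[key] = node;
};

Cache.prototype.promote = function (node) {
  if (node === this.head) return;
  node.prev.next = node.next;     // unlink (node is not head, so prev exists)
  if (node.next) node.next.prev = node.prev;
  else this.tail = node.prev;     // node was the tail
  node.prev = null;               // relink at the head
  node.next = this.head;
  this.head.prev = node;
  this.head = node;
};
```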

We wanted to run this by you guys because we know lots of packages depend on this, but we only wrote our LRU to serve our needs for now (we didn't bother with most of the methods, maxAge, etc.). All of those are certainly doable and shouldn't affect anything, but they will take a while to complete. We were hoping to get your feedback on whether you think a merge would be worthwhile; if so, we could start moving in that direction, and if not, we can just keep ours up in the separate repo. Also, we only just wrote this and will be spending the rest of the day integrating it into our system, so we would of course also appreciate a sanity check that we didn't write anything incorrectly that could explain the discrepancy.

Thanks!

Francisco

pouwerkerk commented 8 years ago

Thanks for helping us with this; we are big fans of lru-cache!

Here's a working link: https://docs.google.com/a/playgroundtheory.com/spreadsheets/d/1fzdFYhtd2_-BLimCPGAW_qJhKsDwigKBvWxx_n_gkDs/pubhtml

haduythuan commented 8 years ago

Yeah, waiting for the release.

Thanks.

isaacs commented 8 years ago

Thanks for the test case and alternate implementation, @tolmasky. I'll take a look at this soon and try to figure out a way to avoid the array step while still keeping the logic consistent.

You're probably right, and the Map approach will need to be swapped out entirely for a simple doubly linked list. It avoided one performance cliff, but I see it caused another. The unfortunate thing is that it'll probably have to be a major version bump, because the Map object is exposed. Oh well. :)

kshmir commented 8 years ago

We are also having similar issues. After we updated the library version, CPU time went up dramatically. I will check times with fast-lru and comment on the changes.

kshmir commented 8 years ago

So far so good, will update with a comparison tomorrow.

kshmir commented 8 years ago

fast-lru seems to work great, but we can't add it to our code yet since we need maxAge and forEach. I forked it and might add them later.

isaacs commented 8 years ago

I took a look at this, and I think a Map object is just never going to work, because an LRU fundamentally needs both forward and backward iteration.
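A minimal illustration of the limitation (using a modern Map; the module actually ships a pseudomap polyfill for older nodes):

```js
// Maps only iterate in insertion order, oldest entry first:
var m = new Map([["a", 1], ["b", 2], ["c", 3]]);
for (var key of m.keys()) console.log(key); // a, b, c

// There is no built-in way to walk newest-first; you have to materialize
// every key into a throw-away array and reverse it, which is O(n) per call:
var newestFirst = Array.from(m.keys()).reverse(); // ["c", "b", "a"]
```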

I looked at the doubly linked list implementations out there and didn't find one I liked. Rather than write yet another doubly linked list inline like I'd usually do, I decided to make a module for it, especially since correct behavior is so important to this use case, and so it should have its own tests and docs. https://github.com/isaacs/yallist came out of that (and, btw, a doubly linked list with 100% test coverage and no private anything is a pretty fun 2-hour plane-ride activity, if you ever find yourself in that situation).

This weekend, I'll try to find some time to port lru-cache to use yallist instead of pseudomap. It should be a big improvement. I worry a little bit about the GC overhead of cleaning up a lot of objects, but it should probably not be much worse than what's there already, and the trade-off of not constructing giant throw-away arrays will almost certainly make up for it.

v4 coming soon :)

isaacs commented 8 years ago

I will have to add a method to yallist to pluck a node out of place and move it to the front or back, but that's pretty easily done.

tolmasky commented 8 years ago

I'm not sure I totally understand, but fast-lru uses both a map and a linked list. The map is required for O(1) lookup time in the cache; the linked list is maintained for the recency calculation. But maybe we are talking about different maps.

tolmasky commented 8 years ago

Also, if you look at fast-lru, the GC characteristics should be very good. You create one new node per insertion up until you hit the max cache size, at which point no new objects are ever created again (this is why we were saying fast-lru has the nice property of actually getting faster once you hit the max cache size).

isaacs commented 8 years ago

@tolmasky thanks, that's good to know.

Yeah, for lookup you still need a map, but you can't rely on it for recently-used-ness.

I'm going to have the map store references to the Yallist nodes, and have the nodes store the hit objects as their data (for tracking maxAge, etc.). The nice thing is that the mru/lru counters then become unnecessary, because recency is implicit in the list order.
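Roughly this layout (a sketch, not the final code; eviction and maxAge checks omitted):

```js
var Yallist = require("yallist");

// Rough sketch, not the final lru-cache code: the map's values are list
// nodes, and each node's data is a "hit" record carrying the real value.
var list = new Yallist();
var map = new Map();

function set (key, value, maxAge) {
  var hit = { key: key, value: value, now: Date.now(), maxAge: maxAge };
  if (map.has(key)) list.removeNode(map.get(key)); // drop the stale node
  list.unshift(hit);        // newest entries live at the head of the list
  map.set(key, list.head);  // store the node itself, not just the value
}

function get (key) {
  var node = map.get(key);
  if (!node) return undefined;
  list.unshiftNode(node);   // O(1) move-to-front; recency is list order
  return node.value.value;  // node.value is the hit, hit.value is the data
}
```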

tolmasky commented 8 years ago

Yeah, that's what we do in fast-lru. I'm curious whether there's any reason not to just take our existing work and merge it into lru-cache; it seems to work great (we have it in production now, and @kshmir seems to confirm). As I mentioned in the original report, we'd be happy to fill out the missing methods rather than have you replicate the work.

isaacs commented 8 years ago

@tolmasky That's a good suggestion. I appreciate it, and I'll definitely keep the offer in mind. At first glance, I'd have to refactor a bunch of it anyway (mostly because of 0.8 and 0.10 support), but that's not a huge blocker.

The more compelling (though less rational) reason is that I get so few opportunities to sink my teeth into code these days, and I've been itchy to tear this module apart for a long time, so I'm probably going to end up doing it for my own spiritual satisfaction more than anything else :) I've got a few hours blocked out tomorrow with nothing else to do, and I'm eagerly looking forward to it. If it looks like it can't happen in the time I have for it, I'll probably just pull in what you've done. (I may crib from it liberally anyway, with appropriate deference to your MIT license if I do so.)

isaacs commented 8 years ago

Published lru-cache 4.0.0. It was a good time to clean up the API and bring up the test coverage, too.

I updated the spreadsheet with the insertion timing tests from the new version. The spikes are GC pauses. If you run the script with --expose-gc and call gc() at a consistent interval (e.g., every time the numbers are printed out), then the curve is a lot smoother. I'm seeing slightly slower insertion times with lru-cache 4 than with fast-lru, but that could just be that you have a faster computer than mine ;) I didn't get a chance to run an apples-to-apples comparison on my laptop. It could also be that lru-cache is actually slower due to some other difference, but it should be relatively minor. I haven't dug into the --perf v8 analysis to see where it's spending time.
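The harness pattern is roughly this (a hypothetical bench.js, not the actual test script):

```js
// Hypothetical bench.js sketching the measurement pattern described above
// (not the actual test script). Run with: node --expose-gc bench.js
var LRU = require("lru-cache");

var cache = LRU({
  max: 64 * 1000 * 1000,
  length: function (value, key) {
    // some versions pass only the value here, so guard the key
    return value.length + (key ? key.length : 0);
  }
});

var pad = new Array(60).join("x"); // ~80 bytes per entry: limit hits ~800k
var total = 0;

for (var i = 1; i <= 2000000; i++) {
  var start = process.hrtime();
  cache.set("key-" + i, pad + i); // always a fresh key: worst-case misses
  var diff = process.hrtime(start);
  total += diff[0] * 1e3 + diff[1] / 1e6; // elapsed milliseconds

  if (i % 10000 === 0) {
    console.log(i, (total / 10000).toFixed(4), cache.length);
    total = 0;
    if (global.gc) global.gc(); // collect at a consistent point each interval
  }
}
```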

At any rate, the curves look the same, and are both flat (with GC spikes), so I think the core issue is pretty much solved. Thank you all so much for the research and for pushing for the correct implementation.

thomheymann commented 8 years ago

Just to follow up with a performance comparison on node v4.2.1:

fast-lru@3.0.0

Iters   Avg     Bytes
1900000 0.0148  63999936
1910000 0.0148  63999936
1920000 0.0148  63999936
1930000 0.0804  63999936
1940000 0.0148  63999936
1950000 0.0143  63999936
1960000 0.0141  63999936
1970000 0.0151  63999936
1980000 0.0151  63999936
1990000 0.014   63999936
2000000 0.0141  63999936

lru-cache@3.2.0

Iters   Avg     Bytes
800000  0.118   57600072
810000  0.105   58320072
820000  0.122   59040072
830000  0.118   59760072
840000  0.116   60480072
850000  0.105   61200072
860000  0.116   61920072
870000  0.119   62640072
880000  0.117   63360072
^C

(Hangs with 100% CPU usage after 880,000 iterations, so I had to kill the process; this is reliably reproducible.)

lru-cache@4.0.0

Iters   Avg     Bytes
1900000 0.151   63999936
1910000 0.155   63999936
1920000 0.17    63999936
1930000 0.181   63999936
1940000 0.199   63999936
1950000 0.217   63999936
1960000 0.198   63999936
1970000 0.25    63999936
1980000 0.278   63999936
1990000 1.934   63999936
2000000 0.156   63999936

(Starts to act up after the 1,500,000-iteration mark, with the average building up from 0.151 to 1.9, then dropping back down and building up again.)

I think fast-lru seems to be performing best for us for now. Great work, @tolmasky!