parse-community / parse-server

Parse Server for Node.js / Express
https://parseplatform.org
Apache License 2.0

ParseServer memory increases with multiple cascaded queries #3977

Closed jfalcone closed 7 years ago

jfalcone commented 7 years ago

Issue Description

I want to preface my submission with the observation that there have been several reports in this repo about memory leaks / usage issues but all of them have been closed without a specific resolution. One was closed with "fixed with redis" which I didn't find very satisfying.

We've been struggling with ParseServer memory increases for the past month on a system that is quite simple.

We started with an existing Cloud Code implementation of a route which had one query and one save and was running very stable for years on Parse.com and then on NodeChef.

Recently, we added two more queries and an additional save to this route. When these extra queries and save are activated with a load of less than 1 request per second, we see the memory of our ParseServer instances on NodeChef grow by about 40KB per second. The processing load is not significant and the garbage collection runs relatively often, but it does not reclaim the memory, which continues to grow until we run out of memory in the containers. We've had to use memory limit rules to restart the containers to get around this.

Our code is pretty straightforward. We use the results of the previous queries when processing the subsequent queries so we use _.partial to pass the values along. We've also implemented this code as promises and as one big function previously in an attempt to reduce the memory consumption but it has remained relatively constant.

We've read other reports of similar behavior. One person reflected that it's just an endemic problem and you need to rewrite from Parse to XYZ. Another said that the nature of V8 GC is that it falls behind and needs to catch up. Or that we have hidden globals, etc, etc. Or just restart your containers every so often (which is what we're doing).

Well our system is just not being stressed enough to realistically cause these problems and we've reviewed and reimplemented the code multiple times. And we started with a code base that did not grow memory like this.

The data sets being queried are also very small - on the order of a few hundred items in each collection. The JSON backup files for the data are under 100KB. (Before anyone suggests it, we could probably (and still might) rewrite the whole thing without Parse queries and instead store the data in JS arrays and objects, but it would be a lot less maintainable).

Folks have also mentioned caching but again how do you eat up 140MB of memory caching 100KB of data?

Steps to reproduce

Essentially string together a series of queries which make use of the results of previous queries.

app.post('/points', auth, function(req, res) {
    var busSerial = req.body.serial_number;
    // FIRST QUERY
    var query = new Parse.Query("BusData");
    query.equalTo("serial_number", busSerial);
    query.find({
        success: _.partial(queryHandler, req, res),
        error: function(error) {
            res.send();
        }
    });
});

function queryHandler(req, res, results) {
    // normal processing using results of first query (elided; this is where bus, busRoute and point get defined)...
    // FIRST SAVE
    bus.save(null, {
        useMasterKey: true,
        success: function(newBus) {
            res.send();
        },
        error: function(newBus, error) {
            res.send();
        }
    });
    // SECOND QUERY
    var query2 = new Parse.Query("RouteStops");
    query2.equalTo('routeCode', busRoute);
    query2.ascending('stopOrder');
    query2.find({
        success: _.partial(query2Handler, bus, point, busRoute),
        error: function(error) {}
    });
}

function query2Handler(bus, point, busRoute, results) {
    // more processing using results of first and second queries (elided; this is where stopNames and stopInfo get defined)...
    // THIRD QUERY
    var query3 = new Parse.Query("StopData");
    query3.near("loc", point);
    query3.containedIn("stop", stopNames);
    query3.limit(5);
    //console.log('POINTS: processArrivalStop geoquery.find');
    query3.find({
        success: _.partial(query3Handler, bus, point, busRoute, stopInfo),
        error: function(error) {}
    });
}

function query3Handler(bus, point, busRoute, stopInfo, results) {
    // more processing using results of first, second and third queries...
    // SECOND SAVE
    bus.save(null, {
        useMasterKey: true,
        success: function(newBus) {},
        error: function(newBus, error) {}
    });
}
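
For reference, here is a rough sketch of how the same cascade looks as a flat promise chain (one of the variants we tried); the elided processing is marked, and is what defines bus, point, busRoute and stopNames:

app.post('/points', auth, function(req, res) {
    var bus, point, busRoute, stopNames;   // filled in by the elided processing steps
    var query = new Parse.Query("BusData");
    query.equalTo("serial_number", req.body.serial_number);
    query.find().then(function(results) {
        // normal processing using results of first query (elided)...
        var query2 = new Parse.Query("RouteStops");
        query2.equalTo("routeCode", busRoute);
        query2.ascending("stopOrder");
        return query2.find();
    }).then(function(stops) {
        // more processing using results of first and second queries (elided)...
        var query3 = new Parse.Query("StopData");
        query3.near("loc", point);
        query3.containedIn("stop", stopNames);
        query3.limit(5);
        return query3.find();
    }).then(function(nearbyStops) {
        // more processing using results of all three queries (elided)...
        return bus.save(null, { useMasterKey: true });
    }).then(function() {
        res.send();
    }, function(error) {
        res.send();
    });
});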

Expected Results

Flat memory usage

Actual Outcome

The memory of each container increases until the system can no longer process requests. The memory increase is directly proportional to the number of times this particular route is invoked.

We've narrowed the source of the problem down to the code in the route from the second query onwards. When we activate that part of the code, the memory increases and when we deactivate it, the memory usage becomes flat.

Environment Setup

Logs/Trace

We've been using memwatch-next to monitor garbage collection. Occasionally there will be a big GC of 40+MB - but that might happen just once per day. Usually we just see a couple of megabytes at most and the growth outpaces what the GC is collecting.
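
For context, the monitoring is just the stock memwatch-next listeners; a minimal sketch of that kind of instrumentation (field names per the memwatch-next docs):

var memwatch = require('memwatch-next');

memwatch.on('stats', function(stats) {
    // emitted after every full GC; current_base is the heap size left after the collection
    console.log('post-gc heap base:', stats.current_base, 'usage trend:', stats.usage_trend);
});

memwatch.on('leak', function(info) {
    // emitted when the heap has grown steadily across several consecutive GCs
    console.log('possible leak:', info.reason);
});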

We've had no abnormal reports in the logs in verbose mode.

flovilmart commented 7 years ago

Thanks for the report, this needs a thorough investigation through a heap dump so we can find the culprit(s). Note also that depending on the nodejs version you may or may not see such increases.

Stability and reliability are a top priority, I'll have a deeper look at that issue.

benishak commented 7 years ago

If you are not using Redis, please consider using it to reduce the memory usage of Parse Server. Parse Server uses InMemoryCache by default to cache things like sessions, schemas and user info. The cache size will increase over time, and as you might know this cache is just an object that Parse Server stores data in; it is completely stateful and part of the Parse Server process itself, so whenever Parse Server puts things in that cache, your Parse Server node process grows in memory usage.

Another thing: InMemoryCache implements TTL by firing a setTimeout that calls a function to destroy/delete the cache entry, and that pending call is also held in memory until it executes and is released.

setTimeout(function() {
    delete cache[key];
}, ttl);

So you should really use Redis.
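
For reference, a minimal sketch of wiring that up, assuming you run Parse Server yourself (the credentials and URLs below are placeholders):

var ParseServer = require('parse-server').ParseServer;
var RedisCacheAdapter = require('parse-server').RedisCacheAdapter;

var api = new ParseServer({
    databaseURI: 'mongodb://localhost:27017/dev',
    appId: 'myAppId',
    masterKey: 'myMasterKey',
    serverURL: 'http://localhost:1337/parse',
    // replaces the default InMemoryCache with a Redis-backed cache
    cacheAdapter: new RedisCacheAdapter({ url: 'redis://localhost:6379' })
});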

In JavaScript you have no control over the garbage collection and you don't know when it is going to run. Garbage collection destroys variables that are no longer used.

I suggest manually destroying the variables after you finish using them:

query.find().then(function(res) {
    // do something with res
    // ...
    res = undefined;
    anotherVariable = undefined;
});

I remember before Redis, memory usage was increasing and reached 1.5GB in 2-3 days, so I had to restart node every 2 days. This issue was gone after I used a local Redis and limited the size of the Redis to 2GB; the Parse Server memory usage never exceeded 300MB after that.

flovilmart commented 7 years ago

I've run a local profiler and I can confirm that the cache invalidations are working as expected under low load, with the heap returning to the initial size after a burst of intensive requests. The cache has multiple options, besides moving to a redis-based cache. One thing that can grow the cache is the users / roles, and we could probably move the InMemoryCache to use the LRU cache strategy by default or cap the size.
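
For illustration, the LRU strategy amounts to something like this with the lru-cache package (numbers are illustrative; the point is that both the entry cap and the TTL bound the cache size):

var LRU = require('lru-cache');

var cache = LRU({
    max: 10000,          // cap on the number of entries before eviction kicks in
    maxAge: 5 * 1000     // entries expire after 5 seconds
});

cache.set('user:abc123', { sessionToken: 'r:example' });
var hit = cache.get('user:abc123');   // undefined once evicted or expired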

jfalcone commented 7 years ago

Thanks for the comments...

  1. We do null out our variables in each portion of the code to indicate that they are no longer used - I just removed all that code from what I posted so that it wouldn't be so long. That didn't make much difference. There are 3 loops in the code (the queries are not in the loops but we do iterate over results from query2 and query3). Later today I will post a sanitized version of the code.

  2. It's not a "user logging in" kind of system (in fact there is only one user defined in Parse) - we just basic auth the post to the route and start processing.

  3. With respect to what happens after a burst of requests, I've used Postman Collection runner to burst 3 hours worth of requests in 15 minutes and the memory usage climbs straight up so my experience isn't the same.

  4. Our system is going live on July 1 so I don't have much opportunity to do instrumentation, but if you guys tell me what to do to gather info, I can do that. I'm in a hosted NodeChef environment so it's not clear whether or not I can do heapdumps into files.

flovilmart commented 7 years ago

If you have enough to replay your load like you did, I recommend you test locally. The easiest way to gather heap dumps would be to use the heapdump package and add require('heapdump') as the first line of your Cloud Code main. Then gather the heap dump and see whether it's the cache that's blowing up or something else. You can also post the heap dump so I can have a look.
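
Roughly (the snapshot path below is just an example):

var heapdump = require('heapdump');   // first line of Cloud Code main

// write a snapshot programmatically, e.g. from a debug-only route or a timer
heapdump.writeSnapshot('/tmp/parse-' + Date.now() + '.heapsnapshot', function(err, filename) {
    if (err) console.error(err);
    else console.log('heap snapshot written to', filename);
});

// alternatively, sending SIGUSR2 to the node process writes a snapshot
// into the current working directory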

jfalcone commented 7 years ago

That's a good suggestion to test locally... but that will have to wait until after 4th of July week for me to set that up.

I have been able to run heap diffs using memwatch-next - I haven't found them particularly instructive, but if you guys think there might be something there, I can turn that on again while I put the system under synthetic load tonight.
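
For reference, the heap diffs are just the standard memwatch-next HeapDiff usage, roughly:

var memwatch = require('memwatch-next');

var hd = new memwatch.HeapDiff();
// ...drive the synthetic load here...
var diff = hd.end();
// diff.change.details lists growth per constructor, which is where a leak would show up
console.log(JSON.stringify(diff.change.details, null, 2));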

I've also just gone through the code and gone crazy setting unused vars to undefined - although it just doesn't add up to 50K per second. We'll see. We've been down this path for a month now and I wish I could point to something and say "this improved things" but it's basically been the same no matter what we've tried. I even rewrote the code to remove moment-timezone when I read that it's had memory leaks in the past.

flovilmart commented 7 years ago

Memory profiling in node is hard, and leaks/cycles can cripple memory usage. On my side, we're deploying on kubernetes across multiple clusters and regions. We notice memory growth only in our east cluster, which is the one that processes our periodic jobs. It's also the one that processes the most requests. We don't use redis as cache and have enabled singleSchemaCache with a schemaCacheTTL of 1 minute instead of 5s, which reduced the number of keys stored in the cache for the schema and reduced the number of DB calls for the schema.
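
Roughly, in the server options (option names as they exist in the 2.x configuration; values illustrative):

var ParseServer = require('parse-server').ParseServer;

var api = new ParseServer({
    databaseURI: 'mongodb://localhost:27017/dev',
    appId: 'myAppId',
    masterKey: 'myMasterKey',
    serverURL: 'http://localhost:1337/parse',
    enableSingleSchemaCache: true,   // one shared schema cache instead of one per request
    schemaCacheTTL: 60 * 1000        // 1 minute, in milliseconds, instead of the 5s default
});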

flovilmart commented 7 years ago

@jfalcone I've also opened https://github.com/parse-community/parse-server/pull/3979 so we use an LRU cache, which would let you define a max cache size and TTL that would perhaps better fit your needs.
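
Once that lands, the cache bounds are exposed roughly like this (option names per the PR; values illustrative):

var ParseServer = require('parse-server').ParseServer;

var api = new ParseServer({
    databaseURI: 'mongodb://localhost:27017/dev',
    appId: 'myAppId',
    masterKey: 'myMasterKey',
    serverURL: 'http://localhost:1337/parse',
    cacheMaxSize: 1000,   // cap on cached entries before LRU eviction
    cacheTTL: 30 * 1000   // per-entry TTL in milliseconds
});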

jfalcone commented 7 years ago

As I expected, no amount of resetting vars to null or undefined makes any difference. Since the service has gone live with full load, we're growing memory at 300KB per second. I had to run the numbers twice but that's how bad it is.

We're running on 3 containers with 2 periodic (once per minute) batch jobs so I've had to set a container restart rule to fire at 55% memory. We're hitting that every 4-5 hours with a request rate of about 1 request per second to our "leaky" route, which is the dominant driver of memory growth. We start at about 120MB per container and grow to about 280MB when the restart is triggered. It's overkill for our application, which is also why we're so surprised that the cache and/or gc isn't handling things. And our DB is so small - seriously, if we had persistent JS arrays & objects, we'd be doing that (please don't bring up Redis again - it's a low-budget, small-town system - let's just say that there aren't very many zeros in the budget).

The servicing of the "leaky" route and the batch jobs are load balanced across all 3 containers, but despite that we almost always see greater memory growth in one of the 3 containers - it might just be a defect in NodeChef's load balancer. Like right now I'm seeing 44% memory usage on container 2 but about 25% on containers 1 & 3.

I've attached the memory chart for an 8 hour period - you can see the 2 container restarts as the cliffs. The smaller cliffs are big sequences of (>30MB) garbage collections which randomly happen a couple of hours into a run. The chart aggregates all 3 containers on to one with scale 100 = 512MB, 300 = 1536MB.

Apparently I can do heapdumps on NodeChef so I'll have a look at doing that after the holiday as I will also set up a local server.

Nonetheless, we are limited in terms of what we can do (parameters, adapters, environment) because we are running a hosted ParseServer service on NodeChef.

[Screenshot: aggregated container memory chart, 2017-07-01 21:25]

benishak commented 7 years ago

What is the highest memory usage you reach if you don't restart your node? Because 280MB is OK; you should worry if it goes above 400MB and never comes down again. Are you now using the new LRUCache or InMemoryCache?

The issue could also be caused by other dependencies, not Parse Server.

jfalcone commented 7 years ago

It will climb to 100% on each container without having a container restart rule set up - and the batch jobs will die well before we reach that level.

I don't have control over the ParseServer environment to make those changes to the cache since I'm not building my own ParseServer in a DO or Heroku environment. NodeChef is like Buddy or SashiDo. If you guys have suggestions for how NodeChef should be doing their hosting, I can pass those on but I don't build the server myself.

As I said, it is a small application with a low request rate and this was the last situation we expected to be in. So moving to bare metal and building our environment from scratch was not part of our plan. Because we've gone live on NodeChef with "heavy" load (relatively speaking) for the next 60 days (yes, it's connected to summer tourism), we no longer have the option of moving platform anyway (I'm old enough to know that would be insane since we have a perfectly functioning system despite the memory leaks).

What blows me away is how we could be eating memory so quickly when our whole database is just about 100K bytes worth.

flovilmart commented 7 years ago

If only we could get heap dumps, I would work on a fix ASAP and make sure everything is nice and neat for you. It is likely to be something that can be quickly fixed. I don't notice such heap growth over short periods of time on any of the servers that we're running, though I still see an increased memory usage that will require more investigation on my side. We've had issues with different versions of nodejs that would yield more leaks than others.

In the meantime, the best thing for you, if you're reaching out for help and not just complaining, would be to do what we suggested and gather heap dumps from a synthetic load in a local environment. It was suggested 2 days ago, and with that info I would have been able to provide a fix.

Also, as you mention, you're running on NodeChef, and I'm not sure they use the official parse-server nor that it's up to date.

benishak commented 7 years ago

I'm a bit confused about whether you are using NodeChef as if it were Parse.com, so basically you have an API hostname and you only deploy your Cloud Code somewhere; in that case you actually shouldn't have to be concerned about the memory usage. I know this doesn't sound helpful, and I'm sorry, but I couldn't fully understand how NodeChef manages Parse Server and deployment. Do they give everyone a container with Parse Server, or do they host one Parse Server that scales and supports multiple apps somehow, with every client getting its own API key/secret like Parse.com?

Try to figure out which version they are using; it could be an issue in an older version of Parse Server.

Are you able to modify the Parse Server configuration by changing the attributes and setting an adapter or something?

It could also be that some dependencies you are using have memory leaks; can you list these dependencies?

jfalcone commented 7 years ago

To answer all the questions...

Yes, we are using NodeChef like Parse.com except that NodeChef allocates separate containers for each customer to run Parse Server instances which they (NodeChef) manage - so it is technically managed Parse hosting. Those Parse instances have memory limits and if we hit the memory limit, the task or job is killed. So the memory leak does impact us rather severely.

As the Parse Server hosting is managed by NodeChef, we cannot change the attributes and set up adapters, so the whole redis thing is a non-starter. In return, we don't have to worry about building and deploying the server, which for our purposes makes the project economically viable.

We are running on ParseServer 2.4.2 but NodeChef just told me that they are running 2.4.2 on node.js 4.4.7 which I know is a no-no since 4.6 is the minimum for 2.3.3 and above. And I have read that folks have had better memory management with node.js 4.5 and above.

In the code that has the most leakage, we have very few dependencies. We use just a little bit of underscore (_.partial), body-parser, and basic-auth. I said it was simple! I rewrote the code to remove momentjs.

Finally, I have been told that I can run heapdump on NodeChef so when the system goes quiescent late tonight I'll take a heapdump, run the heavy synthetic load, and take another heapdump.

flovilmart commented 7 years ago

I also know that they use a fork based on the official versions, not the official versions themselves. I've taken heap dumps on my own, with the latest LRU-based cache, which is the default in 2.5.1, and the big chunk of the heap growth is contained in that cache, which self-prunes when reaching the maxSize of 10000 items set by default. This default may be too high for you.

Also I’ve noticed that http responses may stay longer than expected in the heap, which can lead to more strings than expected being cached.

That being said, our system is quite stable with 500MB of RAM per container on kubernetes, which is roughly equivalent to a micro instance on AWS. The tricky part in hosting Parse Server is not the server but the DB.

jfalcone commented 7 years ago

I finally figured out how to take heap dumps on NodeChef so I can do a test run tonight. They are about 50MB apiece, so what would be the best way to send them to you?

flovilmart commented 7 years ago

You should probably debug them yourself and come back with the findings, they contain all the private keys, application ID, master keys etc...

jfalcone commented 7 years ago

We'll have another opportunity to do a test run late tonight so I'll have a look.

jfalcone commented 7 years ago

Any particular objects I should be on the lookout for?

flovilmart commented 7 years ago

I'm not sure, anything that significantly increases the heap and never gets released.

jfalcone commented 7 years ago

So I've taken heap dumps at various times as memory usage nearly doubled in some cases and they vary in size by only about 5%. Sometimes snapshot 2 (with larger reported memory usage) is actually smaller than snapshot 1. There is no obvious spike in # of objects, sizes, etc. In other words, the heap doesn't grow but memory usage still grows.

That doesn't surprise us - our code is very simple, we aggressively null out vars after they are no longer used, and there are no obvious mistakes that would cause a memory leak. More importantly we've rewritten the code several times in an effort to expose or isolate the leak to no effect.

Unfortunately, I believe that what is growing is not in the snapshots because "Only reachable objects are included in snapshots." This belief may be confirmed by the observation that we've had periodic garbage collections that have reclaimed 50MB or more of memory after a container has been running for an hour or two. And that supports the belief by some that the V8 garbage collector "falls behind" and only sometimes catches up on some containers.
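
One way to sanity-check the "GC falls behind" theory (not something we have actually run, just a sketch) would be to start node with --expose-gc and force a full collection periodically, then watch whether RSS actually comes back down:

// start the process with: node --expose-gc main.js
if (global.gc) {
    setInterval(function() {
        global.gc();   // force a full collection
        var mem = process.memoryUsage();
        console.log('after forced gc - rss:', mem.rss, 'heapUsed:', mem.heapUsed);
    }, 60 * 1000);
}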

We had hoped that adding another container to spread the load would give the GC more time to catch up but it is all kind of silly given how low our request rate is (approx 1 per second).

There was also a thread somewhere about GC issues with the strings in console.log so we also reduced the amount of logging we do - also with no apparent impact.

The folks at NodeChef have made a number of improvements to make life easier for me this summer (e.g. per container memory monitoring rules) and they apparently will support an upgrade to nodejs 6.x soon as well.

jfalcone commented 7 years ago

Whoa... I did some tinkering and at least my code thinks that we're running parse-server 2.4.2 on nodejs 4.3. If that is the case, God only knows what's going on. Kind of scary.

flovilmart commented 7 years ago

At least it doesn't seem there is a systemic leak of objects/closures at runtime, nor retain cycles that would prevent proper memory reclamation in normal use. I'm not sure why the overall process memory climbs as it does.

dpoetzsch commented 7 years ago

This looks a lot like #3576.

We were doing fine with node 6.10.0, but now Heroku strongly recommended upgrading to 6.11.1 because of a "high severity remote Denial of Service (DoS) Constant Hashtable Seeds vulnerability in Node.js versions 4.x through 8.x" which has been patched in the new version.

flovilmart commented 7 years ago

@jfalcone what should we do with that issue? I'm not sure the investigation led to any substantial issue on parse-server's side. Did you enable NODE_ENV=production, as expressjs has multiple optimisations?
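
(For anyone checking: Express reports this via app.get('env'), which falls back to 'development' when NODE_ENV is unset; a quick sanity check looks like this.)

var express = require('express');
var app = express();
console.log('express env:', app.get('env'), '/ NODE_ENV:', process.env.NODE_ENV || '(unset)');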

jfalcone commented 7 years ago

NodeChef is now allowing us to change the NodeJS version, so I'm going to change to the NodeJS version that is most compatible with 2.4.2 and has the security fixes: 6.11.1.

If that does not change the memory leakage behavior, then it's back to Plan C or D, which is to create a local test Parse Server and try out many combinations of settings to isolate the problem. So right now I would not make assumptions as to where this problem originates. But that would have to wait until late August / early September due to other commitments.

jfalcone commented 7 years ago

Upgrading to NodeJS 6.11.1 caused an error so I could only upgrade to 4.8.4. We will see what happens.

Error did have parse-server-gcs-adapter in the path...

App1_xxxxxx
module.js:597
    return process.dlopen(module, path._makeLong(filename));
           ^
Error: Module version mismatch. Expected 48, got 46.
    at Error (native)
    at Object.Module._extensions..node (module.js:597:18)
    at Module.load (module.js:487:32)
    at tryModuleLoad (module.js:446:12)
    at Function.Module._load (module.js:438:3)
    at Module.require (module.js:497:17)
    at require (internal/module.js:20:19)
    at Object.<anonymous> (/nc_modules/parse-server-gcs-adapter/node_modules/gcloud/node_modules/grpc/src/node/src/grpc_extension.js:38:15)
    at Module._compile (module.js:570:32)
    at Object.Module._extensions..js (module.js:579:10)

flovilmart commented 7 years ago

You just need to clean up your node_modules so you can perform the install correctly.

jfalcone commented 7 years ago

Thanks - I'm still going to run a day on 4.8.4 just to have the numbers - tomorrow I'll move to 6.11.1

jfalcone commented 7 years ago

We ran a day of 2.4.2 with NodeJS 4.8.4 and a day with 6.11.1 and here are the memory usage graphs for about 12 hours each day with similar loads...

[Screenshot: memory chart, 2017-07-25 12:15]
The above graph is ParseServer 2.4.2 with NodeJS 4.8.4 - each sawtooth in the graph is a container restart - there were 6 during the period due to memory reaching 66% of 512MB per container.

[Screenshot: memory chart, 2017-07-25 23:24]
The above graph is ParseServer 2.4.2 with NodeJS 6.11.1 - no sawteeth, no appreciable memory growth, and no container restarts triggered by memory usage. It appears that even when memory usage did grow, garbage collection "caught up" and reclaimed the memory. In fact, after almost 12 hours, each container is running at about 16-18% of 512MB, which is about where they started. It's been mentioned that 256MB is plenty for running ParseServer, and that would be enough for us if it weren't for our batch jobs. Even with the batch jobs, I'd consider dropping to 256MB for the off-season load.

So the memory leak is - for all intents and purposes - gone.

We do, however, see another anomalous behavior that was mentioned by another user - CPU usage does increase over time. Someone theorized that as memory fragments over time, the garbage collector gets busier to maintain order, as it were. Still, the numbers are very low for our application, so it's a non-issue and not a ParseServer issue.

The NodeChef guys put in extra work today to get us on to 6.11.1 so we have to thank them for that.

Since 6.11.1 is the NodeJS LTS release, this is a great result.

jfalcone commented 7 years ago

BTW the suggestion to "null out" variables makes no difference. We just removed all the blah = null / undefined; statements and it made no difference to memory usage. It never did in all the testing we did. It was never about our code, but unfortunately, until NodeChef allowed us to migrate to newer versions of NodeJS, it was the only avenue we had.

Likewise, all of the suggestions about coding practices were pretty much wishful thinking. Those of us old enough to have done assembly language programming (40 years ago for me) know that ultimately, folks, it's the bits - there is no magic in high-level languages. If you lose track of the bits, then stuff happens. In the old days, we had to do our own memory management and you never lost track of the bits. Whether you are using Promises, Contracts, Callbacks, whatever, it all pretty much compiles down to the same stuff and gets interpreted pretty much the same way (please - we're not here to have a religious argument about why x language is better than y language). And if the compiled-and-interpreted framework (read NodeJS & V8) loses track of the bits, then you are pretty much up the creek without a paddle. And that is what happened.

What I'm saying is that those who said Promises would fix this problem - well, that's just a denial of what happens at the machine level. Machines don't execute Promises. That Promise had to make its way through ParseServer and NodeJS and V8 down eventually to a bare machine and that's where it was doomed. Trusting the code underneath your code gets you what you paid for it - well, you all know the adage Trust No One. Perhaps Trust No Library. Trust No Module. Trust No NodeJS.

Sherlock Holmes said "when you have eliminated the impossible, whatever remains, however improbable, must be the truth." That's what we got here folks.

For some reason, earlier versions of NodeJS (4.8.4 and below) and earlier versions of ParseServer (2.4.2 and below) just don't mix with respect to having more than a little bit of asynchrony. It isn't worth investigating why, but the proof is there: we didn't change a line of our code when migrating to NodeJS 6.11.1 and we now have correct memory behavior.

We're now removing all of the crap we added to try to fix the problem which will return the code to the way it was 2 months ago when it was functionally correct but leaking memory. My hypothesis is that there will be no memory leak with that code either because the problem was with the interaction of NodeJS 4.xx and ParseServer. I'll let you guys know what happens.

flovilmart commented 7 years ago

glad it's all good!

On promises though, it is highly possible that promises provide better memory management than assigned callbacks, depending on the way the callbacks are managed and their lifecycle.

jfalcone commented 7 years ago

The bottom line of our analysis is that it made no difference whether you used callbacks or Promises in terms of memory management. Our heap dumps showed no differences in referenced memory usage. And of course the garbage collector has no idea what you are using.