googleapis / nodejs-spanner

Node.js client for Google Cloud Spanner: the world’s first fully managed relational database service to offer both strong consistency and horizontal scalability.
https://cloud.google.com/spanner/
Apache License 2.0

nodejs-spanner-memory-leak #1572

Closed asthamohta closed 1 year ago

asthamohta commented 2 years ago

1) Is this a client library issue or a product issue?

As soon as a (read/write) transaction is initiated, it looks like some memory is leaked, or at least not reclaimed fast enough. Eventually closing the Database releases the memory, but obviously this defeats the purpose of re-using a pool of sessions.

Steps to reproduce

https://github.com/flovouin/nodejs-spanner-memory-leak
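
The pattern at the core of the reproduction repository is roughly the following (project, instance, database and table names are placeholders; this is a sketch of the reported scenario, not the exact code from the linked repository):

```ts
import {Spanner} from '@google-cloud/spanner';

// A single database handle (and therefore a single session pool) is shared
// across invocations, as is typical for a Cloud Function.
const spanner = new Spanner({projectId: 'my-project'}); // placeholder project
const database = spanner.instance('my-instance').database('my-database');

// Called once per incoming Pub/Sub message.
export async function readInTransaction(): Promise<void> {
  await database.runTransactionAsync(async transaction => {
    // Any read/write transaction appears to be enough to trigger the growth;
    // a read inside the transaction makes it more pronounced.
    await transaction.read('Singers', {keys: ['1'], columns: ['SingerId']});
    await transaction.commit();
  });
  // database.close() is deliberately NOT called here, so sessions are reused
  // across invocations and whatever memory they retain stays allocated.
}
```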

flovouin commented 2 years ago

Hi @asthamohta, thanks for reaching out and creating this issue. I will reply in this thread to the message you left here.

I see you are using nodejs version 5.16.3. Did you face this issue in a previous version as well? Or is it new?

I'm assuming you're referring to the @google-cloud/spanner package version rather than the Node.js version. Our version of Node.js (the runtime) has never changed, but we do upgrade the Spanner package regularly. We have only noticed the issue recently, and that is because we started handling more traffic, which made the leak easier to observe. We're not certain the issue is linked to version 5.16.3 specifically. However, either of us can easily test that with the code provided in the repository I created, simply by changing the package version.

Is there any specific query such as upsert, read, etc that is more likely in your experience to cause this problem?

We haven't found a clear pattern other than the use of write transactions / sessions, which triggers the issue. However, I did provide a screenshot of the memory usage over time in my GCP support ticket. The following graph shows the memory usage of each Cloud Function when triggering them 10 times a second (the Pub/Sub producer publishes 10 messages / second). We do see some functions' memory usage growing faster than others'. For example, performing a read in the transaction seems to use more memory than simply rolling back the transaction without any operations in it.

Feel free to reach out if you need any other information.

Screenshot 2022-02-02 at 10 43 36
asthamohta commented 2 years ago

Thanks @flovouin. I followed the steps in your post and I can now see a graph similar to the one you posted, but I did not get a memory-limit-exceeded error. Instead, a lot of requests are failing with:

severity: "WARNING" textPayload: "The request was aborted because there was no available instance. Additional troubleshooting documentation can be found at: https://cloud.google.com/functions/docs/troubleshooting#scalability"

which boils down to: "The request was aborted because there was no available instance" (severity=WARNING, response code 429), meaning Cloud Functions cannot scale because of the max-instances limit set during configuration.

flovouin commented 2 years ago

I followed the steps in your post and I can now see a graph similar to the one you posted, but I did not get a memory-limit-exceeded error

At least it's good to hear that you're able to reproduce the memory behaviour. 🙂 I'm not sure why you're not getting the OOM error. Does the memory go back to its initial level abruptly at some point? This could mean that the Cloud Function instance is restarted for some reason before it gets to the memory limit. This can be the expected behaviour of Cloud Functions (see the orange line at around 10:03 in my screenshot). Anyway, the OOM error we're getting is a symptom rather than the cause of the issue, so it should not prevent us from continuing to investigate.

a lot of requests are failing with [...]

In my example I'm limiting the number of CF instances to 1 on purpose, to make it easier to observe the increase in memory of a single instance. Getting a 429 can be normal and indicates that the CF can't keep up with the rate of input messages in the Pub/Sub topic. When running the example on my side, the CF could usually keep up (execution time would be anywhere between 5 and 80 ms, and an execution is triggered every 100 ms). But a few slower executions could cause that (especially during warm-up), so I wouldn't be too worried about those warnings!

asthamohta commented 2 years ago

Thanks @flovouin. So we are able to see a memory leak in readInTransaction. @olavloite will be looking into it.

image
olavloite commented 2 years ago

@flovouin First of all many thanks for one of the most extensive reproduction cases I have ever seen!

I've been running your reproduction case for a little while now, and I've tried a couple of different things. What I see so far is:

  1. Running everything as you have specified does indeed cause memory errors. Specifically, I'm seeing occasional "Function invocation was interrupted. Error: memory limit exceeded." log entries.
  2. Around the time of the above-mentioned error the same function will also start returning 429 (too many requests).
  3. The memory usage of the function drops and the function is once again working.

I don't think the function is being re-deployed. My impression is that it is a major garbage collection. The screenshot below shows the behavior that I saw using the exact same settings as you have created in your project.

Screenshot 2022-02-11 20 50 43

My suspicion is that this is not a memory leak, but quite simply a resource exhaustion problem:

  1. The Cloud Functions are being triggered 10 times per second. It seems that they are barely able to keep up with the rate at which they are being triggered.
  2. Each invocation requires new memory allocations, but as there is never any pause in the invocations, there is also never any real time for (major) garbage collections.
  3. In the end all available memory has been consumed and a major garbage collection is required.

To verify the above I changed the configuration of all Cloud Functions so that they allow a maximum of 3 instances instead of 1. This should give each function more breathing room to reclaim memory after an invocation. If there is a memory leak, we should still see memory consumption go up, albeit at a slower pace. That, however, does not seem to be the case:

Screenshot 2022-02-11 21 54 42

I changed the Max Instances setting to 3 for the functions around 9:30pm in the screenshot. The memory usage now seems to be stable. I think that you will see a similar behavior if you keep Max Instances at 1, but reduce the number of requests per second.

As you can see in the screenshot below the number of active instances goes straight to 3 once it is allowed. At approx 10pm I stopped the publisher application, and after a couple of minutes the number of instances went down to zero.

Active instances (1)

Do you have a similar limitation in your production setup? I noticed that you mentioned that you have been seeing more traffic lately. Would it be possible to try to increase the max instances setting in your production environment and see if that helps?

flovouin commented 2 years ago

Thanks for your reply and for the insights, @olavloite!

Indeed, the term "memory leak" might not be appropriate as it seems that the memory can be reclaimed. However my concern is that this is not done "fast enough".

If you look at the various examples of Cloud Functions, you'll find one named closeTransaction (it should probably be called "close database" instead). What it does is randomly close the database (and hence the session pool) during its execution. Of course, this is not a realistic approach to the problem, but it does seem to solve it. Memory doesn't grow, and the Cloud Function is able to keep up with the rate of input messages. Here's a screenshot of the number of unacked messages in the subscriptions corresponding to the same time frame as my previous message (the focus is on closeTransaction, but you can see that the other Cloud Functions are mostly keeping up with input messages, apart from a few glitches):

Screenshot 2022-02-14 at 07 56 08
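
For reference, the workaround in closeTransaction boils down to something like the sketch below (the probability and identifiers are illustrative, not the exact code from the repository):

```ts
import {Spanner} from '@google-cloud/spanner';

const spanner = new Spanner({projectId: 'my-project'}); // placeholder project
let database = spanner.instance('my-instance').database('my-database');

// Sketch of the closeTransaction / "close database" workaround: after the
// transaction, occasionally tear down the database handle so that the session
// pool (and the memory it retains) is released.
export async function closeTransaction(): Promise<void> {
  await database.runTransactionAsync(async transaction => {
    await transaction.commit();
  });

  if (Math.random() < 0.1) { // illustrative probability
    await database.close();
    // Re-create the handle so the next invocation starts with a fresh pool.
    database = spanner.instance('my-instance').database('my-database');
  }
}
```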

While I understand your point about resources being exhausted, it does seem possible to avoid memory growing indefinitely, even in the (admittedly very specific) setup I'm showing here. So maybe what I'm asking is: is there any configuration parameter (existing or to be introduced) that could help the session pool and/or session objects release memory faster?

About increasing the number of instances, unfortunately this is not possible in our case. We do have many other Cloud Functions for which we let the autoscaling algorithm do its thing, but in this particular case we have to "disable the concurrency" of this Cloud Function. Also, the input throughput is slightly different from the reproduction case: input messages are actually emitted in batches. Every 3 minutes, tens of messages are produced in the input Pub/Sub topic, but then nothing happens for the rest of the 3 minutes. The single instance does seem capable of handling the throughput. In fact, most of the time the number of active instances goes back down to 0:

Screenshot 2022-02-14 at 08 13 44

However if we zoom out, we see that it's (probably) actually the same instance being reused, and that its memory usage is growing over time:

Screenshot 2022-02-14 at 08 14 50

In this production use case, it would seem that the instance has plenty of time to "garbage collect" the memory, but it does not do so, hence my initial request of "finding a way to release memory faster in the Spanner client".

I hope this makes sense. Please tell me if you need additional details!

olavloite commented 2 years ago

@flovouin

closeTransaction (or, as you said, closeDatabase): Yes, this function seems to be able to keep up. That was also the case in my test run. If you look at the different memory usage graphs in my test run, you can see that closeTransaction has a stable memory usage both before and after increasing the MaxInstances configuration. I think this is down to the simple fact that it does less than most of the other functions, and explicitly cleans up memory by calling database.close(), which means that its base memory usage will always be lower than that of many of the other functions.

Regarding your production use case: The batched nature of the production use case probably means that the base memory usage of your function will be higher than in the reproduction case, as there will be more sessions needed to handle the simultaneous requests. The reproduction case uses a relatively constant stream of requests, which means that there will normally not be many requests running in parallel. The batched nature of your production scenario will probably cause a lot of parallel requests to be executed, which means that the session pool will normally contain more sessions than in the reproduction case.
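
As a side note, the size of the session pool can be bounded when the database handle is created. A sketch follows; the values are illustrative only, and capping the pool does not by itself release memory that is already held by existing sessions:

```ts
import {Spanner} from '@google-cloud/spanner';

const spanner = new Spanner({projectId: 'my-project'}); // placeholder project

// Session pool options are passed when creating the database handle.
const database = spanner.instance('my-instance').database('my-database', {
  min: 0,         // allow the pool to shrink when the function is idle
  max: 25,        // upper bound on sessions created for parallel requests
  idlesAfter: 10, // minutes before an idle session is considered for removal
});
```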

I am therefore beginning to wonder whether the problem that you are seeing on production is caused by the function simply needing more memory than has been assigned to it. The default for Cloud Functions is 256MB, but it can be configured to be higher. That would mean that it is not a garbage collection problem, but an actual out-of-memory problem.

  1. Are you seeing any error message in the log around the time that the memory usage drops (in your graph just before midnight 13 Feb)? When using your reproduction case I saw occasional 'Memory limit exceeded' (but not literally 'Out of memory') errors, and a lot of 429 response codes, hence my assumption that the JavaScript engine at that moment was executing a stop-the-world garbage collection.
  2. Would it be possible to increase the assigned memory for the Cloud Functions instance instead of increasing the max instances? If there is no actual memory leak, but the function just requires more memory, it should reach a plateau somewhere, which could be higher than the current 256MB limit. That would also explain why the JavaScript engine is not reducing the memory footprint, even though it should have the time for it; there would in that case be no memory to reclaim through garbage collection.
flovouin commented 2 years ago

Hi @olavloite,

Regarding closeTransaction, maybe it's worth having another test with the function doing a bit more work, and checking that database.close() still works around the problem? In my initial examples, I found that transaction.commit() was enough to observe the memory increase. But I can make a fair comparison by adding the logic of readInTransaction, which seems to be the function that reproduces the issue most consistently.

Regarding production:

[...] as there will be more sessions needed to handle the simultaneous requests.

A single Cloud Function instance processes messages/calls one at a time, and we clearly see in the logs that executionIds do not overlap (executions aren't concurrent). So I'd argue that there's not that much difference between the two usages:

During the processing of a batch of messages (every 3 minutes), you can see that at first the Cloud Function doesn't necessarily need a lot of memory. It's only after the processing of many batches (over a period of several hours) that the memory grows. So I don't think that the batch nature of the job necessarily affects memory usage notably.

In the test example, I accidentally left in a log statement describing the state of the session pool. I found out that a single session was used, which makes sense as the Cloud Function instance does not process more than one message at once. Although we haven't added this debug line in production, I would expect the same observation there.
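
For anyone wanting to reproduce that observation, the debug output can be approximated by peeking at the session pool hanging off the database handle. Note that pool_ is an internal field rather than a public API, so the property names below are assumptions that may differ between client versions:

```ts
import {Database} from '@google-cloud/spanner';

// WARNING: pool_ is an internal field of the Database handle, not a stable
// public API; the property names below may differ between client versions.
function logPoolState(database: Database): void {
  const pool = (database as any).pool_;
  console.log(
    `sessions in pool: ${pool.size}, available: ${pool.available}, borrowed: ${pool.borrowed}`
  );
}
```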

My hypothesis is therefore that a single session would somehow retain a lot of memory over time. This memory is not lost and can be reclaimed, but only when the session is deleted (which occurs in closeTransaction when the entire database/pool is closed).

Now to answer your questions:

  1. Yes, we do see a single "Function invocation was interrupted. Error: memory limit exceeded." error message whenever the memory usage drops. My initial assumption was that this would trigger the termination of the instance and the creation of another, which would explain the drop in memory usage. I do see 429 warnings as well. As far as I understand, those simply notify the user that the number of Cloud Function instances is not able to keep up with the input requests/messages. Those occur even when the Cloud Function is actually processing messages. The Cloud Function autoscaling algorithm is a bit too reactive on that front, and will tend to create a lot of instances in response to a batch of input messages (which it can't when max_instances = 1).
  2. We can definitely increase the memory of the Cloud Function, and I can try that. However, I'm not confident it will solve the problem, for the reasons given in the rest of this message.

EDIT: I added links to the references from my comment below.

flovouin commented 2 years ago

Here are some additional links that might support my assumptions:

https://cloud.google.com/functions/docs/concepts/exec#auto-scaling_and_concurrency

Each instance of a function handles only one concurrent request at a time

https://cloud.google.com/functions/docs/concepts/exec#errors

If your code or any other code you call throws an uncaught exception or crashes the current process, then the function instance might be restarted before handling the next invocation.

Sorry for not referencing those earlier.

olavloite commented 2 years ago

@flovouin Version 6.1.1 included a fix for a memory leak in the session pool. Have you updated your application to that version? If yes, are you still seeing this issue?

flovouin commented 2 years ago

I'm not working on this project anymore so I can't say whether this fixed the issue. Maybe @qbarlas can chime in here? Otherwise I guess we can run the example project against the newest client version and see if we no longer see memory increases.

qbarlas commented 1 year ago

Oops, sorry for the delay, I didn't notice the ping. We're still on 5.x, so I can't tell yet.

olavloite commented 1 year ago

Closing this as it seems to have been mitigated. Please feel free to reopen if that is not the case.