tbadlov closed this issue 9 months ago.
It looks like Node issue #28420 is impacting the version of Node you are using. I suspect it could be the root cause.
@DominicKramer node version upgrade didn't resolve the issue.
We are observing a similar issue on the upgraded Node version. The memory usage gradually climbs until an OOM kill when the debug agent is enabled. After it was disabled, memory usage is stable.
Running on GKE (Google Kubernetes Engine) pod
@google-cloud/debug-agent version: 4.2.2
Thank you for the update. Have you seen similar problems on GCP services other than GKE?
That's the only GCP service we use. We have all our services running under kubernetes
Ours is on AppEngine
@legopin and @tbadlov thanks for the update.
If anyone has a repro gist or repository we can run that leaks this way, that would help us get to the bottom of this a little more quickly. It helps to confirm that we're solving the problem you're hitting.
@soldair Here is a reproducible code snippet: This was run in GKE as a single pod, continued to observe memory leak even when no requests were received.
index.js
require("@google-cloud/debug-agent").start({
projectId: 'ID',
allowExpressions: true,
serviceContext: {
service: 'debug-leak',
},
});
const Koa = require("koa");
const app = new Koa();
// response
app.use(ctx => {
ctx.body = "Hello Koa";
});
app.listen(3000, ()=>{console.log('server started')});
Dockerfile
FROM node:12.16.0
COPY package.json package-lock.json /tmp/
RUN mkdir /opt/app \
&& mv /tmp/package.json /tmp/package-lock.json /opt/app \
&& cd /opt/app && npm ci
WORKDIR /opt/app
COPY src /opt/app/src
EXPOSE 3000
ENV NODE_ENV production
CMD ["node", "src"]
Hello, are there any updates on this issue? Were you able to reproduce the issue?
We are experiencing this as well and had to shut down the debugger. The image below shows the same app deployed to two clusters, one with the debugger on and one with the debugger off.
The debugger is a big piece of our workflow so this is pretty dramatic :)
I just tested the same code with 12.16 and got the same results. The library in its current state is not usable. How is this not a P0 on an LTS version of Node!
@kilianc If I understand correctly, this issue isn't happening on 10.x?
@bcoe Can't confirm, we're running on 12.x. It's very easy to repro with the example provided (just changing the FROM to 10.x). How can I help?
Are there any updates on the issue? We are running on GCE: memory leaks occur whenever we have any logpoints. Within 2 hours our VM exhausts its memory with only one logpoint and dies. Only a stop and start helps (SSH doesn't work either).
I know that this is reported well enough. Just wanted to pitch in that we are experiencing the same in our GAE Flex in all services that utilise this library. We've tested this by removing npm libraries one-by-one and stress testing our application to see the memory consumption. Only the versions with the debug-agent are affected with this leak.
I think it is also important to note that the fewer log lines there are, the slower the memory leak. So this seems to be somehow related to logging.
I would even say that this is a critical issue as it makes it impossible to utilise this library.
Our services on GKE that use this library seem to be affected by this issue as well. It appears to be a problem on any Node version from 12.11.0 onwards with debug agent 5.1.0.
Deployed a change to one of our services today. That change was to drop @google-cloud/debug-agent. We went from erroring with an OOM after every ~10-20 requests to having done a million requests without any OOM. Definitely an issue with this package or one of its deps.
I'm looking into this issue. I've been able to reproduce a memory leak on GKE using the snippet from above, though I'm not convinced that I've covered the problematic codepath as the leak is slow (~5MB/hour) and seems to plateau eventually.
Unfortunately, I am currently unable to track down the actual memory leak when running it locally with node 12.18.3 and using the memory allocation timeline. I'll continue to investigate.
@mctavish Our particular app loads and executes JS dynamically on each request. I think that consumed memory very quickly. You might try recreating that dynamic load of additional JS to increase the memory burden and demonstrate the leak more quickly.
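For example, a variation of the earlier Koa repro along these lines might apply more memory pressure per request. This is only a rough sketch: the use of Node's vm module and the generated script body are illustrative stand-ins, not our actual service code.
require('@google-cloud/debug-agent').start({
  serviceContext: { service: 'debug-leak' },
});

const vm = require('vm');
const Koa = require('koa');

const app = new Koa();

app.use(ctx => {
  // Compile and run a fresh script on every request, standing in for JS
  // that would be downloaded from a CDN and executed dynamically.
  const source = `(() => {
    const data = [];
    for (let i = 0; i < 10000; i++) data.push({ i, s: 'x'.repeat(64) });
    return data.length;
  })()`;
  const script = new vm.Script(source, { filename: 'dynamic-' + Date.now() + '.js' });
  ctx.body = 'Hello Koa, dynamic script result: ' + script.runInNewContext({});
});

app.listen(3000, () => console.log('server started'));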
@mctavish the charts that I posted above are from a staging service without traffic (<10rps)
I can confirm that we are seeing this exact issue in production. We were able to turn off the cloud debugger and resolve the memory leak we detected. Based on our heapdumps we suspect the memory leak has something to do with how access_tokens are stored.
I can also confirm that this caused a memory leak in our production environment. We missed it in staging because of how often we deploy. This library is effectively unusable on Node 12.x (< 12.16).
I was able to reproduce the memory leak on both GAE standard and a locally deployed application. The key to reproducing it is to have an active logpoint set at a place that is reached periodically. According to the local application's heap usage, every time the logpoint is triggered, memory usage increases by around 300 KB. That is, if each request triggers one logpoint, memory usage will reach 256 MB (the default memory limit of a GAE standard instance) before serving 1000 requests.
Further investigation shows that the "memory leak" is not caused by code in the cloud debugger; rather, it is caused by the fact that the V8 inspector session stores the context for every breakpoint hit (the cloud debugger uses V8 inspector breakpoints to implement snapshots/logpoints).
This is verified in my local environment, where an application still suffers from the same "memory leak" when it does not have the cloud debugger but does have the V8 inspector setting a breakpoint at a busy place.
Storing every breakpoint-hit context might be a feature provided by the V8 inspector, but for the cloud debugger's usage it is not needed, because the cloud debugger no longer needs the context once it has caught the paused event and taken the snapshot / evaluated the logpoint. So to fix the memory leak, the cloud debugger should clean up those contexts. However, according to the V8 inspector documentation, there seems to be no quick way to achieve this.
One possible way for the cloud debugger to fix this problem is to monitor the number of breakpoint hits and, when a certain threshold is reached, close the current V8 inspector session, re-connect, and re-set all the existing snapshots/logpoints. (I have verified locally that closing the session clears all the breakpoint-hit contexts and releases the memory, but it also clears all previously set breakpoints.) The only concern here is the performance hit of such a reset (which needs to be measured).
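For illustration, the reset idea described above might look roughly like this using Node's built-in inspector module. This is a sketch, not the debug agent's actual code; RESET_THRESHOLD, registerBreakpoints, and the breakpoint location are placeholders.
const inspector = require('inspector');

const RESET_THRESHOLD = 30; // assumed value; the real agent may differ
let session;
let hitCount = 0;

function registerBreakpoints(sess) {
  // Re-set whatever snapshots/logpoints the agent is tracking; this single
  // breakpoint location is just a placeholder.
  sess.post('Debugger.setBreakpointByUrl', {
    lineNumber: 10,
    url: 'file:///opt/app/src/index.js',
  }, () => {});
}

function connect() {
  session = new inspector.Session();
  session.connect();
  session.post('Debugger.enable', () => registerBreakpoints(session));

  session.on('Debugger.paused', () => {
    // ...take the snapshot / evaluate the logpoint here, then resume...
    session.post('Debugger.resume');

    // The inspector keeps per-pause state alive for the lifetime of the
    // session, so after enough hits we drop the session entirely and
    // reconnect, which frees that state and re-sets the breakpoints.
    if (++hitCount >= RESET_THRESHOLD) {
      hitCount = 0;
      session.disconnect();
      connect();
    }
  });
}

connect();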
@Louis-Ye in my case I had no active logpoints nor breakpoints, unless they were not visible in the UI and cached in the system.
Thanks for looking into this.
A bug has been opened against the V8 engine about the related issue: https://bugs.chromium.org/p/chromium/issues/detail?id=1208482
Just to echo the above comment, we're also getting this issue on Cloud Run using Node 14, with no logpoints/breakpoints ever being set. After removing the debug agent, memory usage is constant.
Hi @kilianc and @rhodgkins, I wasn't able to reproduce the memory leak without setting active breakpoints. The fix (for the memory leak when there are active breakpoints) was just merged in; would you be able to try the latest fix to see if it helps solve your problem? If not, please re-open this issue. It would also be much appreciated if you could provide more environment information (e.g., Node.js version/V8 version, debugger agent version, deployment environment, the way you measure memory usage/detect the memory leak, the speed/rate of the memory leak, etc.).
Thanks @Louis-Ye! When this makes it into a release, I can also try it out on our service that was exhibiting this problem.
@Louis-Ye I don't have access to this stack anymore since this was a long time ago. The only thing worth mentioning is that all our images were alpine based http services using Koa.
@Louis-Ye I can test it out once this is released.
We are running on GKE with Node.js 12.16.0. The application also uses the Koa framework. We also experienced the memory leak issue in the past without active breakpoints.
@Louis-Ye I just tried this out in our production setup. The service immediately started running out of memory after only ~45 requests (without the debug module, it never gives a memory warning).
What I cannot work out is if it just isn't releasing memory soon enough (i.e. the threshold is too high), or if it just doesn't work. We're currently on Node 12.
Hi @somewhatabstract, just to confirm: you see the memory warning after setting an active logpoint, and the version you were trying is 5.2.1 or later, right?
What is the memory limit for your environment? The default reset threshold is 30 hits. In my local test environment, this will only yield several MBs before the reset happens. Alternatively, you can also specify your own threshold upon starting up the agent:
require('@google-cloud/debug-agent').start({resetV8DebuggerThreshold: YOUR_THRESHOLD});
If reducing the threshold still doesn't fix your problem, would you mind sharing more details about your application's setup? I would like to see if I can reproduce it.
Hi @somewhatabstract, just to confirm: you see the memory warning after setting an active logpoint, and the version you were trying is 5.2.1 or later, right?
Hey, @Louis-Ye! Thanks for your response.
I can confirm that I updated to 5.2.1. However, I did not set any active logpoint. The only change was to load and start the Debug Agent, then wait as requests were handled.
What is the memory limit for your environment?
We are using an F4 instance, so our soft limit is 1024MB.
The default reset threshold is 30 hits. In my local test environment, this will only yield several MBs before the reset happens. Alternatively, you can also specify your own threshold upon starting up the agent:
require('@google-cloud/debug-agent').start({resetV8DebuggerThreshold: YOUR_THRESHOLD});
I did not realize this setting was there; this may well help. Read on for more info.
If reducing the threshold still doesn't fix your problem, would you mind sharing more details about your application's setup? I would like to see if I can reproduce it.
Our use case is perhaps unique. Our service will dynamically download and execute code, using an in-memory cache that we purge when reaching a certain memory cap. It could be that our own use of memory and the usage of the Debug Agent is causing us to hit our caps. If I am able to make time, I will try deploying a version without the caching to see if Debug Agent, by default, just handles things. If that is the case, then we probably just need to tweak the limits to ensure they can co-exist nicely.
Though this isn't the only service we have that has exhibited the memory leak (even without setting an active logpoint), this is the one that exhibits it most easily, we've found (the leak always goes away if we never load and initialize Debug Agent). You can see the main code of the service here: https://github.com/Khan/render-gateway. Our actual service configures this code for our specific circumstances to take production traffic in a secure manner. I believe I could make a service that would do the same but without all the bells and whistles we have for determining code versions and such if you wanted code that could be used to investigate this with your own deployment.
If you have any further questions or requests, please reach out. We would really like to resolve this so that we can use Debug Agent for a number of our Node-based services.
@Louis-Ye I know that you have a patch out that addresses a leak with an active logpoint, but as stated by multiple people, just turning the cloud-debugger on will result in memory leaking. I hope we can put together a minimal repro case for you!
Hi, thanks everyone for providing the information! For the memory leak without active breakpoints, I'm not sure what the cause is, as it is hard to reproduce on my side, but I have a feeling that this is still about V8. So I'm currently cooking up another PR that makes the cloud debugger only attach to the V8 debugger when there are active breakpoints. If this works (when there are no active breakpoints), then the periodic reset plus this lazy V8 attaching should sufficiently solve the memory leak problem. If the lazy V8 attaching does not work, then we will have to dig in another direction.
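For illustration only, the lazy-attach idea might look roughly like this. It is a sketch, not the agent's actual implementation; activeBreakpoints, addBreakpoint, and removeBreakpoint are hypothetical helpers.
const inspector = require('inspector');

let session = null;
const activeBreakpoints = new Map(); // id -> {url, lineNumber}

// Attach to the V8 inspector only while at least one breakpoint is active.
function ensureAttached() {
  if (session || activeBreakpoints.size === 0) return;
  session = new inspector.Session();
  session.connect();
  session.post('Debugger.enable', () => {
    for (const location of activeBreakpoints.values()) {
      session.post('Debugger.setBreakpointByUrl', location, () => {});
    }
  });
}

// Drop the session as soon as the last breakpoint is removed, releasing any
// per-pause state the inspector has accumulated.
function detachIfIdle() {
  if (session && activeBreakpoints.size === 0) {
    session.disconnect();
    session = null;
  }
}

function addBreakpoint(id, location) {
  activeBreakpoints.set(id, location);
  ensureAttached();
}

function removeBreakpoint(id) {
  activeBreakpoints.delete(id);
  detachIfIdle();
}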
The lazy-v8-attach patch is available in version v5.2.4. Please let me know if the memory leak problem still exists.
@Louis-Ye Thanks! I'll carve out some time this week to try it out and let you know.
@Louis-Ye I tried out version v5.2.5 and the memory leak is still occurring - there does not appear to be any improvement. 😞
@somewhatabstract Thanks for trying the new patch! And sorry to hear that it doesn't fix your problem. I went ahead and cloned https://github.com/Khan/render-gateway to my local machine and started the example application with the cloud debugger enabled, and I did not notice a memory leak there. Can you share how you enable the debugger (i.e., on which line you put require('@google-cloud/debug-agent').start(...)) and what configuration/deployment method you use to create the memory leak situation? Thanks again for providing the information!
Yeah, the example probably doesn't do much of what may cause the issue (I'm not sure it even enables the debug agent).
To use the debug agent, the call to runServer (from import {runServer} from "render-gateway";) must pass options that specify debugAgent: true in the cloudOptions:
runServer({
  cloudOptions: {
    debugAgent: true,
    profiler: true,
  },
  // ...other options...
});
In addition, the renderEnvironment option provides an environment that obtains JS files from our CDN for rendering our frontend. These are downloaded dynamically and executed inside a JSDOM environment.
When we pass debugAgent: false, disabling the Cloud Debug agent, no memory leak occurs. Pass true and it leaks.
The code that the debugAgent option controls is here:
https://github.com/Khan/render-gateway/blob/master/src/shared/setup-stackdriver.js#L15-L18
(StackDriver because that's what it was branded as when the code was first written).
That is invoked here:
https://github.com/Khan/render-gateway/blob/master/src/shared/start-gateway.js#L65-L66
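For context, gating the agent behind a flag like this generally amounts to something along these lines (a generic sketch, not the exact code at the links above):
function maybeStartDebugAgent(cloudOptions = {}) {
  // Start the agent only when the caller opts in via the debugAgent flag.
  if (cloudOptions.debugAgent) {
    require("@google-cloud/debug-agent").start({
      allowExpressions: true, // assumed option; the real service may differ
    });
  }
}

// e.g. during server startup:
// maybeStartDebugAgent({ debugAgent: true, profiler: true });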
Let me know if you need more info. I may be able to carve out some time to give a working example that includes the leak, but I'll have to see what my schedule looks like for that.
@somewhatabstract Thanks for the information! I modified the runServer call in https://github.com/Khan/render-gateway/blob/ee04f6ddd49f68e97e498b4b1bb5940df3a17675/examples/simple/run.js with the parameter you provided (cloudOptions: {debugAgent: true}), then ran yarn start to start the application on a GCE instance and let it run for a day. The memory usage I saw increased from 45 MB to 50 MB over the day. That extra 5 MB is too small to tell whether there is a memory leak or not. Do you happen to remember the rate of your leak?
I may be able to carve out some time to give a working example that includes the leak, but I'll have to see what my schedule looks like for that.
That would be great! We really want to solve this issue, and the largest barrier we have right now is reproducing it.
Hi,
I have been observing the same issue with the debug agent and the Spanner library on an App Engine service.
Running on App Engine Flex with the Node.js runtime, I have a minimal working app with only liveness/readiness checks that run a SELECT 1 query against a Spanner instance (called automatically several times per minute by App Engine's internal liveness checks).
The issue started after I accidentally relaxed the Node version set in package.json from ^10 to >=10.
The memory usage increases at approximately the following speed:
I observed the same for all sorts of combinations of the Spanner and debug agent versions, including the most recent ones (5.12 and 5.2.7, respectively), as I tried playing with the versions before I realized the Node version was the cause.
You can find a repo with the code used to reproduce this issue at the following link: https://github.com/shinetools/memory-leak-test
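For anyone trying to reproduce this, the health-check endpoint described above is roughly of this shape. This is only a sketch with placeholder instance/database names; the actual repro is in the linked repo.
const express = require('express');
const {Spanner} = require('@google-cloud/spanner');

const spanner = new Spanner();
// Placeholder names; substitute a real instance and database.
const database = spanner.instance('my-instance').database('my-database');

const app = express();

// Liveness check: run a trivial query against Spanner.
app.get('/liveness_check', async (req, res) => {
  try {
    await database.run({sql: 'SELECT 1'});
    res.status(200).send('ok');
  } catch (err) {
    res.status(500).send(String(err));
  }
});

app.listen(process.env.PORT || 8080);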
Hi,
We faced the same issue recently: our pods kept crashing in a loop because memory usage constantly increased until the pods reached their limits.
We updated our implementation to only activate Cloud Profiler and Cloud Trace, deactivating Cloud Debugger, and the issue is gone.
Sorry for the delay @Louis-Ye. I haven't forgotten, I just haven't had an opportunity to look at this yet.
As for the rate of the leak, it was quick. Like ~25-40 requests. However, each request was using the JSDOM environment, and loading a lot of JavaScript. I imagine the examples just aren't loading an awful lot on each request.
Can confirm: we were regularly seeing 503 errors on our Cloud Run instances with @google-cloud/debug-agent enabled. After disabling it, memory usage was constant, which solved the issue.
We're running node:16.6-alpine3.14 containers on Cloud Run with @google-cloud/debug-agent version 5.2.8.
We are facing this issue as well with the GoogleCloudPlatform/microservices-demo sample application on node:16-alpine. The currencyservice and the paymentservice are both Node.js applications, and the pods running them keep increasing in memory for around 10 hours until they get killed and are re-scheduled.
The Cloud Profiler shows that the request-retry package used by google-cloud/common is what's taking up a lot of the memory.
Here is a heap distribution between two instances where the debugger was enabled and disabled.
Hey @Shabirmean, that's great news that we have an internal reproduction, perhaps you could work with @Louis-Ye and we could squash this once and for all?
Environment details
@google-cloud/debug-agent version: 4.2.1
Steps to reproduce
We found that the package causes a memory leak. This image shows our application memory with the package and without it.
We are running on Google App Engine with 4 nodes. Any help is appreciated.
Thanks!