alanzhao closed this issue 1 year ago.
This isn't really enough information to find the problem. Can you provide code that demonstrates this memory leak, preferably without needing to access an external service? Can you provide heap dumps from at least two different points in a process's lifetime that show the memory growth?
Please also check whether you are creating multiple separate client objects. If you are, either change your code to not do that or make sure you are closing them when you are done with them. See #2436, #2402, and #2003 for previous examples of this.
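In other words, the recommended shape is something like the following sketch (the project, instance, and database names here are hypothetical): create the client once at startup, share it across requests, and close it only on shutdown.

```ts
// Sketch of the single-client pattern (hypothetical project/instance/database names):
// create the Spanner client once at startup, share it across requests, and close it
// only during process shutdown.
import { Spanner } from '@google-cloud/spanner'

const spanner = new Spanner({ projectId: 'my-project' })
export const database = spanner.instance('my-instance').database('my-database')

// Per-request code runs queries against the shared instance, e.g. `database.run(query)`.

export async function closeSpannerOnShutdown(): Promise<void> {
  await database.close()
}
```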
async function getValue({ userId }) {
  // `clients` is our Spanner wrapper (see below); `table` is assumed to be defined elsewhere in this module.
  const sql = `
    SELECT *
    FROM ${table}
    WHERE userId = @userId
  `;
  const query = {
    sql,
    params: { userId },
    json: true,
  };
  const [[total]] = await clients.spanner.clientA.run(query);
  return total;
}

module.exports = getValue;
Here is a sample query that we use. We do have a wrapper around Spanner and have 3 clients: clients.spanner.clientA, clients.spanner.clientB, and clients.spanner.clientC.
function allClose() {
  const closePromises = [];
  if (clients.spanner) {
    if (clients.spanner.clientA) {
      closePromises.push(clients.spanner.clientA.close());
    }
    if (clients.spanner.clientB) {
      closePromises.push(clients.spanner.clientB.close());
    }
    if (clients.spanner.clientC) {
      closePromises.push(clients.spanner.clientC.close());
    }
  }
  return Promise.all(closePromises);
}
which is used in the graphql server graceful shutdown here:
return startApp(app, ready, () => allClose());
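For context, wiring a cleanup callback like allClose into process shutdown generally looks something like this (a sketch; onShutdown is a hypothetical helper, not our actual startApp implementation):

```ts
// Sketch: run the cleanup callback once on SIGTERM/SIGINT, then exit.
// `onShutdown` is a hypothetical helper, not the real startApp wrapper.
function onShutdown(cleanup: () => Promise<unknown>): void {
  const handler = async (signal: string) => {
    console.log(`received ${signal}, closing clients...`)
    try {
      await cleanup()
    } finally {
      process.exit(0)
    }
  }
  process.once('SIGTERM', handler)
  process.once('SIGINT', handler)
}

// usage: onShutdown(() => allClose())
```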
The reason we had 3 clients to Spanner is that those 3 clients actually talk to 3 separate Spanner databases. Right now we do not use 2 of them; I can try removing them and upgrading grpc back to the latest version to see if that fixes the issue. Thanks for the tips.
Just three clients shouldn't be an issue. The problem I was talking about comes from creating a large number of clients.
Yes, I am also experiencing memory leaks with a library that uses this one. Newer versions of google-gax seem to cause out-of-memory crashes in Node.js v18 in GKE. Downgrading to 3.5.2 improves things. They have this library pinned to ~1.7.0, so maybe it's related to changes since v1.8?
Do you know how quickly you were leaking memory, in terms of amount of memory leaked per request made or per period of time? An order of magnitude estimate would be enough to help narrow this down.
Update: we have been unable to reproduce a memory leak with our own tests. Please provide a complete reproduction that leaks memory so that we can investigate this further.
I also have this and will try to provide a reproduction when possible/time allows.
I'm coming from the last few years in Go, so IoC/DI frameworks in Node are new to me, and I originally believed my problem came from there, with unintentionally creating multiple clients. I'm still open to the possibility that it's inversify misuse on my part, but I'm more inclined to believe the issue is with grpc-js, as @alanzhao described.
Hi,
I have also found there to be memory leaks originating from this package (I also saw google-gax mentioned in some places in the memory debugger). From what I can tell, this seems to be an issue with certain timeouts in Detached Node / nghttp2_memory entries:
With the following retainers list:
This snapshot is using comparison mode in devtools with 1 HTTP request being made between the 2 snapshots, which in our app does do multiple grpc calls.
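For completeness, an equivalent way to capture the two snapshots outside devtools would be something like this sketch (doGrpcCalls is just a placeholder for whatever our app does per request; the manual GC calls need the --expose-gc flag):

```ts
// Write one heap snapshot before and one after a batch of calls, then diff them
// in the devtools Memory tab. `doGrpcCalls` is a placeholder.
import { writeHeapSnapshot } from 'node:v8'

export async function snapshotAround(doGrpcCalls: () => Promise<void>): Promise<void> {
  global.gc?.() // only available when Node is started with --expose-gc
  console.log('before:', writeHeapSnapshot()) // logs the generated .heapsnapshot filename
  await doGrpcCalls()
  global.gc?.()
  console.log('after:', writeHeapSnapshot())
}
```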
I personally could not create a minimal reproduction app, even with knowledge of this existing codebase. I still felt this would be useful to share, in case it helps track down the issue in the @grpc/grpc-js package.

Thank you for your time,
Sophia~
@SophiaH67 In order to clearly demonstrate a significant memory leak, your test would need to do the following: force a garbage collection before each heap snapshot, by running Node with the --expose-gc flag and calling global.gc() before taking a snapshot. If you do this and you still observe memory retention that grows in proportion to the number of requests made, then please share your test code and I can investigate.
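For example, a test along these lines (a sketch; the target, method path, and credentials are placeholders), run with node --expose-gc, would show whether retained memory actually grows with the number of requests:

```ts
// Sketch of a leak check: make batches of requests against one client, force a GC,
// and record heap usage. The target, method path, and credentials are placeholders.
import { Client, Metadata, credentials } from '@grpc/grpc-js'

const client = new Client('localhost:50051', credentials.createInsecure())

const unaryCall = () =>
  new Promise<void>((resolve, reject) => {
    client.makeUnaryRequest(
      '/your.Service/Healthcheck',
      (arg: Uint8Array) => Buffer.from(arg),
      (arg: Buffer) => arg,
      Buffer.alloc(0),
      new Metadata(),
      (err) => (err ? reject(err) : resolve())
    )
  })

async function heapAfter(requests: number): Promise<number> {
  for (let i = 0; i < requests; i++) {
    await unaryCall()
  }
  global.gc?.() // force a full collection before measuring (requires --expose-gc)
  return process.memoryUsage().heapUsed
}

async function main(): Promise<void> {
  await heapAfter(100) // warm-up: first-request resources get reused later
  const base = await heapAfter(0)
  for (const n of [1000, 2000, 4000]) {
    const used = await heapAfter(n)
    console.log(`after ${n} more requests: retained ${(used - base) / 1e6} MB`)
  }
  client.close()
}

main()
```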
@murgatroid99 Thanks for your time! I should've prefaced this by saying this is my first time debugging memory leaks, so my bad if I miss anything. From what I could tell, the record-snapshot button automatically does a garbage collection, but I cannot find a source for that, so I don't know where I learned it.
After rerunning my tests in our production app with manual gc() calls through devtools before snapshotting, it does still happen.
In a small reproduction setup however, I could not find any problem with this library. Which leads me to think it's probably something in either a different library, or our own code!
Have you also confirmed that the memory keeps growing with multiple successive calls? gRPC creates some resources on the first request that it reuses in later requests, so seeing more memory after the first request than before only confirms that, it doesn't show that anything is wrong.
I have found the issue our team was running into finally!
We were using nestjs's official terminus package for health checks, which created a new channel for every health check attempt.
After doing some digging, it seems that these channels cannot be garbage collected by NodeJS, due to several self-references (in timeouts and other internals). This by itself could also be considered a bug, but I'm not familiar enough with the garbage collector to solve it.
I did make a PR in terminus to cache the channels it opens, which I built and deployed in our app.
PR: https://github.com/nestjs/terminus/pull/2329
Results:
As you can see here, this has totally solved our issue, and explains why I couldn't reproduce it with a minimal setup!
Still thanks to @murgatroid99 for your time and explaining garbage collector stuff to me ❤️
Calling channel.close() should make it eligible for garbage collection. It looks like terminus did not do that when they stopped using the channel, which would explain why it was retained.
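The general pattern is to either reuse one client (and therefore one channel) per target, or close it as soon as you stop using it. For example, something like this sketch (placeholder credentials and targets):

```ts
// Sketch: cache one Client per target instead of constructing a new one per health
// check, and close everything during shutdown so the channels can be collected.
import { Client, credentials } from '@grpc/grpc-js'

const clientCache = new Map<string, Client>()

export function getClient(target: string): Client {
  let client = clientCache.get(target)
  if (!client) {
    client = new Client(target, credentials.createInsecure()) // placeholder credentials
    clientCache.set(target, client)
  }
  return client
}

export function closeAllClients(): void {
  for (const client of clientCache.values()) {
    client.close() // makes the underlying channel eligible for garbage collection
  }
  clientCache.clear()
}
```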
We have seen memory leaks on our services too. The ECS tasks get abruptly killed with 8 RESOURCE_EXHAUSTED: client: xxx. Exceeded rate limit. The grpc-js version is 1.9.0 and the Node version is 18.17.1.
"rate limit" generally refers to a limit on how frequently you can call an API. So, an error that says "Exceeded rate limit" generally refers to excess API call volume, not memory growth.
It was a false alarm. We had upgraded our application from Node 14 to Node 18. The restarts were caused by internal code that produced an unhandled promise rejection, which led to the application's termination. Apparently, terminating the process on unhandled rejections was introduced in Node 15. https://maximorlov.com/node-js-15-is-out-what-does-it-mean-for-you/
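For anyone else hitting this after a Node upgrade: since Node 15, an unhandled promise rejection terminates the process by default. A minimal sketch of the failure mode and a last-resort guard (the real fix is still to catch the rejection at its source):

```ts
// Last-resort handler: log unhandled rejections instead of letting Node 15+ crash
// the process. The proper fix is to catch the rejection where it happens.
process.on('unhandledRejection', (reason) => {
  console.error('Unhandled promise rejection:', reason)
})

// Sketch of the kind of code that triggers it:
async function mightFail(): Promise<void> {
  throw new Error('boom')
}
mightFail() // no await and no .catch() -> unhandledRejection
```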
I've been seeing the same behavior @murgatroid99. Basically my test setup is, in a while (true) loop: create a new client, make a unary request (and await it), then call client.close().
While I'm doing that, I'm logging process.memoryUsage().heapUsed / 1000000 (in MB) every second, and it slowly grows. Maybe some ref somewhere is preventing the garbage collector from collecting something.
If I run this test script with the Node.js option --max-old-space-size=500, then once the heap used reaches ~500 MB I get an OOM error:
FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory
My setup:
- Node v18.18.2
- @grpc/grpc-js 1.9.7
- x86_64
import { Client, Metadata, credentials } from '@grpc/grpc-js'

// some metrics I want to keep track of
const metrics = {
  frequency: 1000, // 1s
  countSinceLastTick: 0,
  countTotal: 0,
  speed: 0,
}

// this reporter will log the metrics every second
const reporter = setInterval(() => {
  process.stdout.clearLine(0)
  console.log(`processed ${metrics.countTotal}`)
  process.stdout.clearLine(0)
  console.log(`speed ${Math.round(metrics.countSinceLastTick)} calls per sec`)
  process.stdout.clearLine(0)
  console.log(`mem usage ${process.memoryUsage().heapUsed / 1000000}MB`)
  metrics.countSinceLastTick = 0
  process.stdout.moveCursor(0, -3)
}, metrics.frequency)

// this call makes a single rpc request, instantiating a new client + closing it at the end
const makeRequest = async () => {
  const sslCreds = credentials.createSsl(
    Buffer.from(process.env.ROOT_CERT, 'utf-8'),
    Buffer.from(process.env.PRIVATE_KEY, 'utf-8'),
    Buffer.from(process.env.CERT, 'utf-8')
  ) // this is just because in my case I use mTLS
  const options = {
    'grpc.keepalive_time_ms': 120000,
    'grpc.http2.min_time_between_pings_ms': 120000,
    'grpc.keepalive_timeout_ms': 20000,
    'grpc.http2.max_pings_without_data': 0,
    'grpc.keepalive_permit_without_calls': 1,
    'grpc.service_config_disable_resolution': 1,
  }
  const grpcClient = new Client(`your.host.com:8700`, sslCreds, options)
  const deserialize = <T>(argument: T) => argument
  const serialize = (argument: Uint8Array) => Buffer.from(argument)
  await new Promise((resolve, reject) => {
    grpcClient.makeUnaryRequest(
      '/your.Service/Healthcheck',
      serialize,
      deserialize,
      Uint8Array.from([]),
      new Metadata(),
      (err, res) => {
        if (err) {
          reject(err)
        } else {
          resolve(res || Uint8Array.from([]))
        }
      }
    )
  })
  grpcClient.close()
}

// main function
const main = async () => {
  while (true) {
    await makeRequest()
    metrics.countSinceLastTick++
    metrics.countTotal++
  }
}

main()
@Thibault2ss I think I fixed that with #2606, according to local testing.
That fix is out in version 1.9.8
Same issue here: grpc-js version 1.9.9, nestjs/microservices 9.4.3, Node version node:20.9-alpine.
@Valeronlol The code in https://github.com/grpc/grpc-node/issues/2448#issuecomment-1782690279 does not show a memory leak in 1.9.9 in my local testing. If you are experiencing a memory leak, please share code that demonstrates it.
I am closing this issue, because it has covered too many different memory leaks. If you find this issue because you are experiencing a memory leak in this library, please file a new issue.
@murgatroid99 thanks for the information about keeping one client.
Is that documented somewhere? I tried looking for it in the grpc-js docs. A while back we were also wondering whether to keep one instance of the client or one per request, and we landed on the latter based on some assumptions about how grpc-js works, but it seems we were incorrect. So I was interested to see whether there are docs that go into this, and I'm also curious whether there are other best practices we should be following with grpc-js.
This is a recommendation for gRPC overall, as stated in our performance best practices doc.
It also seems to me that this should be the default assumption for object-oriented programming in general: you create a single object that is responsible for some domain of functionality, and then you call multiple methods on that object to invoke that functionality. And you only create multiple instances when you have some compelling reason to do so, such as needing to use different constructor arguments.
Problem description
We use the Spanner Node.js library for our API, and the latest grpc-js, 1.8.14, has memory leak issues, the same as 1.8.13.
Reproduction steps
We use the Spanner Node.js library for our API. We recently upgraded to Node 18, so in package-lock.json grpc-js was upgraded to 1.8.14. Before the upgrade we did not have the memory leak issue. After the upgrade we see a slow memory leak that manifests over 4-7 days, and our pods restart due to low memory.
Our stopgap solution is to downgrade @grpc/grpc-js to 1.6.2, and that seems to fix the issue. Thanks for looking into this. Much appreciated.
Environment
Additional context
We have about 100 pods in prod, and all of them exhibit the symptom after the library gets updated.