grpc / grpc-node

gRPC for Node.js
https://grpc.io
Apache License 2.0

@grpc/grpc-js has memory leak issue with latest version #2448

Closed alanzhao closed 1 year ago

alanzhao commented 1 year ago

Problem description

We use the Spanner Node.js library for our API, and the latest grpc-js, 1.8.14, has a memory leak issue; 1.8.13 has the same problem.

Reproduction steps

We use the Spanner Node.js library for our API. We recently upgraded to Node 18, so in package-lock.json, grpc-js got upgraded to 1.8.14. Before the upgrade we did not have the memory leak issue. After the upgrade we see a slow memory leak that manifests over 4-7 days, and our pods restart due to low memory.

Our stop-gap solution is to downgrade @grpc/grpc-js to 1.6.2, and that seems to fix the issue. Thanks for looking into this. Much appreciated.

Environment

Additional context

We have about 100 pods in prod, and all of them exhibit the symptom after the library gets updated.

murgatroid99 commented 1 year ago

This isn't really enough information to find the problem. Can you provide code that demonstrates this memory leak, preferably without needing to access an external service? Can you provide heap dumps from at least two different points in a process's lifetime that show the memory growth?

Please also check whether you are creating multiple separate client objects. If you are, either change your code to not do that or make sure you are closing them when you are done with them. See #2436, #2402, and #2003 for previous examples of this.
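To make the second part concrete, here is a minimal sketch of closing a short-lived client, assuming a generated client class from a hypothetical my_service.proto (the proto path, package, service, and method names are placeholders, not from this thread):

const grpc = require('@grpc/grpc-js');
const protoLoader = require('@grpc/proto-loader');

// my_service.proto, mypackage, and MyService are illustrative placeholders.
const packageDef = protoLoader.loadSync('my_service.proto');
const { MyService } = grpc.loadPackageDefinition(packageDef).mypackage;

// If you must create a client per call, close it when the call finishes;
// otherwise each client keeps its channel (and the channel's timers) alive.
function oneShotCall(request, callback) {
  const client = new MyService('localhost:50051', grpc.credentials.createInsecure());
  client.doSomething(request, (err, res) => {
    client.close();
    callback(err, res);
  });
}

If nothing forces a per-call client, the simpler fix is to construct the client once and reuse it for every call.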

alanzhao commented 1 year ago
async function getValue({ userId }) {
  const sql = `
    SELECT *
    FROM ${table}
    WHERE userId = @userId
  `;

  const query = {
    sql,
    params: { userId },
    json: true,
  };

  const [[total]] = await clients.spanner.clientA.run(query);
  return total;
}

module.exports = getValue;

Here is a sample query that we use. We have a wrapper around Spanner with 3 clients: clients.spanner.clientA, clients.spanner.clientB, and clients.spanner.clientC.

function allClose() {
  const closePromises = [];

  if (clients.spanner) {
    if (clients.spanner.clientA) {
      closePromises.push(clients.spanner.clientA.close());
    }

    if (clients.spanner.clientB) {
      closePromises.push(clients.spanner.clientB.close());
    }

    if (clients.spanner.clientC) {
      closePromises.push(clients.spanner.clientC.close());
    }
  }

  return Promise.all(closePromises);
}

which is used in the GraphQL server's graceful shutdown here:

return startApp(app, ready, () => allClose());

The reason we have 3 Spanner clients is that they talk to 3 separate Spanner databases. Right now we do not use 2 of them; I can try removing those and upgrading grpc-js back to the latest version to see if that fixes the issue. Thanks for the tips.

murgatroid99 commented 1 year ago

Just three clients shouldn't be an issue. The problem I was talking about comes from creating a large number of clients.

weyert commented 1 year ago

Yes, I am also experiencing memory leaks with a library that uses this one. Newer versions of google-gax seem to cause out-of-memory dumps in Node.js v18 in GKE. Downgrading to 3.5.2 improves things. They have this library pinned to ~1.7.0, so maybe it's related to changes since v1.8?

murgatroid99 commented 1 year ago

Do you know how quickly you were leaking memory, in terms of amount of memory leaked per request made or per period of time? An order of magnitude estimate would be enough to help narrow this down.

murgatroid99 commented 1 year ago

Update: we have been unable to reproduce a memory leak with our own tests. Please provide a complete reproduction that leaks memory so that we can investigate this further.

kruegernet commented 1 year ago

I also have this and will try to provide a reproduction when possible/time allows.

I'm coming from the last few years in Go, so IoC/DI frameworks in Node are new to me, and I originally believed my problem originated there, with unintentionally creating multiple clients. I'm still open to the possibility that it's an Inversify misuse on my part, but I'm more inclined to believe the issue is with grpc-js, as with @alanzhao.

SophiaH67 commented 1 year ago

Hi,

I have also found memory leaks originating from this package (I also saw some mentions of google-gax in a few places in the memory debugger).

From what I can tell, this seems to be an issue with certain timeouts in Detached Node / nghttp2_memory entries: [screenshot]

With the following retainers list: [screenshot]

This snapshot uses comparison mode in DevTools, with 1 HTTP request made between the 2 snapshots; in our app, one request does make multiple gRPC calls.

I personally could not create a minimal reproduction app, even with knowledge of this existing codebase. I still felt this would be useful to share, for a few reasons:

Thank you for your time,

Sophia~

murgatroid99 commented 1 year ago

@SophiaH67 In order to clearly demonstrate a significant memory leak, your test would need to do the following:

  1. Run your test in a loop, to show that memory usage keeps growing.
  2. Ensure that garbage collection runs, to make sure this is a real leak, and not a temporary spike in memory usage. You can do this by running Node with the --expose-gc flag and calling global.gc() before taking a snapshot.

If you do this and you observe memory retention growing in proportion to the number of requests made, then please share your test code and I can investigate.
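For example, a rough harness along these lines, run with node --expose-gc, would show whether the post-GC heap keeps climbing (leakCheck and makeOneCall are illustrative names; makeOneCall stands in for whatever gRPC work you suspect is leaking):

// leak-check.js - run with: node --expose-gc leak-check.js
async function leakCheck(makeOneCall, iterations = 1000) {
  for (let i = 1; i <= iterations; i++) {
    await makeOneCall();
    if (i % 100 === 0) {
      global.gc(); // available because of --expose-gc
      const usedMb = process.memoryUsage().heapUsed / 1024 / 1024;
      console.log(`after ${i} calls: ${usedMb.toFixed(1)} MB heap used post-GC`);
    }
  }
}

module.exports = leakCheck;

If that number stabilizes after the first few hundred iterations, you are most likely seeing reusable resources rather than a leak.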

SophiaH67 commented 1 year ago

@murgatroid99

Thanks for your time! I should've prefaced this by saying it's my first time debugging memory leaks, so my bad if I missed anything. From what I could tell, the record-snapshot button automatically runs garbage collection, but I cannot find a source for that, so I don't know where I learned it.

After rerunning my tests in our production app with manual gc() calls through devtools before snapshotting, it does still happen.

In a small reproduction setup however, I could not find any problem with this library. Which leads me to think it's probably something in either a different library, or our own code!

murgatroid99 commented 1 year ago

Have you also confirmed that the memory keeps growing with multiple successive calls? gRPC creates some resources on the first request that it reuses in later requests, so seeing more memory after the first request than before only confirms that, it doesn't show that anything is wrong.

SophiaH67 commented 1 year ago

I have finally found the issue our team was running into!

We were using nestjs's official terminus package for health checks, which created a new channel for every health check attempt.

After doing some digging, it seems that these channels cannot be garbage collected by Node.js, due to several self-references (in timeouts and other things). That by itself could also be considered a bug, but I'm not familiar enough with the garbage collector to solve it.

I did make a PR in terminus to cache the channels it opens, which I built and deployed in our app.

PR: https://github.com/nestjs/terminus/pull/2329

Results: [screenshot]

As you can see here, this has totally solved our issue, and explains why I couldn't reproduce it with a minimal setup!

Still thanks to @murgatroid99 for your time and explaining garbage collector stuff to me ❤️

murgatroid99 commented 1 year ago

Calling channel.close() should make it eligible for garbage collection. It looks like terminus did not do that when they stopped using the channel, which would explain why it was retained.
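For illustration, a minimal sketch of a per-check channel that is closed afterwards (the target, options, and the health-check logic itself are placeholders, not the terminus code):

const { Channel, credentials, connectivityState } = require('@grpc/grpc-js');

// If a new channel is created for every health check, it has to be closed
// afterwards; otherwise its internal timers keep it reachable and it is
// never garbage collected.
function checkChannelHealth(target) {
  const channel = new Channel(target, credentials.createInsecure(), {});
  try {
    // A real health check would make an RPC over this channel; checking the
    // connectivity state here is just a stand-in.
    return channel.getConnectivityState(true) !== connectivityState.SHUTDOWN;
  } finally {
    channel.close(); // makes the channel eligible for garbage collection
  }
}

Caching and reusing a single channel, as the terminus PR above does, avoids the per-check setup cost entirely.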

hvishwas commented 1 year ago

We have seen memory leaks on our services too. The ECS tasks get abruptly killed with "8 RESOURCE_EXHAUSTED: client: xxx. Exceeded rate limit.". We are on grpc-js 1.9.0 and Node 18.17.1.

murgatroid99 commented 1 year ago

"rate limit" generally refers to a limit on how frequently you can call an API. So, an error that says "Exceeded rate limit" generally refers to excess API call volume, not memory growth.

hvishwas commented 1 year ago

> We have seen memory leaks on our services too. The ECS tasks get abruptly killed with "8 RESOURCE_EXHAUSTED: client: xxx. Exceeded rate limit.". We are on grpc-js 1.9.0 and Node 18.17.1.

It was a false alarm. We had upgraded our application from Node 14 to Node 18. The restarts were caused by internal code that threw an unhandled promise rejection, which terminated the application. Apparently, terminating the process on unhandled rejections was introduced in Node 15. https://maximorlov.com/node-js-15-is-out-what-does-it-mean-for-you/

Thibault2ss commented 1 year ago

I've been seeing the same behavior, @murgatroid99. Basically, my test setup is, in a while (true) loop:

While doing that, I log process.memoryUsage().heapUsed / 1000000 (in MB) every second, and it slowly grows. Maybe a ref somewhere is preventing the garbage collector from collecting something.

If I run this test script with the Node.js option --max-old-space-size=500, then once I see the heap used reach ~500 MB, I get an OOM error:

FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory

My setup:

import { Client, Metadata, credentials } from '@grpc/grpc-js'

// some metrics I want to keep track of
const metrics = {
  frequency: 1000, // 1s
  countSinceLastTick: 0,
  countTotal: 0,
  speed: 0,
}

// this reporter will log the metrics every second
const reporter = setInterval(() => {
  process.stdout.clearLine(0)
  console.log(`processed ${metrics.countTotal}`)
  process.stdout.clearLine(0)
  console.log(`speed ${Math.round(metrics.countSinceLastTick)} calls per sec`)
  process.stdout.clearLine(0)
  console.log(`mem usage ${process.memoryUsage().heapUsed / 1000000}MB`)
  metrics.countSinceLastTick = 0
  process.stdout.moveCursor(0, -3)
}, metrics.frequency)

// this call makes a single rpc request, instantiating a new client + closing it at the end
const makeRequest = async () => {
  const sslCreds = credentials.createSsl(
    Buffer.from(process.env.ROOT_CERT, 'utf-8'),
    Buffer.from(process.env.PRIVATE_KEY, 'utf-8'),
    Buffer.from(process.env.CERT, 'utf-8')
  ) // this is just because in my case I use mTLS

  const options = {
    'grpc.keepalive_time_ms': 120000,
    'grpc.http2.min_time_between_pings_ms': 120000,
    'grpc.keepalive_timeout_ms': 20000,
    'grpc.http2.max_pings_without_data': 0,
    'grpc.keepalive_permit_without_calls': 1,
    'grpc.service_config_disable_resolution': 1,
  }

  const grpcClient = new Client(`your.host.com:8700`, sslCreds, options)

  const deserialize = <T>(argument: T) => argument
  const serialize = (argument: Uint8Array) => Buffer.from(argument)

  await new Promise((resolve, reject) => {
    grpcClient.makeUnaryRequest(
      '/your.Service/Healthcheck',
      serialize,
      deserialize,
      Uint8Array.from([]),
      new Metadata(),
      (err, res) => {
        if (err) {
          reject(err)
        } else {
          resolve(res || Uint8Array.from([]))
        }
      }
    )
  })

  grpcClient.close()
}

// main function
const main = async () => {
  while (true) {
    await makeRequest()
    metrics.countSinceLastTick++
    metrics.countTotal++
  }
}

main()

murgatroid99 commented 1 year ago

@Thibault2ss I think I fixed that with #2606, according to local testing.

murgatroid99 commented 1 year ago

That fix is out in version 1.9.8

Valeronlol commented 1 year ago

Same issue here: grpc-js 1.9.9, nestjs/microservices 9.4.3, Node image node:20.9-alpine.

murgatroid99 commented 1 year ago

@Valeronlol The code in https://github.com/grpc/grpc-node/issues/2448#issuecomment-1782690279 does not show a memory leak in 1.9.9 in my local testing. If you are experiencing a memory leak, please share code that demonstrates it.

murgatroid99 commented 1 year ago

I am closing this issue, because it has covered too many different memory leaks. If you find this issue because you are experiencing a memory leak in this library, please file a new issue.

adiun commented 8 months ago

@murgatroid99 thanks for the information about keeping one client.

Is that documented somewhere? I tried looking for it in the grpc-js docs. A while back we were also wondering whether to keep one instance of the client or create one per request; we landed on the latter based on some assumptions about how grpc-js works, but it seems we were incorrect. So I was interested to see if there are docs that go into this, and I'm also curious whether there are other best practices we should follow with grpc-js.

murgatroid99 commented 8 months ago

This is a recommendation for gRPC overall, as stated in our performance best practices doc.

It also seems to me that this should be the default assumption for object-oriented programming in general: you create a single object that is responsible for some domain of functionality, and then you call multiple methods on that object to invoke that functionality. And you only create multiple instances when you have some compelling reason to do so, such as needing to use different constructor arguments.
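In code, that recommendation usually looks something like the following sketch (the proto path, package, service, and method names are placeholders): construct one client at startup and have every request handler call methods on it, instead of constructing a client per request.

const grpc = require('@grpc/grpc-js');
const protoLoader = require('@grpc/proto-loader');

// One long-lived client per process (or per target), created at startup.
// my_service.proto, mypackage, and MyService are illustrative placeholders.
const packageDef = protoLoader.loadSync('my_service.proto');
const { MyService } = grpc.loadPackageDefinition(packageDef).mypackage;
const client = new MyService('my-backend.example.com:443', grpc.credentials.createSsl());

// Every request handler reuses the same client, so the underlying channel
// and HTTP/2 connections are shared instead of being re-created per request.
function handleRequest(payload, callback) {
  client.doSomething(payload, callback);
}

// Close the client only at process shutdown.
process.on('SIGTERM', () => client.close());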