grpc / grpc-node

gRPC for Node.js
https://grpc.io
Apache License 2.0
4.48k stars 648 forks source link

`@grpc/grpc-js` channel and memory leak (with a minimal reproduction) #1085

Closed merlinnot closed 4 years ago

merlinnot commented 5 years ago

Problem description

@grpc/grpc-js v0.6.9 seemingly leaks both memory and open network connections.

I did my best to provide a minimal possible reproduction by using gRPC directly, without any other layers on top, such as Google SDKs. Total size of the reproduction is ~50 LOC of readable TypeScript and uses @grpc/grpc-js, @grpc/proto-loader, ramda and path modules and packages exclusively. I hope it will make it easier to pinpoint and resolve an issue.

I'd appreciate if you could confirm that the issue is reproducible by following the steps below.

Reproduction steps

  1. Download reproduction-grpc.zip.
  2. Unzip the file.
  3. Navigate to the extracted directory.
  4. Run npm install.
  5. Run Firestore Emulator, either through gcloud or using a Docker container.
  6. Adjust the hostname and port in configuration.ts file if necessary.
  7. Run node -r ts-node/register --expose-gc index.ts.
  8. Heap size will be logged, see it steadily growing. Garbage collection is forced before every log statement to represent accurate results.
  9. Run lsof -i tcp:4500 -n -P, adjust the port in this script to the one which Firestore Emulator is running on. A list of open sockets will be displayed.
  10. Wait 10 seconds, repeat step 9. Wait 10 seconds, repeat step 9. Wait 10 seconds, ... A steadily increasing number of open sockets can be observed.

Environment

Additional context

This might possibly be a root cause of https://github.com/googleapis/nodejs-firestore/issues/768.

murgatroid99 commented 5 years ago

When I run this, I do not see the many sockets you mentioned, and heap snapshots indicate that only 2 subchannels are ever created, which is the correct behavior. The behavior I see here matches the results in my comment on the linked issue: https://github.com/googleapis/nodejs-firestore/issues/768#issuecomment-539651695. The primary cause of the leak is loading the proto file again for every request.

merlinnot commented 5 years ago

Here's a result I see on my system after running it for ~1 minute:

Screen Shot 2019-10-24 at 21 43 47

As you can see, process ID is exactly the same. I understand that the behavior might be different depending in your environment. I'll check if I'll be able to reproduce it in a CI, so you would see it in an environment that can be easily recreated.

murgatroid99 commented 5 years ago

I don't actually see why that behavior would be different depending on the environment. None of the subchannel reuse logic depends on anything environmental that I can think of.

merlinnot commented 5 years ago

I managed to reproduce it in the CI (GitHub Actions), on a Linux VM. It's in a public repository and GitHub Actions are free, so you can clone it and experiment yourself very easily: https://github.com/merlinnot/grpc-node-leak-reproduction/commit/94c1eaa40ba8c48a203a9b4f86a51e6aacba87b7/checks?check_suite_id=280340116#step:6:1

The script that is in there runs lsof -i :8080 every five seconds and starts the repro.

1) In first few logs there's just one connection. 2) In line 34 of step 6 you can see a first duplicate. 3) In line 67 you can see 3 sockets. 4) In line 91 you can see 4 sockets. ... 5) In line 2142 you can see 24 sockets and the workflow times out.

I can make it run longer if you'd like, number of sockets keeps increasing.

murgatroid99 commented 5 years ago

OK, I was seeing different results because I didn't have the firestore emulator set up properly, so no connections were getting established and the script was just running anyway.

I can now reproduce the increasing number of open connections. It looks like connections are not getting properly cleaned up when the corresponding Subchannel is removed. That is a bug, and it probably explains the average few hundred bytes per call leak that I mentioned in my linked comment, but it is not a significant contributor to the memory leak. It's just one or two leaked Http2Session objects every 10 seconds. As previously stated, the main problem is that you are loading the proto file for every request.

Ironically, if you didn't close the clients, or you reused a single client, you wouldn't see that connection leak at all. It's only happening because each subchannel periodically has no references other than from the SubchannelPool, so when the SubchannelPool timer runs it deletes those subchannels and new ones need to be created.

bcoe commented 5 years ago

@merlinnot thank you for your awesome work isolating this :tada:

merlinnot commented 5 years ago

@murgatroid99 I tested v0.6.10 and I have several observations:

I'll try to provide yet another reproduction, but next week the soonest. Memory leak is still reproducible, so I think it's best to keep this issue open.

murgatroid99 commented 5 years ago

I just published @grpc/proto-loader version 0.5.3 that might fix the memory leak.

bcoe commented 5 years ago

@merlinnot if possible, let us know if this does the trick for squashing some of the memory leaks you've been able to reproduce, at which point we can upstream the work to gax.

merlinnot commented 5 years ago

On it! I'll catch up with some critical tasks first and will test it straight away, ~45 minutes.

merlinnot commented 5 years ago

I ran the tests with @grpc/grpc-js@0.6.10 and @grpc/proto-loader@0.5.3:

Screen Shot 2019-11-04 at 20 36 52 Screen Shot 2019-11-04 at 20 37 20

The socket leak is still there, I'll have to work on isolating it (hundreds of open connections):

Screen Shot 2019-11-04 at 20 36 15

When it comes to the memory leak, it's definitely better! Here is the before:

Screen Shot 2019-11-04 at 20 41 57

And here is the after:

Screen Shot 2019-11-04 at 20 35 34

500 MBs saved! I'll now work on reproducing the socket leak again.

merlinnot commented 5 years ago

Also, the socket leak doesn't necessarily mean that there is an issue with @grpc/grpc-js, at least until I can provide a reproduction. This might be as well some mistake on my side. I personally don't think it should be blocking for releasing a new version, we can always ship yet another one with more fixes.

merlinnot commented 5 years ago

Actually, the memory leak is still reproducible using the code provided above! @murgatroid99, could you please double check if I'm correct? Maybe I missed something, but it seems to me that it's still there.

murgatroid99 commented 5 years ago

Using your original leak reproduction code, I definitely don't see the socket leak. After running the code for several minutes, I still see exactly two sockets open in the lsof output and I see exactly two Node / Http2Session objects in the heap dump, so I am fairly confident that the socket leak is fixed.

I do still see a steady memory leak, but none of the objects grpc creates seem to be increasing. The objects in the heap dumps that are increasing in quantity over time all seem to be V8 internal data structures related to compiling code. It may be the case that loading a .proto file is inherently slightly leaky, because it is not just generating data, it actually generates classes and functions, which V8 needs to compile internally.

bcoe commented 4 years ago

@merlinnot I've released a testing version of @google-cloud/pubsub as @google-cloud/pubsub@next (1.1.6-beta.0). If you have the cycles, would you mind testing this out and letting us know if you see improvements to the memory issues you were seeing?

merlinnot commented 4 years ago

Hey, sorry for the late reply, I was waiting for another API to land (closing clients in Firestore and PubSub), but then I got caught into discussions about that one and forgot about your question.

The current status is that we've discovered yet another reason for leaks: Firestore uses multiple gRPC clients and these are not being closed, more here: https://github.com/googleapis/nodejs-firestore/issues/769, https://github.com/googleapis/gapic-generator-typescript/pull/126. So now I'm waiting for libraries to pick it up, so we have another source of leaks fixed, and then... I'll continue debugging.

murgatroid99 commented 4 years ago

With the fixes linked from this issue, unused open clients/channels shouldn't be a significant source of memory or socket leaks. Channels constructed with the same arguments should use the same underlying subchannels, subchannels that disconnect should be dropped from all channels and eventually released entirely, and unused channels should not create new subchannels.

merlinnot commented 4 years ago

I personally don't have a full vertical overview of all of the components and their internals, so I'm trying to help the best way I can: by providing reproductions, contributing fixes to issues I'm able to identify myself and opening issues for discussion, which most usually have a good reasoning behind them, like the one linked above.

What I understood from the internals of Firestore is that there is a pool of clients, but clients are sometimes dereferenced without properly closing them, see https://github.com/googleapis/nodejs-firestore/blob/a4efa097ef8de9ca4944356ab8767ddaa94c4188/dev/src/pool.ts.

Please see this thread (which was not opened by me), where a lot of different people experience massive memory leaks when using newer versions of Firestore. My experience is the same. I'm using VMs with 28.8 GBs of RAM for processes that I'd normally expect to run on ~2-3 GBs. This generates much higher costs. Others have the same problem: https://github.com/googleapis/nodejs-firestore/issues/768#issuecomment-554606861.

I tested the same process after mocking the library (so it didn't even load its files) and the leak was gone, so I'm fairly certain it's somewhere there.

At this point I can't tell if that's an issue with gRPC itself, Firestore or any other of its dependencies. I'm just trying to hammer out all of the leaks I see, step by step. I do understand that you have the knowledge to assess how much of an impact might a particular leak have, but I don't. So as soon as I find one, I report, get it fixed and look for a next one.

bcoe commented 4 years ago

@merlinnot your hard work helping make this important library to the community better isn't unnoticed.

Thank you!

merlinnot commented 4 years ago

It looks much better after the fixes, I can't reproduce any leaks now. I hope that after implementation of the terminate method in @google-cloud/pubsub I'll be leak-free 🎉 If not, I'll work on a repro and open another issue. Thanks guys!

bcoe commented 4 years ago

@merlinnot is there a tracking issue on @google-cloud/pubsub for this? Thanks for all your help.

merlinnot commented 4 years ago

Yes, there is: https://github.com/googleapis/nodejs-pubsub/issues/817