Closed merlinnot closed 4 years ago
When I run this, I do not see the many sockets you mentioned, and heap snapshots indicate that only 2 subchannels are ever created, which is the correct behavior. The behavior I see here matches the results in my comment on the linked issue: https://github.com/googleapis/nodejs-firestore/issues/768#issuecomment-539651695. The primary cause of the leak is loading the proto file again for every request.
Here's a result I see on my system after running it for ~1 minute:
As you can see, process ID is exactly the same. I understand that the behavior might be different depending in your environment. I'll check if I'll be able to reproduce it in a CI, so you would see it in an environment that can be easily recreated.
I don't actually see why that behavior would be different depending on the environment. None of the subchannel reuse logic depends on anything environmental that I can think of.
I managed to reproduce it in the CI (GitHub Actions), on a Linux VM. It's in a public repository and GitHub Actions are free, so you can clone it and experiment yourself very easily: https://github.com/merlinnot/grpc-node-leak-reproduction/commit/94c1eaa40ba8c48a203a9b4f86a51e6aacba87b7/checks?check_suite_id=280340116#step:6:1
The script that is in there runs lsof -i :8080
every five seconds and starts the repro.
1) In first few logs there's just one connection. 2) In line 34 of step 6 you can see a first duplicate. 3) In line 67 you can see 3 sockets. 4) In line 91 you can see 4 sockets. ... 5) In line 2142 you can see 24 sockets and the workflow times out.
I can make it run longer if you'd like, number of sockets keeps increasing.
OK, I was seeing different results because I didn't have the firestore emulator set up properly, so no connections were getting established and the script was just running anyway.
I can now reproduce the increasing number of open connections. It looks like connections are not getting properly cleaned up when the corresponding Subchannel is removed. That is a bug, and it probably explains the average few hundred bytes per call leak that I mentioned in my linked comment, but it is not a significant contributor to the memory leak. It's just one or two leaked Http2Session objects every 10 seconds. As previously stated, the main problem is that you are loading the proto file for every request.
Ironically, if you didn't close the clients, or you reused a single client, you wouldn't see that connection leak at all. It's only happening because each subchannel periodically has no references other than from the SubchannelPool, so when the SubchannelPool timer runs it deletes those subchannels and new ones need to be created.
@merlinnot thank you for your awesome work isolating this :tada:
@murgatroid99 I tested v0.6.10 and I have several observations:
@google-cloud/firestore
(although this dep is pinned in google-gax
, I forced it to run on v0.6.10).I'll try to provide yet another reproduction, but next week the soonest. Memory leak is still reproducible, so I think it's best to keep this issue open.
I just published @grpc/proto-loader
version 0.5.3 that might fix the memory leak.
@merlinnot if possible, let us know if this does the trick for squashing some of the memory leaks you've been able to reproduce, at which point we can upstream the work to gax
.
On it! I'll catch up with some critical tasks first and will test it straight away, ~45 minutes.
I ran the tests with @grpc/grpc-js@0.6.10
and @grpc/proto-loader@0.5.3
:
The socket leak is still there, I'll have to work on isolating it (hundreds of open connections):
When it comes to the memory leak, it's definitely better! Here is the before:
And here is the after:
500 MBs saved! I'll now work on reproducing the socket leak again.
Also, the socket leak doesn't necessarily mean that there is an issue with @grpc/grpc-js
, at least until I can provide a reproduction. This might be as well some mistake on my side. I personally don't think it should be blocking for releasing a new version, we can always ship yet another one with more fixes.
Actually, the memory leak is still reproducible using the code provided above! @murgatroid99, could you please double check if I'm correct? Maybe I missed something, but it seems to me that it's still there.
Using your original leak reproduction code, I definitely don't see the socket leak. After running the code for several minutes, I still see exactly two sockets open in the lsof
output and I see exactly two Node / Http2Session
objects in the heap dump, so I am fairly confident that the socket leak is fixed.
I do still see a steady memory leak, but none of the objects grpc creates seem to be increasing. The objects in the heap dumps that are increasing in quantity over time all seem to be V8 internal data structures related to compiling code. It may be the case that loading a .proto
file is inherently slightly leaky, because it is not just generating data, it actually generates classes and functions, which V8 needs to compile internally.
@merlinnot I've released a testing version of @google-cloud/pubsub
as @google-cloud/pubsub@next
(1.1.6-beta.0
). If you have the cycles, would you mind testing this out and letting us know if you see improvements to the memory issues you were seeing?
Hey, sorry for the late reply, I was waiting for another API to land (closing clients in Firestore and PubSub), but then I got caught into discussions about that one and forgot about your question.
The current status is that we've discovered yet another reason for leaks: Firestore uses multiple gRPC clients and these are not being closed, more here: https://github.com/googleapis/nodejs-firestore/issues/769, https://github.com/googleapis/gapic-generator-typescript/pull/126. So now I'm waiting for libraries to pick it up, so we have another source of leaks fixed, and then... I'll continue debugging.
With the fixes linked from this issue, unused open clients/channels shouldn't be a significant source of memory or socket leaks. Channels constructed with the same arguments should use the same underlying subchannels, subchannels that disconnect should be dropped from all channels and eventually released entirely, and unused channels should not create new subchannels.
I personally don't have a full vertical overview of all of the components and their internals, so I'm trying to help the best way I can: by providing reproductions, contributing fixes to issues I'm able to identify myself and opening issues for discussion, which most usually have a good reasoning behind them, like the one linked above.
What I understood from the internals of Firestore is that there is a pool of clients, but clients are sometimes dereferenced without properly closing them, see https://github.com/googleapis/nodejs-firestore/blob/a4efa097ef8de9ca4944356ab8767ddaa94c4188/dev/src/pool.ts.
Please see this thread (which was not opened by me), where a lot of different people experience massive memory leaks when using newer versions of Firestore. My experience is the same. I'm using VMs with 28.8 GBs of RAM for processes that I'd normally expect to run on ~2-3 GBs. This generates much higher costs. Others have the same problem: https://github.com/googleapis/nodejs-firestore/issues/768#issuecomment-554606861.
I tested the same process after mocking the library (so it didn't even load its files) and the leak was gone, so I'm fairly certain it's somewhere there.
At this point I can't tell if that's an issue with gRPC itself, Firestore or any other of its dependencies. I'm just trying to hammer out all of the leaks I see, step by step. I do understand that you have the knowledge to assess how much of an impact might a particular leak have, but I don't. So as soon as I find one, I report, get it fixed and look for a next one.
@merlinnot your hard work helping make this important library to the community better isn't unnoticed.
Thank you!
It looks much better after the fixes, I can't reproduce any leaks now. I hope that after implementation of the terminate
method in @google-cloud/pubsub
I'll be leak-free 🎉 If not, I'll work on a repro and open another issue. Thanks guys!
@merlinnot is there a tracking issue on @google-cloud/pubsub
for this? Thanks for all your help.
Yes, there is: https://github.com/googleapis/nodejs-pubsub/issues/817
Problem description
@grpc/grpc-js
v0.6.9 seemingly leaks both memory and open network connections.I did my best to provide a minimal possible reproduction by using gRPC directly, without any other layers on top, such as Google SDKs. Total size of the reproduction is ~50 LOC of readable TypeScript and uses
@grpc/grpc-js
,@grpc/proto-loader
,ramda
andpath
modules and packages exclusively. I hope it will make it easier to pinpoint and resolve an issue.I'd appreciate if you could confirm that the issue is reproducible by following the steps below.
Reproduction steps
npm install
.gcloud
or using a Docker container.configuration.ts
file if necessary.node -r ts-node/register --expose-gc index.ts
.lsof -i tcp:4500 -n -P
, adjust the port in this script to the one which Firestore Emulator is running on. A list of open sockets will be displayed.Environment
@grpc/grpc-js
v0.6.9Additional context
This might possibly be a root cause of https://github.com/googleapis/nodejs-firestore/issues/768.