hyperledger / fabric-sdk-node

Hyperledger Fabric SDK for Node https://wiki.hyperledger.org/display/fabric
https://hyperledger.github.io/fabric-sdk-node/
Apache License 2.0

Memory leak when gateway.connect and gateway.disconnect are called repeatedly #529

Open Ryu-Shinzaki opened 2 years ago

Ryu-Shinzaki commented 2 years ago

We found memory-leak behavior when calling gateway.connect and gateway.disconnect repeatedly, even though we expected gateway.disconnect to clean up the associated resources.

We tried executing the code attached to this issue in the following environments.

We then got logs like the following:

initial state
rss [B], heapTotal [B], heapUsed [B], external [B], arrayBuffers [B]
65703936, 54411264, 17494312, 1593216, 125449
initializing network instance
Creating an gateway object
65773568, 54935552, 17339072, 1584432, 85154
Executing connect function
65773568, 54935552, 17365152, 1584576, 78976
Executing disconnect function
65773568, 54935552, 17366976, 1584576, 78976

... (repeated some times)

Executing connect function
89698304, 56246272, 19282304, 1769693, 262115
Executing disconnect function
89698304, 56246272, 19181592, 1769549, 251008

... (memory usage increases)

According to these logs, memory usage increases every time we call the connect and disconnect functions.

We have attached the code for reproduction to this issue: sdk-sample.tar.gz
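
A minimal sketch of the kind of loop described above (the attached sample itself is not reproduced here, so the connection profile path, wallet location, and identity label below are assumptions):

import * as fs from 'fs';
import { Gateway, Wallets } from 'fabric-network';

async function main(): Promise<void> {
  const connectionProfile = JSON.parse(fs.readFileSync('connection-profile.json', 'utf8')); // assumed path
  const wallet = await Wallets.newFileSystemWallet('./wallet'); // assumed wallet location

  for (let i = 0; i < 1000; i++) {
    const gateway = new Gateway();
    await gateway.connect(connectionProfile, {
      wallet,
      identity: 'appUser', // assumed identity label
      discovery: { enabled: true, asLocalhost: true },
    });
    gateway.disconnect();

    // Log the same fields that appear in the output above.
    const { rss, heapTotal, heapUsed, external, arrayBuffers } = process.memoryUsage();
    console.log(`${rss}, ${heapTotal}, ${heapUsed}, ${external}, ${arrayBuffers}`);
  }
}

main().catch(console.error);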

bestbeforetoday commented 2 years ago

I can observe similar characteristics over a 5-minute run of your test application, with an initial state of

Memory usage: rss=73707520, heapTotal=54771712, heapUsed=17472296, external=1772632, arrayBuffers=79336

And at the end of the 5-minute run:

Memory usage: rss=214474752, heapTotal=89112576, heapUsed=40513000, external=6138530, arrayBuffers=4477856

Then, after pausing for a few seconds to give the garbage collector a chance to do some cleanup, a final state of:

Memory usage: rss=215228416, heapTotal=56606720, heapUsed=40342008, external=6179602, arrayBuffers=4518768

It would need some heap profiling to identify exactly what is using the space and to confirm it isn't caused by some other aspect of the Node runtime or heap management.

I must point out that creating and discarding large numbers of connections is not good practice. You generally want to keep your Gateway connection around and use it for all work carried out by a client identity.
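
A sketch of that long-lived pattern with fabric-network; the channel, chaincode and identity names are purely illustrative, and connectionProfile, wallet and assetsToCreate are placeholders defined elsewhere:

const gateway = new Gateway();
await gateway.connect(connectionProfile, {
  wallet,
  identity: 'appUser', // placeholder identity label
  discovery: { enabled: true, asLocalhost: true },
});
try {
  const network = await gateway.getNetwork('mychannel'); // placeholder channel name
  const contract = network.getContract('basic');         // placeholder chaincode name
  for (const asset of assetsToCreate) {
    // Reuse the same gateway and contract for all work done by this client identity.
    await contract.submitTransaction('CreateAsset', asset.id, asset.owner);
  }
} finally {
  gateway.disconnect(); // disconnect once, when the application shuts down
}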

bestbeforetoday commented 2 years ago

One other thing: the sample client code you posted does not wait for completion of the async connect() call before going on to call disconnect(). This is incorrect usage, but I'm not sure it contributes to the memory behaviour you observe.
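
In other words (a sketch, with connectionProfile and options assumed to be defined elsewhere):

// Incorrect: disconnect() may run before the connection is fully established.
// gateway.connect(connectionProfile, options);
// gateway.disconnect();

// Correct: wait for the async connect() to complete before disconnecting.
await gateway.connect(connectionProfile, options);
gateway.disconnect();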

vchristaras commented 2 years ago

I have seen the same behaviour and noticed that the gRPC connections remain open both on the client and on the peer. I am running the same version of the SDK, with fabric-peer 2.2.2.

VaanPan commented 2 years ago

We notice the same issue here. In our project we create a gateway, connect, and then disconnect. The memory keeps leaking, and if we stop the client process, the client and peer memory is suddenly released.

galadj commented 2 years ago

Aside from the memory problems, this also causes connections to remain open, which can eventually cause a server to run out of available connections, as also reported here: https://stackoverflow.com/questions/49695485/network-leak-issue-with-event-hub-and-peer-connections-in-fabric-client-node-js

I understand the intention is not to create and discard gateways rapidly, but that should be a performance consideration, not a reason to leave memory and connections hanging.

Has anyone found a solution to this, other than forcibly restarting the node process the gateway is connecting through?

graphen007 commented 2 years ago

Any news on this?

dzikowski commented 2 years ago

I managed to reduce the connection leak by doing two things (both of them):

  1. closing the endorser connections specified in the connection profile just after the gateway connects
  2. manually closing the discovery service connection at the end

It worked in my case (for connection leaks; I didn't check memory leaks), but it took a lot of debugging and experimentation. It eliminated all connection leaks for successful chaincode invocations but left some when a chaincode invocation failed. Maybe it will help someone with debugging this. I will probably give up and use the Fabric Gateway for Fabric 2.4 anyway (https://github.com/hyperledger/fabric-gateway).

// connectOptions (wallet, identity, discovery settings) and peerNames (the endorsing
// peers listed in the connection profile) are assumed to be defined elsewhere.
await gateway.connect(connProfile, connectOptions);

// 1. Close the endorser connections created while processing the connection profile.
peerNames.forEach(peerName => {
  // @ts-ignore -- gateway.client is not part of the public typings
  gateway.client.endorsers.get(peerName).disconnect(); // these connections are replaced by new ones, but no longer leak
});

const network = await gateway.getNetwork(channelName);
// 2. Take the discovery service reference here, not after calling the contract.
const fabricDiscoveryService = network.discoveryService;

...

// Close the discovery service connection before disconnecting the gateway.
fabricDiscoveryService.close();
gateway.disconnect();

salimbene commented 1 year ago

I've been experiencing the same issue on Debian 11, with Fabric Node SDK 2.2.15 and Fabric 2.4.x. I will attempt the workaround proposed by @dzikowski.

bestbeforetoday commented 1 year ago

A slight change to connection closing was delivered in the v2.2.17 release, which might help with this issue. The workaround mentioned above does not seem ideal, but the connections created during connection profile processing do sound like a good candidate for the problem area - thank you @dzikowski for the investigation.

If you are using (or can use) Fabric v2.4 or later, you should use the Fabric Gateway client API, which has much more efficient connection behaviour. It can use a single gRPC connection (over which you have direct control) for all interactions with Fabric, regardless of the number of client identities you are working with. See the migration guide for details.
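
For reference, a hedged sketch of that newer client API (@hyperledger/fabric-gateway); the endpoint, certificate paths, MSP ID, channel and chaincode names are placeholders:

import * as grpc from '@grpc/grpc-js';
import * as crypto from 'crypto';
import { promises as fs } from 'fs';
import { connect, signers } from '@hyperledger/fabric-gateway';

async function run(): Promise<void> {
  // One gRPC connection, owned by the application and shared by all gateways.
  const tlsRootCert = await fs.readFile('tls-ca-cert.pem');
  const client = new grpc.Client('peer0.org1.example.com:7051', grpc.credentials.createSsl(tlsRootCert));

  const credentials = await fs.readFile('user-cert.pem');
  const privateKey = crypto.createPrivateKey(await fs.readFile('user-key.pem'));

  const gateway = connect({
    client,
    identity: { mspId: 'Org1MSP', credentials },
    signer: signers.newPrivateKeySigner(privateKey),
  });
  try {
    const contract = gateway.getNetwork('mychannel').getContract('basic');
    await contract.submitTransaction('CreateAsset', 'asset1', 'blue', '5');
  } finally {
    gateway.close(); // lightweight: does not tear down the gRPC connection
    client.close();  // close the shared gRPC connection only on application shutdown
  }
}

run().catch(console.error);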

bh4rtp commented 1 year ago

I use SDK 2.2.18. I connect to the Fabric network, submit tens of transactions, and then disconnect. It also ends with a core dump.

node[30909]: ../src/node_http2.cc:561:static void* node::http2::Http2Session::MemoryAllocatorInfo::H2Realloc(void*, size_t, void*): Assertion `(session->current_nghttp2_memory_) >= (previous_size)' failed.
 1: 0x8fb090 node::Abort() [node]
 2: 0x8fb165  [node]
 3: 0x95ecfa  [node]
 4: 0x1738b28 nghttp2_session_close_stream [node]
 5: 0x173fe8a nghttp2_session_mem_recv [node]
 6: 0x95af67 node::http2::Http2Session::ConsumeHTTP2Data() [node]
 7: 0x95b1ef node::http2::Http2Session::OnStreamRead(long, uv_buf_t const&) [node]
 8: 0xa2cc21 node::TLSWrap::ClearOut() [node]
 9: 0xa2cfc0 node::TLSWrap::OnStreamRead(long, uv_buf_t const&) [node]
10: 0x9d1021  [node]
11: 0xa7d3d9  [node]
12: 0xa7da00  [node]
13: 0xa83b58  [node]
14: 0xa71bbb uv_run [node]
15: 0x905665 node::Start(v8::Isolate*, node::IsolateData*, std::vector<std::string, std::allocator<std::string> > const&, std::vector<std::string, std::allocator<std::string> > const&) [node]
16: 0x90374f node::Start(int, char**) [node]
17: 0x7f121e0e9445 __libc_start_main [/lib64/libc.so.6]
18: 0x8bce95  [node]
bestbeforetoday commented 1 year ago

I use SDK 2.2.18. I connect to the Fabric network, submit tens of transactions, and then disconnect. It also ends with a core dump.

Is this a new problem that did not occur with previous SDK versions and has only started appearing with v2.2.18? It looks like a physical memory allocation failure in the Node runtime, so it might be worth checking the version of Node you are using and also monitoring the system memory used by, and available to, the Node process.
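
A small sketch of that kind of monitoring, assuming it is added somewhere in the client application (the interval length is arbitrary):

import * as os from 'os';

console.log(`Node version: ${process.version}`);
setInterval(() => {
  const { rss, heapTotal, heapUsed, external } = process.memoryUsage();
  console.log(`process: rss=${rss} heapTotal=${heapTotal} heapUsed=${heapUsed} external=${external}`);
  console.log(`system: free=${os.freemem()} total=${os.totalmem()}`);
}, 10000);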

ConstantRohmer commented 3 months ago

Hi, I originally posted this on the Discord server, but I came across this issue, which applies to me. I am using HLF v3.0.0-beta with SmartBFT as the consensus.

I see a linear increase in the memory usage of my orderers (all of them, not only the leader) up to a certain point (between 400 and 750 MB, although I don't have an overflow, neither for my entire system nor for the Docker containers). The limit of this increase doesn't seem to be an amount of memory but a length of time after which it stops increasing: no matter the value it has reached, it practically stops increasing after ~10 h (x-axis = 20k). We can clearly see that on the plot I added. Does anyone have any explanation?

[Plot: memory usage of orderer and peer nodes over four 15-hour runs]

More explanation on the plot itself and my experiments: the plot shows the evolution of memory usage of the orderer and peer nodes for 4 different runs that each lasted 15 h. (The x-axis is not very clear, but 1 unit corresponds to ~7 s, which makes the total around 58 h = 4 x ~15 h.) I used multiple channels (5 every time) for each of my runs, but depending on the run the channels are not updated at the same frequency, which would perfectly explain the difference between the 2 orange lines on runs 2 and 4. Each run is actually a repetition (7500x here) of the same actions (some writes and reads), and I open a gateway (and close it at the end) for every repetition. According to what was mentioned in this issue, I believe this could be the problem, but I thought that using the Gateway should fix this, so I don't know if there is something I could do (except, of course, not recreating the gateway for every repetition: this is maybe doable in my case but would deeply modify my implementation).

In any case, this doesn't explain the increase stopping at x ≈ 20k for every run I made, which is not consistent with a memory leak, but maybe there is a link that I am not able to see...

bestbeforetoday commented 3 months ago

@ConstantRohmer This repository contains the (legacy) SDK for Node.js client applications, and this issue relates to Node.js client application memory usage. For issues relating to the Fabric orderer, you need to use the core Fabric repository.

ConstantRohmer commented 3 months ago

@ConstantRohmer This repository contains the (legacy) SDK for Node.js client applications, and this issue relates to Node.js client application memory usage. For issues relating to the Fabric orderer, you need to use the core Fabric repository.

Sorry about that, I didn't pay attention to which repo this issue was in. However, now that I know, the opening/closing of gateway connections should not increase the orderers' memory usage but only the peers', so the problem I am experiencing is probably not due to that (at least not only to that).

I will switch to the correct repo and ask my question again!