libp2p / js-libp2p

The JavaScript Implementation of libp2p networking stack.
https://libp2p.io
Other
2.27k stars 436 forks source link

investigation: hanging TCPSocketWrap in js-libp2p-amino-dht-bootstrapper #2537

Open SgtPooki opened 2 months ago

SgtPooki commented 2 months ago

related to https://github.com/libp2p/js-libp2p-amino-dht-bootstrapper/issues/18#issuecomment-2101025419

I will be adding various observations here while looking around at connection open and close code in js-libp2p

Destroying connections on MultiaddrConnection.close

1. timeout events not handled

https://github.com/libp2p/js-libp2p/blob/4ad63bb79c0c2e5c670b32aa534906b923dcf150/packages/transport-tcp/src/socket-to-conn.ts#L158-L178

When maConn.close is called, we try to explicitly close a socket, handling close and error events, but not timeout events which can definitely occur.

From https://nodejs.org/docs/latest-v20.x/api/net.html#socketsettimeouttimeout-callback

When an idle timeout is triggered the socket will receive a 'timeout' event but the connection will not be severed. The user must manually call socket.end() or socket.destroy() to end the connection.

This seems like it could easily lead to TCPSocketWrap continuing to increase.

Questions
  1. By calling setTimeout again, does it still allow the callback defined at https://github.com/libp2p/js-libp2p/blob/4ad63bb79c0c2e5c670b32aa534906b923dcf150/packages/transport-tcp/src/socket-to-conn.ts#L62C1-L75C5 to trigger? (it seems like that's the intention).
  2. If the original timeout event handler is called, and then we call maConn.close, will that same handler be called again or is it removed?

2. Manual socket.destroySoon logic

https://github.com/libp2p/js-libp2p/blob/4ad63bb79c0c2e5c670b32aa534906b923dcf150/packages/transport-tcp/src/socket-to-conn.ts#L183-L194

I'm not sure if we're intentionally not using socket.destroySoon but AFAIK, the only thing we would be missing by switching this code to socket.destroySoon is the log('%s socket drained', lOptsStr) log line, which we could easily add up above.

3. Socket 'drop' event

https://github.com/libp2p/js-libp2p/blob/4ad63bb79c0c2e5c670b32aa534906b923dcf150/packages/transport-tcp/test/max-connections.spec.ts#L64-L66

We have some tests that test our maxConnections, but we have no code that handles the 'drop' event in https://github.com/libp2p/js-libp2p/blob/4ad63bb79c0c2e5c670b32aa534906b923dcf150/packages/transport-tcp/src/socket-to-conn.ts.

Questions
  1. Is it possible that when reaching maxConnections & experiencing socket aborts/errors/timeouts, some of those dropped Sockets are not closed out properly?

4. Socket.unref

I don't see that we're calling socket.unref anywhere, and adding it kind of feels like a hack, but I think there are a few locations maConn.close and others where we could call socket.unref() to make sure we're explicit about when the socket is no longer needed.

e.g. once we call maConn.close and 'drain' is emitted, socket.unref() could be safely called?

Holistic Socket management

1. Transport class vs toMaConn socket event cleanup

https://github.com/libp2p/js-libp2p/blob/4ad63bb79c0c2e5c670b32aa534906b923dcf150/packages/transport-tcp/src/index.ts#L242-L245

TCP Transport class cleans up events it adds to sockets, why doesn't socket-to-conn.ts or listener.ts?

Questions
  1. Should we clean up allEventListeners in the TCP transport class?
  2. should socket-to-conn.ts be removing event handlers it adds, and if so, when?
  3. listener.ts doesn't remove event handlers either, should it?
SgtPooki commented 1 month ago

After running js-libp2p-amino-dht-bootstrapper over the weekend, starting with heapsnapshot->gc->heapsnapshot, then taking snapshots every 6 hours, and finishing this morning with heap->gc->heap, memlab gives the following result for analyze unbound-object:

╰─ ✔ ❯ ls -lhatr snapshot-dir
total 4.7G
-rw-r--r--  1 sgtpooki    0 May 10 19:00 .gitkeep
drwxr-xr-x 35 sgtpooki 1.1K May 10 19:07 ..
-rw-------  1 sgtpooki  37M May 10 19:20 2024-05-11T02:20:05.589Z.heapsnapshot
-rw-------  1 sgtpooki  42M May 10 19:20 2024-05-11T02:20:14.463Z.heapsnapshot
-rw-------  1 sgtpooki 129M May 10 19:22 2024-05-11T02:22:16.103Z.heapsnapshot
-rw-------  1 sgtpooki 120M May 11 00:00 2024-05-11T07:00:00.864Z.heapsnapshot
-rw-------  1 sgtpooki 197M May 11 06:00 2024-05-11T13:00:00.809Z.heapsnapshot
-rw-------  1 sgtpooki 314M May 11 12:00 2024-05-11T19:00:00.832Z.heapsnapshot
-rw-------  1 sgtpooki 337M May 11 18:00 2024-05-12T01:00:01.187Z.heapsnapshot
-rw-------  1 sgtpooki 391M May 12 00:00 2024-05-12T07:00:00.614Z.heapsnapshot
-rw-------  1 sgtpooki 483M May 12 06:00 2024-05-12T13:00:00.576Z.heapsnapshot
-rw-------  1 sgtpooki 438M May 12 12:00 2024-05-12T19:00:00.633Z.heapsnapshot
-rw-------  1 sgtpooki 468M May 12 18:00 2024-05-13T01:00:00.297Z.heapsnapshot
-rw-------  1 sgtpooki 497M May 13 00:00 2024-05-13T07:00:00.871Z.heapsnapshot
-rw-------  1 sgtpooki 466M May 13 06:00 2024-05-13T13:00:00.403Z.heapsnapshot
-rw-------  1 sgtpooki 473M May 13 07:50 2024-05-13T14:50:13.471Z.heapsnapshot
drwxr-xr-x 18 sgtpooki  576 May 13 07:51 .
-rw-------  1 sgtpooki 235M May 13 07:51 2024-05-13T14:51:12.831Z.heapsnapshot

╰─ ✘ INT ❯ npx -y memlab analyze unbound-object --snapshot-dir snapshot-dir
Top growing objects in sizes:
 (Use `memlab trace --node-id=@ID` to get trace)

· system / Context [object](@387349):  3.3MB > 3.3MB > 3.3MB > 3.3MB > 3.3MB > 3.3MB > 3.3MB > 3.3MB > 3.3MB > 3.3MB > 3.3MB > 3.3MB > 3.3MB > 3.3MB > 3.3MB
· MplexStreamMuxer [object](@347089):  32KB > 47KB > 47.2KB > 49.2KB > 150.7KB > 168.4KB > 231.4KB > 318KB > 423KB > 510KB > 527.4KB > 562.1KB > 596.7KB > 614.2KB > 614.2KB
· Object { initiators, receivers } [object](@347195):  21.1KB > 38KB > 38KB > 31.3KB > 132.8KB > 150.4KB > 213.4KB > 300KB > 405.1KB > 492.1KB > 509.5KB > 544.2KB > 578.7KB > 596.2KB > 596.3KB
· Map [object](@347355):  20.7KB > 37.7KB > 37.7KB > 31KB > 132.6KB > 150.2KB > 213.2KB > 299.8KB > 404.9KB > 491.9KB > 509.3KB > 544KB > 578.5KB > 596KB > 596.1KB
· system / Context [object](@122957):  531.8KB > 532KB > 532.1KB > 532.1KB > 532.1KB > 532.1KB > 532.1KB > 532.1KB > 532.1KB > 532.1KB > 532.1KB > 532.1KB > 532.1KB > 532.1KB > 532.1KB
· ChaCha20Poly1305 [object](@122961):  530KB > 530.1KB > 530.1KB > 530.1KB > 530.1KB > 530.1KB > 530.1KB > 530.1KB > 530.1KB > 530.1KB > 530.1KB > 530.1KB > 530.1KB > 530.1KB > 530.1KB
· system / Context [object](@443697):  3.7KB > 3.7KB > 3.7KB > 352.1KB > 352.1KB > 352.1KB > 352.1KB > 352.1KB > 352.1KB > 352.1KB > 352.1KB > 352.1KB > 352.1KB > 352.1KB > 352.1KB
· Map [object](@709657):  184 bytes > 184 bytes > 184 bytes > 348.3KB > 348.3KB > 348.3KB > 348.3KB > 348.3KB > 348.3KB > 348.3KB > 348.3KB > 348.3KB > 348.3KB > 348.3KB > 348.3KB
· Providers [object](@382375):  5.9KB > 27.6KB > 89.9KB > 390KB > 362.5KB > 374.8KB > 278.4KB > 227.4KB > 214.7KB > 226.7KB > 282.6KB > 259.7KB > 275.1KB > 206.3KB > 217.6KB
· RoutingTable [object](@350995):  5.6KB > 52.6KB > 6.2KB > 6.6KB > 6.6KB > 6.6KB > 297.6KB > 274KB > 6.6KB > 6.6KB > 303.9KB > 275.4KB > 284.9KB > 235.8KB > 217.5KB
· Object { clear, get, has, remove, set } [object](@382403):  1.8KB > 23.3KB > 85.7KB > 385.5KB > 358.4KB > 370.3KB > 273.9KB > 222.9KB > 210.2KB > 222.2KB > 278KB > 255KB > 270.4KB > 201.7KB > 212.9KB
· KBucket [object](@351629):  33.3KB > 46.7KB > 54.3KB > 249.1KB > 254KB > 146.7KB > 291KB > 267.3KB > 262KB > 262KB > 297.2KB > 268.8KB > 278.3KB > 229.1KB > 210.9KB
· system / Context [object](@382739):  1.4KB > 22.2KB > 84.5KB > 384KB > 356.5KB > 368.3KB > 271.8KB > 220.7KB > 207.9KB > 220KB > 275.8KB > 252.8KB > 268.2KB > 199.4KB > 210.7KB
· Object { contacts, dontSplit, left, right } [object](@382903):  28.2KB > 41.2KB > 49KB > 243.2KB > 247.6KB > 141.1KB > 284.1KB > 260.6KB > 255.6KB > 255.6KB > 290.2KB > 261.9KB > 271.2KB > 222.5KB > 203.8KB
· Object { contacts, dontSplit, left, right } [object](@382943):  21.9KB > 34.6KB > 39KB > 205.9KB > 225.7KB > 132.5KB > 245.1KB > 223.2KB > 215.1KB > 206.4KB > 256.2KB > 221.8KB > 232.4KB > 194.8KB > 177.2KB
· AsyncGenerator [object](@59297):  33.8KB > 33.9KB > 113.3KB > 174.9KB > 174.9KB > 174.9KB > 174.9KB > 174.9KB > 174.9KB > 174.9KB > 174.9KB > 174.9KB > 174.9KB > 174.9KB > 174.9KB
· BuiltinModule [closure](@399391):  159.4KB > 163.9KB > 163.9KB > 163.9KB > 163.9KB > 163.9KB > 163.9KB > 163.9KB > 163.9KB > 159.4KB > 163.9KB > 163.9KB > 163.9KB > 159.4KB > 163.9KB
· Map [object](@433067):  158.1KB > 162.7KB > 162.7KB > 162.7KB > 162.7KB > 162.7KB > 162.7KB > 162.7KB > 162.7KB > 158.1KB > 162.7KB > 162.7KB > 162.7KB > 158.1KB > 162.7KB
· Object { contacts, dontSplit, left, right } [object](@272937):  17.3KB > 27.6KB > 27.6KB > 173.5KB > 208.6KB > 117.2KB > 214.7KB > 204.2KB > 190.8KB > 194.4KB > 218.4KB > 183.9KB > 191.7KB > 168.9KB > 150.5KB
· Object { contacts, dontSplit, left, right } [object](@272941):  5.3KB > 7.7KB > 22KB > 123.7KB > 153.8KB > 105.9KB > 159.5KB > 173.4KB > 185.1KB > 173.6KB > 161.4KB > 153.3KB > 157.3KB > 133.1KB > 138.6KB

I don't see anything obvious here.