waku-org / nwaku

Waku node and protocol.
Other
198 stars 51 forks source link

bug: mplex error upon reconnection #1485

Open fryorcraken opened 1 year ago

fryorcraken commented 1 year ago

Problem

js-waku (js-libp2p) automatically reconnects to remote nodes after disconnection/network access loss.

I can see the re-connection happening but it fails during the identify protocol.

I am still trying to pin down the exact error but would be keen to know if nwaku logs (from the fleet) can be extracted using the local timestamps.

Also keen to know what information I need to provide to help extraction of said logs.

Impact

I am adding critical label because disconnection has been the main issue platforms have complained about js-waku. Now that it makes to focus on it (refactor of js-waku), I would like this to be resolved.

To reproduce

If you go to https://examples.waku.org/eth-pm/ stop your internet access (wifi off) and on. You should be able to see that the Waku node tries to reconnect and successful does a libp2p noise handshake but fail during the identify protocol.

Expected behavior

Reconnection is successful

Screenshots/logs

Here are some logs, feel free to let me know what kind of information would be needed to find the nwaku fleet node logs:

09:53:58.071 libp2p:websockets outbound connection /dns4/node-01.ac-cn-hongkong-c.wakuv2.prod.statusim.net/tcp/8000/wss/p2p/16Uiu2HAm4v86W3bmT1BiH6oSPzcsSr24iDQpSN5Qa992BCjjwgrD upgraded +2s [common.js:108](http://localhost:3004/home/fryorcraken/src/waku-org/js-waku-examples/examples/eth-pm/node_modules/debug/src/common.js "View source in Debugger → http://localhost:3004/home/fryorcraken/src/waku-org/js-waku-examples/examples/eth-pm/node_modules/debug/src/common.js:108")  

09:53:58.073 libp2p:peer-store:key-book:trace set got write lock +3ms [common.js:108](http://localhost:3004/home/fryorcraken/src/waku-org/js-waku-examples/examples/eth-pm/node_modules/debug/src/common.js "View source in Debugger → http://localhost:3004/home/fryorcraken/src/waku-org/js-waku-examples/examples/eth-pm/node_modules/debug/src/common.js:108")  

09:53:58.075 libp2p:dialer token 17 released +7s [common.js:108](http://localhost:3004/home/fryorcraken/src/waku-org/js-waku-examples/examples/eth-pm/node_modules/debug/src/common.js "View source in Debugger → http://localhost:3004/home/fryorcraken/src/waku-org/js-waku-examples/examples/eth-pm/node_modules/debug/src/common.js:108")  

09:53:58.076 libp2p:peer-store:key-book:trace set release write lock +4ms [common.js:108](http://localhost:3004/home/fryorcraken/src/waku-org/js-waku-examples/examples/eth-pm/node_modules/debug/src/common.js "View source in Debugger → http://localhost:3004/home/fryorcraken/src/waku-org/js-waku-examples/examples/eth-pm/node_modules/debug/src/common.js:108")  

09:53:58.077 libp2p:connection-manager:trace checking limit of maxConnections. current value: 1 of 300 +4s [common.js:108](http://localhost:3004/home/fryorcraken/src/waku-org/js-waku-examples/examples/eth-pm/node_modules/debug/src/common.js "View source in Debugger → http://localhost:3004/home/fryorcraken/src/waku-org/js-waku-examples/examples/eth-pm/node_modules/debug/src/common.js:108")  

09:53:58.078 libp2p:upgrader outbound: starting new stream on /ipfs/id/1.0.0 +10ms [common.js:108](http://localhost:3004/home/fryorcraken/src/waku-org/js-waku-examples/examples/eth-pm/node_modules/debug/src/common.js "View source in Debugger → http://localhost:3004/home/fryorcraken/src/waku-org/js-waku-examples/examples/eth-pm/node_modules/debug/src/common.js:108")  

09:53:58.079 libp2p:mplex new initiator stream 0 +15s [common.js:108](http://localhost:3004/home/fryorcraken/src/waku-org/js-waku-examples/examples/eth-pm/node_modules/debug/src/common.js "View source in Debugger → http://localhost:3004/home/fryorcraken/src/waku-org/js-waku-examples/examples/eth-pm/node_modules/debug/src/common.js:108")  

09:53:58.079 libp2p:dialer dial succeeded to 16Uiu2HAm4v86W3bmT1BiH6oSPzcsSr24iDQpSN5Qa992BCjjwgrD +4ms [common.js:108](http://localhost:3004/home/fryorcraken/src/waku-org/js-waku-examples/examples/eth-pm/node_modules/debug/src/common.js "View source in Debugger → http://localhost:3004/home/fryorcraken/src/waku-org/js-waku-examples/examples/eth-pm/node_modules/debug/src/common.js:108")  

09:53:58.081 libp2p:mplex:trace initiator stream 0 send +28s

Object { id: 0, type: "NEW_STREAM (0)", data: "0" }

[common.js:108](http://localhost:3004/home/fryorcraken/src/waku-org/js-waku-examples/examples/eth-pm/node_modules/debug/src/common.js "View source in Debugger → http://localhost:3004/home/fryorcraken/src/waku-org/js-waku-examples/examples/eth-pm/node_modules/debug/src/common.js:108")  

09:53:58.082 libp2p:mss:select select: write ["/multistream/1.0.0", "/ipfs/id/1.0.0"] +16ms [common.js:108](http://localhost:3004/home/fryorcraken/src/waku-org/js-waku-examples/examples/eth-pm/node_modules/debug/src/common.js "View source in Debugger → http://localhost:3004/home/fryorcraken/src/waku-org/js-waku-examples/examples/eth-pm/node_modules/debug/src/common.js:108")  

09:53:58.085 libp2p:mplex:trace initiator stream 0 send +3ms

Object { id: 0, type: "MESSAGE_INITIATOR (2)", data: "132f6d756c746973747265616d2f312e302e300a0f2f697066732f69642f312e302e300a" }

[common.js:108](http://localhost:3004/home/fryorcraken/src/waku-org/js-waku-examples/examples/eth-pm/node_modules/debug/src/common.js "View source in Debugger → http://localhost:3004/home/fryorcraken/src/waku-org/js-waku-examples/examples/eth-pm/node_modules/debug/src/common.js:108")  

09:53:58.116 libp2p:mplex:stream:trace initiator stream 0 close +28s [common.js:108](http://localhost:3004/home/fryorcraken/src/waku-org/js-waku-examples/examples/eth-pm/node_modules/debug/src/common.js "View source in Debugger → http://localhost:3004/home/fryorcraken/src/waku-org/js-waku-examples/examples/eth-pm/node_modules/debug/src/common.js:108")  

09:53:58.117 libp2p:mplex:stream:trace initiator stream 0 closeRead +0ms [common.js:108](http://localhost:3004/home/fryorcraken/src/waku-org/js-waku-examples/examples/eth-pm/node_modules/debug/src/common.js "View source in Debugger → http://localhost:3004/home/fryorcraken/src/waku-org/js-waku-examples/examples/eth-pm/node_modules/debug/src/common.js:108")  

09:53:58.118 libp2p:mplex:stream:trace initiator stream 0 source end - err: undefined +1ms [common.js:108](http://localhost:3004/home/fryorcraken/src/waku-org/js-waku-examples/examples/eth-pm/node_modules/debug/src/common.js "View source in Debugger → http://localhost:3004/home/fryorcraken/src/waku-org/js-waku-examples/examples/eth-pm/node_modules/debug/src/common.js:108")  

09:53:58.118 libp2p:mplex:stream:trace initiator stream 0 closeWrite +1ms [common.js:108](http://localhost:3004/home/fryorcraken/src/waku-org/js-waku-examples/examples/eth-pm/node_modules/debug/src/common.js "View source in Debugger → http://localhost:3004/home/fryorcraken/src/waku-org/js-waku-examples/examples/eth-pm/node_modules/debug/src/common.js:108")  

09:53:58.119 libp2p:mplex:trace initiator stream 0 send +34ms

Object { id: 0, type: "CLOSE_INITIATOR (4)" }

[common.js:108](http://localhost:3004/home/fryorcraken/src/waku-org/js-waku-examples/examples/eth-pm/node_modules/debug/src/common.js "View source in Debugger → http://localhost:3004/home/fryorcraken/src/waku-org/js-waku-examples/examples/eth-pm/node_modules/debug/src/common.js:108")  

09:53:58.120 libp2p:mplex:stream:trace initiator stream 0 sink end - err: undefined +1ms [common.js:108](http://localhost:3004/home/fryorcraken/src/waku-org/js-waku-examples/examples/eth-pm/node_modules/debug/src/common.js "View source in Debugger → http://localhost:3004/home/fryorcraken/src/waku-org/js-waku-examples/examples/eth-pm/node_modules/debug/src/common.js:108")  

09:53:58.120 libp2p:mplex initiator stream with id 0 and protocol undefined ended +42ms [common.js:108](http://localhost:3004/home/fryorcraken/src/waku-org/js-waku-examples/examples/eth-pm/node_modules/debug/src/common.js "View source in Debugger → http://localhost:3004/home/fryorcraken/src/waku-org/js-waku-examples/examples/eth-pm/node_modules/debug/src/common.js:108")  

09:53:58.123 libp2p:peer-store:proto-book:trace get wait for read lock +15s [common.js:108](http://localhost:3004/home/fryorcraken/src/waku-org/js-waku-examples/examples/eth-pm/node_modules/debug/src/common.js "View source in Debugger → http://localhost:3004/home/fryorcraken/src/waku-org/js-waku-examples/examples/eth-pm/node_modules/debug/src/common.js:108")  

09:53:58.125 libp2p:mplex error in sink +4ms Error: The operation was aborted

AbortError abort-error.js:3  

nextAbortHandler index.js:29  

abortHandler index.js:10  

onAbort index.js:10  

close mplex.js:107  

close upgrader.js:452  

close index.js:92  

set upgrader.js:415  

set upgrader.js:424  

node_modules bundle.js:150950  

socketToMaConn socket-to-conn.js:46  

dial index.js:26  

dial transport-manager.js:79  

dialAction index.js:184  

run dial-request.js:70  

run dial-request.js:56  

_createPendingDial index.js:208  

dial index.js:98  

openConnection index.js:344  

_autoDial auto-dialler.js:92  

_timerWrapper retimer.js:18  

setTimeout handler*Retimer retimer.js:21  

retimer retimer.js:65  

_autoDial auto-dialler.js:102  

_timerWrapper retimer.js:18  

setTimeout handler*Retimer retimer.js:21  

retimer retimer.js:65  

_autoDial auto-dialler.js:102  

_timerWrapper retimer.js:18  

setTimeout handler*Retimer retimer.js:21  

retimer retimer.js:65  

_autoDial auto-dialler.js:102  

_timerWrapper retimer.js:18  

setTimeout handler*Retimer retimer.js:21  

retimer retimer.js:65  

_autoDial auto-dialler.js:102  

_timerWrapper retimer.js:18  

setTimeout handler*Retimer retimer.js:21  

retimer retimer.js:65  

_autoDial auto-dialler.js:102  

_timerWrapper retimer.js:18  

setTimeout handler*Retimer retimer.js:21  

retimer retimer.js:65  

_autoDial auto-dialler.js:102  

[common.js:108](http://localhost:3004/home/fryorcraken/src/waku-org/js-waku-examples/examples/eth-pm/node_modules/debug/src/common.js "View source in Debugger → http://localhost:3004/home/fryorcraken/src/waku-org/js-waku-examples/examples/eth-pm/node_modules/debug/src/common.js:108")  

09:53:58.126 libp2p:peer-store:proto-book:trace get got read lock +3ms [common.js:108](http://localhost:3004/home/fryorcraken/src/waku-org/js-waku-examples/examples/eth-pm/node_modules/debug/src/common.js "View source in Debugger → http://localhost:3004/home/fryorcraken/src/waku-org/js-waku-examples/examples/eth-pm/node_modules/debug/src/common.js:108")  

09:53:58.130 libp2p:peer-store:proto-book:trace get release read lock +3ms [common.js:108](http://localhost:3004/home/fryorcraken/src/waku-org/js-waku-examples/examples/eth-pm/node_modules/debug/src/common.js "View source in Debugger → http://localhost:3004/home/fryorcraken/src/waku-org/js-waku-examples/examples/eth-pm/node_modules/debug/src/common.js:108")  

09:53:58.131 libp2p:upgrader:error could not create new stream +0ms Error: stream ended before 1 bytes became available

reader index.js:54  

[common.js:108](http://localhost:3004/home/fryorcraken/src/waku-org/js-waku-examples/examples/eth-pm/node_modules/debug/src/common.js "View source in Debugger → http://localhost:3004/home/fryorcraken/src/waku-org/js-waku-examples/examples/eth-pm/node_modules/debug/src/common.js:108")  

09:53:58.133 libp2p:gossipsub connection ended 16Uiu2HAm4v86W3bmT1BiH6oSPzcsSr24iDQpSN5Qa992BCjjwgrD +15s [common.js:108](http://localhost:3004/home/fryorcraken/src/waku-org/js-waku-examples/examples/eth-pm/node_modules/debug/src/common.js "View source in Debugger → http://localhost:3004/home/fryorcraken/src/waku-org/js-waku-examples/examples/eth-pm/node_modules/debug/src/common.js:108")  

09:53:58.135 libp2p:identify:error ./node_modules/it-reader/dist/src/index.js/reader/reader<@http://localhost:3004/eth-pm/static/js/bundle.js:164171:27 +0ms

nwaku version/commit hash

Whatever is deployed on /dns4/node-01.ac-cn-hongkong-c.wakuv2.prod.statusim.net/tcp/8000/wss/p2p/16Uiu2HAm4v86W3bmT1BiH6oSPzcsSr24iDQpSN5Qa992BCjjwgrD at the time of the issue creation.

fryorcraken commented 1 year ago

Note that this happened because my laptop (running js-waku in the browser) went to sleep. However, I was not able to reproduce the issue when I disconnect the browser from the internet (libp2p does not seem to notice the disconnection).

Removing the critical label as it seems to be an edge case.

alrevuelta commented 1 year ago

Thanks for reporting!

I am still trying to pin down the exact error but would be keen to know if nwaku logs (from the fleet) can be extracted using the local timestamps.

With this kibana link&_a=(columns:!(message),filters:!(('$state':(store:appState),meta:(alias:!n,disabled:!f,index:ffcc22b0-0116-11ed-9719-cdd3b483481c,key:fleet,negate:!f,params:!(wakuv2.prod),type:phrases),query:(bool:(minimum_should_match:1,should:!((match_phrase:(fleet:wakuv2.prod)))))),('$state':(store:appState),meta:(alias:!n,disabled:!f,index:ffcc22b0-0116-11ed-9719-cdd3b483481c,key:logsource,negate:!f,params:(query:node-01.ac-cn-hongkong-c.wakuv2.prod),type:phrase),query:(match_phrase:(logsource:node-01.ac-cn-hongkong-c.wakuv2.prod)))),index:ffcc22b0-0116-11ed-9719-cdd3b483481c,interval:auto,query:(language:kuery,query:''),sort:!(!('@timestamp',desc)))) you should be able get the logs from node-01.ac-cn-hongkong-c.wakuv2.prod. Let me know if the link doesn't work.

Also keen to know what information I need to provide to help extraction of said logs.

Perhaps your peerId can help, since logs are quite verbose. With that it should be easier to search. We can also try to search based on timestamp, I can see it says 09:53:58.126. Is this UTC or your local time?

alrevuelta commented 1 year ago

Summarizing here Tanguy comments:

Looking for "Too many connections for peer" in the logs helps to know when this is happening on nwaku side.

Its expected, since the maximum number of connections per peer is 1 maxConnectionsPerPeer = 1. So when a peer that goes offline (without properly disconnecting) tries to reconnect again, the old connection is still tracked, so the new one is rejected. It can take up to 10 minutes for this connection to be cleared. Here's an example conns=2.

DBG 2023-01-12 09:22:51.231+00:00 Too many connections for peer              topics="libp2p connmanager" tid=1 file=connmanager.nim:393 conn=16U*dz5wdf:63bfd16b81b6ce33c2b2d928 conns=2

Note sure how easy would be to reproduce this scenario, but lets say that I'm connected to like 50 nodes and suddenly my internet is gone. Since no disconnection was made, that connection is still being kept as my connection by all 50 peers. So if I go online later in 5 minutes, I wont be able to connect to any of these peers? Could be a problem in a network without many peers (total_peers_in_network < max_peer_connections), so it can happen that the node can't connect to any peer during a max of 10 minutes.

fryorcraken commented 1 year ago

Perhaps your peerId can help, since logs are quite verbose. With that it should be easier to search. We can also try to search based on timestamp, I can see it says 09:53:58.126. Is this UTC or your local time?

Noted. Will provide peerId and clarify time (local) next time.

I'm connected to like 50 nodes and suddenly my internet is gone. Since no disconnection was made, that connection is still being kept as my connection by all 50 peers. So if I go online later in 5 minutes, I wont be able to connect to any of these peers? Could be a problem in a network without many peers (total_peers_in_network < max_peer_connections), so it can happen that the node can't connect to any peer during a max of 10 minutes.

I would not expect a js-waku peer to connection to 50 peers but yes, I agree with you.

@danisharora099 is starting to look into peer management on js-waku side. We will have to investigate browser/libp2p behaviour around disconnection/signal loss.

On nwaku side, does it make sense to timeout a connection after 10min? Also, what would be the impact if the timeout is reduced? Finally, how does the timeout works? ie, if data is sent over the connection, does it delays the timeout closure? However clever is the connection management on this aspect?

danisharora099 commented 1 year ago

Update: I can confirm that this does not happen by manually going offline but there's a weird inconsistent bug that is seen: the browser that goes offline can send messages after reconnection but can't send messages or the receiving is random.

Steps I followed:

  1. connected two wallets on different browsers (w1, w2)
  2. broadcasted pub keys and sent each other messages successfully
  3. went offline on browser w1 (chrome dev tools) and sent messages from both sides
  4. received the message on w2 but not on w1
  5. went back online and sent a message again

It was now seen that on spamming messages from w2, w1 seemed to receive one message of 5-6 messages sent while w2 seems to receive almost all messages sent.

I will post a consistently reproducible set of steps once I find it

Update: sending and receiving messages on eth-pm is highly inconsistent for me. Can't seem to put down a trend, but messages are randomly lost.

vpavlin commented 1 year ago

@danisharora099 are you still looking into this?

danisharora099 commented 3 months ago

@danisharora099 are you still looking into this?

(thanks @Ivansete-status for the ping)

not particularly this, but we do have some share of issues that track similar work: https://github.com/waku-org/pm/issues/143 (check liveness of a node)

also note that we have done some work on js-waku side (like keep-alive management, filter pings, etc) that also go intro improving reliability of this.

as for the current status of the original issue and the above repro, @adklempner did some investigations around liveness recently that might be more relevant here now