waku-org / nwaku

Waku node and protocol.

bug: peer exchange returns nodes that no longer exist. #2414

Closed: richard-ramos closed this issue 2 days ago

richard-ramos commented 6 months ago

@chaitanyaprem: while on a debugging session with the Status team, we discovered that attempting to use lightpush to publish a message would sometimes fail even after 5 retries. On further debugging we noticed that each time, a random peer was selected for lightpush and the sender was unable to dial it. All these peers are returned via peer exchange from the boot nodes.

The most probable cause, considering that the connection failed with a dial timeout in each scenario, is that these nodes are no longer available at all; otherwise the TCP connection would have been established and the failure would have happened at the p2p level.

12220 ERROR[02-09|07:17:44.125|github.com/status-im/status-go/vendor/github.com/waku-org/go-waku/waku/v2/protocol/lightpush/waku_lightpush.go:309] could not publish message hash=0x2fe09b470bc8f13be52abd09930a5f3eb6ae274ae4678a985dd679edc6deb7c1 pubsubTopic=/waku/2/rs/16/32 contentTopic=/waku/1/0x4cdd76ba/rfc26 timestamp=1,707,463,059,122,842,165 peerID=16Uiu2HAmN2n45zWeNhXtQis1kp7Wrun4pz1NFhDi9uBZQaarmBLq error="failed to dial: failed to dial 16Uiu2HAmN2n45zWeNhXtQis1kp7Wrun4pz1NFhDi9uBZQaarmBLq:\n  * [/ip4/95.217.224.92/tcp/44647] dial tcp4 0.0.0.0:39663->95.217.224.92:44647: i/o timeout"

Further discussion on Discord with @alrevuelta revealed that every 15 minutes nwaku caches peers that are reachable (connected successfully in the past) and serves them in peer exchange responses: https://github.com/waku-org/nwaku/blob/master/waku/waku_peer_exchange/protocol.nim#L118-L131. However, the criterion used to select reachable peers is that their connectedness is either Connected or CanConnect, which per https://github.com/waku-org/nwaku/blob/master/waku/node/peer_manager/peer_manager.nim#L395-L397 means that a peer that was once connected will be considered CanConnect forever, and can therefore keep being selected for peer exchange responses even if the node has not been seen since.
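Roughly, the current selection boils down to something like this (a simplified Nim sketch, not the actual nwaku code):

```nim
# Simplified sketch of the current criterion (not the actual nwaku code):
# once a peer has connected successfully it stays CanConnect indefinitely,
# so it remains eligible for peer exchange responses even if it is long gone.
type
  Connectedness = enum
    NotConnected, CannotConnect, CanConnect, Connected

proc eligibleForPx(c: Connectedness): bool =
  # No freshness check here: CanConnect never expires.
  c in {Connected, CanConnect}
```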

A potential solution suggested by @alrevuelta is to cache only Connected peers, but that has some privacy concerns (graph learning attacks). Alternatively, we could add a lastSuccessfulConn timestamp (similar to the existing lastFailedConn) and then filter based on that, i.e. only allow a few hours of discrepancy.
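For illustration, roughly what that filter could look like (a minimal Nim sketch with hypothetical field and type names, much simplified compared to the real peer store):

```nim
# Hypothetical sketch: filter PX candidates by how recently a connection to
# them actually succeeded, instead of relying on a CanConnect flag that
# never expires.
import std/[times, sequtils]

type
  PeerRecord = object            # simplified stand-in for a peer-store entry
    peerId: string
    lastSuccessfulConn: Time     # proposed field, analogous to lastFailedConn

const maxStaleness = initDuration(hours = 3)   # "a few hours of discrepancy"

proc freshEnough(p: PeerRecord, now: Time): bool =
  # A peer stays eligible for a PX response only if we connected to it recently.
  now - p.lastSuccessfulConn <= maxStaleness

proc selectPxCandidates(peers: seq[PeerRecord]): seq[PeerRecord] =
  let now = getTime()
  peers.filterIt(freshEnough(it, now))
```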

chaitanyaprem commented 5 months ago

One thing we noticed after further debugging is that discv5 was being run in lightClient mode in status-mobile, which must have caused fleet nodes to return light nodes as peers as well. And since this was happening in mobile CI, where a lot of nodes get created and stopped (as tests proceed), these nodes must have become stale.

But as mentioned above, stale peers are not being removed as quickly as expected, which is likely in a network where peerIDs keep changing on every restart. That still needs to be addressed.

chaitanyaprem commented 2 months ago

As part of Status dogfooding I have noticed that many peers returned by peer exchange or discv5 are not reachable when a light client tries to connect to them.

I am wondering whether the way we are checking if a peer is reachable is the right one. https://github.com/waku-org/nwaku/blob/93e9ba22aa01121217fd7e44d592ae30b8074599/waku/waku_peer_exchange/protocol.nim#L131-L136 is the peer exchange cache code that filters for reachable peers.

https://github.com/waku-org/nwaku/blob/93e9ba22aa01121217fd7e44d592ae30b8074599/waku/node/peer_manager/waku_peer_store.nim#L145-L148 seems to be used to determine if a peer is reachable.

But it only checks whether a peer has connected to the node currently or in the past. This doesn't guarantee that anyone can connect to the peer directly, because the peer could be behind a strict NAT, or may be exposing a circuit-relay address as part of its ENR. Such things can only be verified if the nwaku node itself tries to connect to the peer and then records the reachability status.
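Something along these lines, as a rough sketch (hypothetical helpers, not the real peer manager API): the node dials the peer itself and records the outcome with a timestamp, rather than inferring reachability from connectedness bookkeeping alone:

```nim
# Hypothetical sketch: actively verify reachability and record the result.
import std/[times, tables]

type
  Reachability = enum
    Unknown, Reachable, Unreachable
  ReachabilityEntry = object
    status: Reachability
    checkedAt: Time

var reachabilityBook = initTable[string, ReachabilityEntry]()

proc tryDial(peerId: string): bool =
  # Placeholder for a real dial attempt with a timeout (e.g. via libp2p).
  false

proc recordReachability(peerId: string) =
  let ok = tryDial(peerId)
  reachabilityBook[peerId] = ReachabilityEntry(
    status: (if ok: Reachable else: Unreachable),
    checkedAt: getTime())
```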

Not sure how discv5 is determining if a peer is reachable or not. That also needs to be verified.

cc @richard-ramos @gabrielmer

This is different from what is reported as part of https://github.com/waku-org/nwaku/issues/2810

gabrielmer commented 2 months ago

seems to be used to determine if a peer is reachable. But it only checks whether a peer has connected to the node currently or in the past. This doesn't guarantee that anyone can connect to the peer directly, because the peer could be behind a strict NAT, or may be exposing a circuit-relay address as part of its ENR. Such things can only be verified if the nwaku node itself tries to connect to the peer and then records the reachability status.

Makes sense! Should we add another filtering round in which we ping each peer before sharing them? The question is whether this extra load should be assumed by the client or the "server".

In other words, if I'm running node A, I'm connected to node B (which is behind a NAT) and node C asks us for peers: should node A share node B's info in good faith, without actually being aware that it's behind a NAT, and let node C figure out by itself whether it works for it or not?

Or should node A do the extra work and verify that all shared nodes are externally reachable, even though it's already connected to them, acting in good faith and against its own interest by doing that extra work?

Apart from the NAT case, I personally like @richard-ramos' suggestion of

maybe add a lastSuccessfulConn (similar to the existing lastFailedConn) and then filter based on that (i.e. only allow a few hours of discrepancy)

So at least we filter out most of the nodes that don't even exist.

Not sure how discv5 is determining if a peer is reachable or not. That also needs to be verified.

Me neither 😶 I looked at the code but we don't handle the core logic behind discv5 on our side. If I'm not mistaken, we pretty much just instantiate it, initialize it and get the nodes. Should we go deep into its implementation details? We don't have much control over it, but if we find there's a bug we can report it to nim-eth.

chaitanyaprem commented 2 months ago

Apart from the NAT case, I personally like @richard-ramos' suggestion of

maybe add a lastSuccessfulConn (similar to the existing lastFailedConn) and then filter based on that (i.e. only allow a few hours of discrepancy)

So at least we filter out most of the nodes that don't even exist.

This seems like a good start.

chaitanyaprem commented 2 months ago

Me neither 😶 I looked at the code but we don't handle the core logic behind discv5 on our side. If I'm not mistaken, we pretty much just instantiate it, initialize it and get the nodes. Should we go deep into its implementation details? We don't have much control over it, but if we find there's a bug we can report it to nim-eth.

Ah, OK, got it. I think the only remaining bug for now is why it is taking so long for a node to be discovered, which anyway seems to be tracked in a separate nwaku issue.

darshankabariya commented 1 month ago

Me neither 😶 I looked at the code but we don't handle the core logic behind discv5 on our side. If I'm not mistaken, we pretty much just instantiate it, initialize it and get the nodes. Should we go deep into its implementation details? We don't have much control over it, but if we find there's a bug we can report it to nim-eth.

Ah, OK, got it. I think the only remaining bug for now is why it is taking so long for a node to be discovered, which anyway seems to be tracked in a separate nwaku issue.

After thoroughly reviewing the codebase, I found that the issue is not with the peer_exchange protocol itself. The problem lies in the poor management of stale peers within the peer_manager. Since peer_exchange fetches its information from the peer_manager, if the peer_manager does not have reliable peers, that will show up as an issue on the peer_exchange side as well. To address this, I have created a separate issue: #2905.

Following a discussion with Hanno, we concluded that reducing the refresh interval from 15 minutes to 3-5 minutes would allow the peer_exchange protocol to fetch updated peer information from the peer_manager more frequently. Additionally, Gabriel suggested adding an extra layer of filtering. One idea is to make a trade-off: if a client requests a small number of peers (e.g., 1-3), we filter them to ensure they are connectable; if a client requests more peers, we share the peer list directly. This approach would use resources more efficiently.

The rationale behind this solution is that if a user asks for only 1-2 peers, no other options remain on the client side, and the client will need to send another request if those peers do not connect, so the node would have to serve that request again. However, if a user requests a bulk of peers (e.g., 5-10), there is a high chance that more than half will connect successfully.
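As a rough sketch of that trade-off (hypothetical names, not the actual PX handler):

```nim
# Hypothetical sketch: verify connectability only for small requests and
# return the cached list directly for larger ones.
import std/sequtils

type CachedPeer = object        # simplified stand-in for a PX cache entry
  peerId: string

const smallRequestThreshold = 3

proc isConnectable(p: CachedPeer): bool =
  # Placeholder for an actual dial/ping check done by the service node.
  true

proc buildPxResponse(cache: seq[CachedPeer], numPeers: int): seq[CachedPeer] =
  let candidates =
    if numPeers <= smallRequestThreshold:
      cache.filterIt(isConnectable(it))   # extra check only for small requests
    else:
      cache                               # bulk requests: share as-is
  candidates[0 ..< min(numPeers, candidates.len)]
```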

What do you think? Any comment is helpful.

@gabrielmer @chaitanyaprem @Ivansete-status @jm-clius

jm-clius commented 1 month ago

I guess what confuses me is that we already only respond with "reachable peers" (in other words apply filtering to all responses, regardless of number of peers requested) or am I missing something? See https://github.com/waku-org/nwaku/blob/e269dca9cdc56cfd3f29cbabe005f737da5aebe6/waku/waku_peer_exchange/protocol.nim#L134

darshankabariya commented 1 month ago

I guess what confuses me is that we already only respond with "reachable peers" (in other words apply filtering to all responses, regardless of number of peers requested) or am I missing something? See

https://github.com/waku-org/nwaku/blob/e269dca9cdc56cfd3f29cbabe005f737da5aebe6/waku/waku_peer_exchange/protocol.nim#L134

Here, a reachable peer means it is either already connected or can be connected (though we are not 100% certain about its liveness: we were connected with that peer in the past, so we put it in the CanConnect category).

https://github.com/waku-org/nwaku/blob/93e9ba22aa01121217fd7e44d592ae30b8074599/waku/node/peer_manager/waku_peer_store.nim#L145-L148


So, we either need to update the reachablePeer function or add a filter when responding. I believe we should add the filter during the response phase, because this function is also used in other places.

@jm-clius

jm-clius commented 1 month ago

As discussed in the meeting, I would avoid adding another layer of filtering that may involve active pinging on the service node side - not only will this delay responses significantly, it will blur the line even further between the peer exchange service and the node's own connectivity information (the latter of which should not be divulged for security reasons). I think we should first evaluate how much just increasing the px-cache refresh rate will improve things.
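For reference, what that first step amounts to, as a minimal sketch (hypothetical constant and proc names, not the actual nwaku loop): keep serving from the cache as today, but refresh it every few minutes instead of every 15 so that stale entries age out of responses sooner.

```nim
# Hypothetical sketch of a shorter px-cache refresh interval.
import chronos   # the async framework nwaku uses

const
  currentRefreshInterval = 15.minutes    # behaviour described earlier in this issue
  proposedRefreshInterval = 3.minutes    # lower end of the suggested 3-5 minute range

proc updatePxCache() {.async.} =
  # Placeholder: re-select candidate peers from the peer manager.
  discard

proc pxCacheRefreshLoop() {.async.} =
  while true:
    await updatePxCache()
    await sleepAsync(proposedRefreshInterval)
```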