waku-org / js-waku

JavaScript implementation of Waku v2
https://js.waku.org
Apache License 2.0

Peer Management: Connection and Disconnection #914

Closed fryorcraken closed 1 year ago

fryorcraken commented 2 years ago

Planned start date: Due date:

Summary

In a browser environment, loss of connectivity may happen.

When it happens, the application can no longer receive or send messages.

Strategies need to be designed to:

Proposed Solutions

The result of this issue would be a mix of:

At this stage it is not clear what should go in Waku core and what should be a library helper.

Acceptance Criteria

Tasks

Tasks Moved to Out of Scope

Notes

https://github.com/libp2p/js-libp2p/issues/744

RAID (Risks, Assumptions, Issues and Dependencies)

fryorcraken commented 1 year ago

Need to re-groom this issue to take peer discovery into account. Some draft notes (a rough sketch follows them):

    // createLightNode => 2 light push, 2 filter, 1 store
    // dns discovery: emits "bootstrap" nodes
    // catch "bootstrap" node -> peer exchange? connect to it
    // How do we know it has peer exchange? need to extend peerInfo to include `waku2` ENR field or full ENR
    // Already connected to a "bootstrap" node? new node, park it.
    // random order? probably at dns discovery level, because we know when we are "done".
    // Now we receive a `peer-exchange` node
    // -> Do we need this node? What capabilities does it have?
    // -> What nodes are we currently connected to?
    // -> connect, discard or save for later (add to peer store)
    // -> Do a filter query: make it lazy, no wait for remote peer
    // Peer management module: good defaults and configurable/swappable
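
A rough sketch in TypeScript of the decision these notes describe (all names and types are hypothetical, not js-waku's actual API): given a discovered peer and per-protocol targets, decide whether to dial it now or park it for later.

    // Hypothetical sketch only; names and types are illustrative, not js-waku's API.
    type Protocol = "lightpush" | "filter" | "store";

    interface DiscoveredPeer {
      id: string;
      tag: "bootstrap" | "peer-exchange"; // set by the discovery service
      protocols: Protocol[]; // capabilities, e.g. derived from the `waku2` ENR field
    }

    // Default targets from the notes above: 2 light push, 2 filter, 1 store.
    const targets: Record<Protocol, number> = { lightpush: 2, filter: 2, store: 1 };
    const connected: Record<Protocol, number> = { lightpush: 0, filter: 0, store: 0 };
    const parked: DiscoveredPeer[] = []; // saved for later (peer store)

    async function onPeerDiscovered(
      peer: DiscoveredPeer,
      dial: (id: string) => Promise<void>
    ): Promise<void> {
      const needed = peer.protocols.some((p) => connected[p] < targets[p]);
      if (!needed) {
        parked.push(peer); // already have enough peers: save for later
        return;
      }
      await dial(peer.id);
      for (const p of peer.protocols) connected[p]++;
    }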
danisharora099 commented 1 year ago

Since this issue is a considerable overhaul, I'd like to be sure of my interpretation of it.

I think the biggest one is:

* detect a loss of connection

For this, my interpretation is that libp2p's connection manager doesn't emit an event when a peer disconnects / streams are closed (on peer:disconnect) (ref: https://github.com/status-im/status-web/pull/288#discussion_r911672852), and that this will either be something we handle ourselves (similar to https://github.com/status-im/status-web/pull/288 or https://github.com/xmtp/xmtp-js/pull/128) or an upstream change to libp2p (continuation of https://github.com/libp2p/js-libp2p/issues/939 / https://github.com/libp2p/js-libp2p/issues/744). Please let me know if that's a fair interpretation.

Secondly, an explanation of how the following fits with the initially described scope of the PR would also be greatly appreciated.

Need to re-groom this issue to take peer discovery in account. Some draft notes:

    // createLightNode => 2 light push, 2 filter, 1 store
    // dns discovery: emits "bootstrap" nodes
    // catch "bootstrap" node -> peer exchange? connect to it
    // How do we know it has peer exchange? need to extend peerInfo to include `waku2` ENR field or full ENR
    // Already connected to a "bootstrap" node? new node, park it.
    // random order? probably at dns discovery level, because we know when we are "done".
    // Now we receive a `peer-exchange` node
    // -> Do we need this node? What capabilities does it have?
    // -> What nodes are we currently connected to?
    // -> connect, discard or save for later (add to peer store)
    // -> Do a filter query: make it lazy, no wait for remote peer
    // Peer management module: good defaults and configurable/swappable

Other than this, I appreciate that the issue has been written very descriptively and that it will have a great impact!

cc @fryorcraken @felicio

fryorcraken commented 1 year ago

As mentioned, this is a considerable overhaul and will need to be tackled iteratively. Each iteration can increase the functionality/complexity of the manager while taking onboard feedback from dogfooding.

I suggest focusing the first iterations on using the peer discovery protocols DNS discovery and peer exchange, with the following logic:

  1. When using createLightNode, apply default values of wanting 2 light push nodes, 2 filter nodes and 1 store node (user can override these values).
  2. The DNS discovery bootstrap method is used; it discovers and emits nodes that are tagged as bootstrap.
  3. The Connection Manager processes the bootstrap node:
     i. Can connect to it and use it for lightpush/store/filter (we can do that now and later remove this logic)
     ii. Use peer exchange to discover more nodes
  4. The Connection Manager discards/saves for later other bootstrap nodes.
  5. Peer exchange discovers new nodes (using a bootstrap node) and tags them as peer-exchange.
  6. A peer-exchange node is discovered/emitted; connect to it (that's our second light push/filter node).
  7. Connect to more peer-exchange nodes to fulfil the requirements set in (1) if needed.

(3) introduces an issue: the nodes first discovered by DNS discovery will always be the same for a given ENR tree. Hence we may need to introduce some pseudo-random shuffling logic; this should probably be done in the connection manager.
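
One option, sketched below with a hypothetical helper (not the actual implementation): shuffle the peers returned by DNS discovery before handing them to the connection manager, so we do not always dial the first entry of the ENR tree.

    // Hypothetical sketch: pseudo-random (Fisher-Yates) shuffle of discovered peers.
    function shuffle<T>(items: T[]): T[] {
      const result = [...items];
      for (let i = result.length - 1; i > 0; i--) {
        const j = Math.floor(Math.random() * (i + 1));
        [result[i], result[j]] = [result[j], result[i]];
      }
      return result;
    }

    // e.g. pick one bootstrap peer at random instead of always the first one:
    // const [bootstrapPeer] = shuffle(discoveredBootstrapPeers);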

The final result above can already be split in several tasks/PRs.

Then, we can add error resilience on top of it; I'd recommend the following order:

  1. Fail to dial a node: handle this and try another node (easy, as it is all located in the connection manager; a small sketch follows this list).
  2. A protocol fails on a node (e.g. a light push query failed): disconnect from that node and try another one. This is harder because the protocol needs to feed the failure back to the manager, which can then handle the disconnection/connection. Also, the protocol should know whether the error is worth a disconnection. This is not really feasible now (no differentiation in errors) but may become so with the waku store protocol overhaul.
  3. Node disconnects: as you mentioned, we need feedback from libp2p for that, so it may be more difficult.
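
For (1), a minimal sketch of what the connection manager could do (helper name hypothetical): try candidate peers in order until one dial succeeds.

    // Hypothetical sketch: dial candidates one by one until a dial succeeds.
    async function dialFirstAvailable(
      candidates: string[],
      dial: (multiaddr: string) => Promise<void>
    ): Promise<string> {
      for (const addr of candidates) {
        try {
          await dial(addr);
          return addr; // connected, stop here
        } catch {
          // dial failed (e.g. invalid multiaddr, peer unreachable): try the next one
        }
      }
      throw new Error("No dialable peer found");
    }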
danisharora099 commented 1 year ago
  3. Connection Manager processes the bootstrap node

We do two things with the peer; is that what we mean by "process" here?

  • Tag the peer within the PeerStore
  • Dial

i. Can connect to it and use it for lightpush/store/filter (we can do that now and later remove this logic)

  • why do we want to “later remove this logic”?
  4. Connection Manager discards/saves for later other bootstrap nodes
  • How does the ConnectionManager do that? Do we mean the PeerStore/AddressManager?
  • (later, making a note: we might also have to think about the edge cases like the addresses changing for these saved peers in the interim)
  5. Peer exchange discovers new nodes (using a bootstrap node) and tags them as peer-exchange
  6. A peer-exchange node is discovered/emitted; connect to it (that's our second light push/filter node)
  1. Fail to dial a node: handle this and try another node (easy, as it is all located in the connection manager)
  • how can we reproduce a non-dialable node?
  • in case the dial fails, the ConnectionManager/PeerStore should automatically ignore the node (from my understanding, with autoDial:true (which is used by default), all discovered nodes are dialed)

also, from my understanding, a lot of the process you describe takes place implicitly in libp2p, e.g.:

I'm not entirely sure what you mean by manually handling the above-described process

we might not need to manually handle/connect to discovered peers, as libp2p automatically dials discovered peers and handles pruning if necessary (upper connection limit on ConnectionManager); we should just focus on discovering as many peers as possible (ref: https://github.com/waku-org/js-waku/issues/1117)

Please let me know if I'm missing something

fryorcraken commented 1 year ago
  3. Connection Manager processes the bootstrap node

We do two things with the peer; is that what we mean by "process" here?

* Tag the peer within the PeerStore

The tagging is done by the discovery service: https://github.com/waku-org/js-waku/blob/0b083201c6f585ad1b3267c1571df5f1b81a1355/packages/peer-exchange/src/waku_peer_exchange_discovery.ts#L92

* Dial

The dialling is done by the connection manager: https://github.com/waku-org/js-waku/blob/0b083201c6f585ad1b3267c1571df5f1b81a1355/packages/core/src/lib/waku.ts#L140

We will have to make this trivial logic more complex in the future and have a proper ConnectionManager class/module.

By "process" I mean deciding whether to dial and dialing.

i. Can connect to it and use it for lightpush/store/filter (we can do that now and later remove this logic)

* why do we want to “later remove this logic”?

Because if all js-waku nodes connect to bootstrap nodes to be served (store/light push/filter), then the waku network will be centralized because it would only use the bootstrap nodes. It would also mean that the bootstrap nodes would need to be high-performing nodes to serve all js-waku nodes.

Instead, I propose that the bootstrap nodes only be used for bootstrapping, i.e. getting access to the network, meaning they are only used for discv5 and peer-exchange.

  4. Connection Manager discards/saves for later other bootstrap nodes
* How does the `ConnectionManager` do that? Do we mean the `PeerStore`/`AddressManager`?

I don't know. The node will be tagged as bootstrap and stored in the peerStore, so I guess that's good enough. Or maybe we'll also have to store the nodes in the ConnectionManager. We'll have to decide at implementation time whether the peerStore gives us enough.

The ConnectionManager is the new module that handles most of the logic described here.

* (later, making a note: we might also have to think about the edge cases like the addresses changing for these saved peers in the interim)
  5. Peer exchange discovers new nodes (using a bootstrap node) and tags them as peer-exchange
  6. A peer-exchange node is discovered/emitted; connect to it (that's our second light push/filter node)
* “that’s our second node” — this step is skipped if the initial node requirements are met, correct?

Yes. For the rest of the logic I assume the proposed default requirements are applied.

* the nodes are anyway connected to (unless ConnectionManager's limit is reached, in which case it will auto-prune)

We should not automatically connect to all nodes we discover. This is why we need a Connection Manager and why we need to implement this logic.

  1. Fail to dial a node: handle this and try another node (easy, as it is all located in the connection manager)
* how can we reproduce a non-dialable node?

Pass an invalid multiaddr.

* in case the dial fails, the ConnectionManager/PeerStore should automatically ignore the node (from my understanding, with `autoDial:true` (which is used by default), all discovered nodes are dialed)

We'll have to review the retry logic in detail when it's time to implement.

also, from my understanding, a lot of the process you describe takes place implicitly in libp2p, e.g.:

* discovery, tagging, dial attempts, etc. are all handled implicitly

We should not dial every node we discover. This is why custom connection management is needed.

I'm not entirely sure what you mean by manually handling the above-described process

we might not need to manually handle/connect to discovered peers, as libp2p automatically dials discovered peers and handles pruning if necessary (upper connection limit on ConnectionManager); we should just focus on discovering as many peers as possible (ref: #1117)

Please let me know if I'm missing something

danisharora099 commented 1 year ago

By "process" I mean deciding whether to dial and dialing.

How are we deciding whether to dial or not? If there's a peer available other than bootstrap, we give it more preference and dial it instead of the bootstrap peer - correct? (to increase decentralisation)

Because if all js-waku nodes connect to bootstrap nodes to be served (store/light push/filter), then the waku network will be centralized because it would only use the bootstrap nodes. It would also mean that the bootstrap nodes would need to be high-performing nodes to serve all js-waku nodes.

That makes sense! We should prioritize nodes found by discovery mechanisms other than bootstrapping. Noted.

* the nodes are anyway connected to (unless ConnectionManager's limit is reached, in which case it will auto-prune)

We should not automatically connect to all nodes we discover. This is why we need a Connection Manager and why we need to implement this logic.

We should not dial every node we discover. This is why custom connection management is needed.

My interpretation so far: Our connection priority should be something along the lines of:

is that a fair interpretation?

danisharora099 commented 1 year ago

Also, as concluded from https://github.com/waku-org/js-waku/issues/1117#issuecomment-1376898390, libp2p by default autodials.

Considering the scope of this PR, it's best to default autoDial to false, since we don't want to connect to all discovered bootstrap nodes immediately.
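
A sketch of what that could look like, assuming a js-libp2p version where autoDial is a connectionManager option (the exact option name and location have moved between libp2p releases):

    import { createLibp2p } from "libp2p";

    // Sketch only: stop libp2p from dialing every discovered peer automatically,
    // so the custom connection management logic decides which peers to dial.
    const libp2p = await createLibp2p({
      connectionManager: {
        autoDial: false, // option assumed; it varies across libp2p versions
      },
      // ...transports, peer discovery, etc.
    });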

fryorcraken commented 1 year ago

My interpretation so far: Our connection priority should be something along the lines of:

* connect to only 1 bootstrap peer (as little as possible to avoid centralisation on bootstrap nodes)

* find more peers using other discovery mechanisms, via the bootstrap peer

* connect to all new peers found until requirements are fulfilled

is that a fair interpretation?

Yes.

fryorcraken commented 1 year ago

As discussed with @danisharora099, the first step is actually a focus on connections to discovered nodes. Need to update the description to record this first step.

Next would be exploratory work to understand what information we can get when a node is disconnected.

fryorcraken commented 1 year ago

Blocked by https://github.com/waku-org/js-waku/issues/1412

fryorcraken commented 1 year ago

This milestone needs to be split. Some proposals:

fryorcraken commented 1 year ago

edit: Please continue the discussion below in https://github.com/waku-org/js-waku/issues/1463

Saving peers in local storage should be part of one of the milestones. However, it is not a trivial endeavour. Being able to connect to a peer is a good first step; however, it does not mean it's a quality peer. For example, what if this peer is not connected to the relay network and hence does not forward any message? From the moment we move away from the bootstrap fleet, we enter the decentralized domain and hence peers cannot be trusted, even if not malicious. I am guessing the solution is some sort of peer scoring done for the light protocols.

For example, the local node connects to Alice and Bob and makes the same filter subscription with both:

  1. Alice's score goes up when Alice pushes a message via filter
  2. Alice's score goes up when Alice is the first one to forward a message (i.e. Bob forwards the message after)
  3. Bob's score goes down when he does not forward a message that Alice has forwarded within 5s

Then, for light push:

Send a message via Alice's light push; the message is never received from Bob and Carole -> decrease Alice's score and use David to re-send the message.

This needs to be formalized with help from @waku-org/research and is probably part of 2.1: https://github.com/waku-org/research/issues/3
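
Purely as an illustration of the scoring idea above (not a proposed design), a toy sketch:

    // Toy sketch of per-peer scoring for light protocols; values are made up.
    const scores = new Map<string, number>();

    function adjustScore(peerId: string, delta: number): void {
      scores.set(peerId, (scores.get(peerId) ?? 0) + delta);
    }

    // Filter: Alice forwards a message first, Bob does not forward it within 5s.
    adjustScore("alice", +1);
    adjustScore("bob", -1);

    // Light push: a message sent via Alice is never seen from Bob and Carole,
    // so penalize Alice and re-send via David.
    adjustScore("alice", -1);

    // Prefer the highest-scoring peer when selecting a node for a query.
    const best = [...scores.entries()].sort((a, b) => b[1] - a[1])[0];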

danisharora099 commented 1 year ago

Referring to https://github.com/waku-org/js-waku/issues/1412,

upon further investigation and taking previous findings into account, it's safe to conclude:

  • connection:close: monitors a permanent connection close between the local & remote node (it should usually be triggered along with peer:disconnect in cases where we only have one connection with the remote node)
  • peer:disconnect: monitors permanent disconnections from peers (this implies that the underlying connection(s) have been permanently closed, and the only way to communicate with this peer again is to open a new connection/reconnect)
  • pings will fail when there are temporary network degradations or reachability issues. This does not mean that the underlying connection has been closed.

to address part of the acceptance criteria from this milestone, the principal question remains: how do we understand whether the disconnection from the remote peer is deliberate, or simply due to network conditions (implying we want to initiate a reconnection)?

as we concluded in https://github.com/waku-org/js-waku/issues/1403#issuecomment-1664205504, js-waku redials automatically after the 10-minute mark, which is not a conscious disconnection and requires reconnection, so perhaps the above question is addressed by default already?
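
For reference, a sketch of reacting to a disconnection by scheduling a reconnection attempt (event name and payload are assumed here and have changed across js-libp2p versions):

    // Sketch only: schedule a redial shortly after a peer disconnects.
    // `libp2p` is an existing instance (e.g. `waku.libp2p` on a js-waku node).
    function redialOnDisconnect(libp2p: any): void {
      libp2p.addEventListener("peer:disconnect", (evt: any) => {
        const peerId = evt.detail; // assumed to be the remote peer's PeerId
        setTimeout(() => {
          void libp2p.dial(peerId).catch(() => {
            // still unreachable, or the remote closed deliberately; give up for now
          });
        }, 5_000); // hypothetical fixed delay; a real implementation would back off
      });
    }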

cc @fryorcraken

danisharora099 commented 1 year ago

Weekly Update

danisharora099 commented 1 year ago

the principal question remains: how do we understand whether the disconnection from the remote peer is deliberate, or simply due to network conditions (implying we want to initiate a reconnection)?

for this, we want to simply attempt reconnections


For library consumers,

For Filter,

danisharora099 commented 1 year ago

Weekly Update

danisharora099 commented 1 year ago

Weekly Update