waku-org / js-waku

JavaScript implementation of Waku v2

https://js.waku.org

Apache License 2.0

169 stars 42 forks source link

Peer Management: Connection and Disconnection #914

Closed fryorcraken closed 1 year ago

fryorcraken commented 2 years ago

Planned start date: Due date:

Summary

In a browser environment, loss of connectivity may happen.

When it happens, this would impede the application to receive and send messages.

Strategies need to be designed to:

detect a loss of connection
Automated actions when resuming a connection
- auto reconnect to remote peer
- Auto trigger some protocols (e.g. Waku Store #252)
- Renewing Filter subscription
- Waiting for peer to be in mesh relay before considering the action live (eg waitForRemotePeer)
cope with lack of connectivity when trying to use a protocol: retry strategy, feedback to the app and guideline on how to handle
- Querying message using Waku Store: eg #913
- Trying to send a message with light push

Proposed Solutions

The result of this issue would be a mix of:

Documentation to guide the developers in handling failure
Strategy embedded in the protocols (e.g. retry)
Helper library that can provide utility to developers (e.g. auto store query, auto filter renewal)

At this stage it is not sure what should go in waku core and what should be a library helper.

Acceptance Criteria

[x] Upon start-up, app dial a reasonable set of peers from peer discovery methods (no persistence, only dial 1 bootstrap peer)
[x] Upon start-up, if some peers are not reachable, app attempt fallback method (try 2nd bootstrap peer)
[x] Following above rules, when disconnected to a remote peer, attempts to reconnect
[x] Clear guidelines for developer to know when node is connected or disconnected https://github.com/waku-org/docs.waku.org/issues/104#issuecomment-1759122214

Tasks

[x] setup manual controlled connection management for peers (some work being done here: https://github.com/waku-org/js-waku/pull/1135/)
- [x] remove libp2p's autodial
- [x] dial peer on the basis of set requirements, redial if necessary, stop dialing after N attempts
- [x] connect to minimum bootstrap nodes (1) and give preference to other discovery mechanisms (peer-exchange)
- [ ] ~~dial all available bootstrap peers if no PX peers are dialable after N seconds~~
[x] Investigate the effect of connection loss -> tracked here: https://github.com/waku-org/js-waku/issues/1412
- [x] libp2p events w.r.t. connected/disconnected peers (https://github.com/waku-org/js-waku/issues/1189#issuecomment-1469449583)
- [x] auto-connection to disconnected peers
- [x] https://github.com/waku-org/nwaku/issues/2116
- [x] https://github.com/waku-org/js-waku/issues/1646
- [x] https://github.com/waku-org/docs.waku.org/issues/104#issuecomment-1759113733
- [x] https://github.com/waku-org/js-waku/pull/1657

Tasks Moved to Out of Scope

[ ] Waku Filter subscription state after disconnection/re-connection
[ ] Waku Relay state after disconnection/re-connection
- [ ] https://github.com/waku-org/js-waku/issues/1349
[ ] Usage of Waku Light Push during connection loss
[ ] Usage of Waku Store during connection loss
- [ ] Review possible automated actions upon re-connection (based on learnings from above)
- [ ] auto store query #913
- [ ] auto filter subscribe
- [x] https://github.com/waku-org/js-waku/issues/132
- [x] https://github.com/waku-org/js-waku/issues/1459

Notes

https://github.com/libp2p/js-libp2p/issues/744

RAID (Risks, Assumptions, Issues and Dependencies)

fryorcraken commented 1 year ago

Need to re-groom this issue to take peer discovery in account. Some draft notes:

    // createLightNode => 2 light push, 2 filter, 1 store
    // dns discovery: emits "bootstrap" nodes
    // catch "bootstrap" node -> peer exchange? connect to it
    // How do we know it has peer exchange? need to extend peerInfo to include `waku2` ENR field or full ENR
    // Already connected to a "bootstrap" node? new node, park it.
    // random order? probably at dns discovery level, because we know when we are "done".
    // Now we receive a `peer-exchange` node
    // -> Do we need this node? What capabilities does it have?
    // -> What ndes are we currently connected to?
    // -> connect, discard or save for later (add to peer store)
    // -> Do a filter query: make it lazy, no wait for remote peer
    // Peer management module: good defaults and configurable/swappable

danisharora099 commented 1 year ago

Since this issue is a considerable overhaul, I'd like to be sure of my interpretation of the same.

I think the biggest one is:

* detect a loss of connection
For this, my interpretation is that libp2p's connection manager doesn't emit an event when a peer disconnects/streams are closed (on peer:disconnect) (ref: https://github.com/status-im/status-web/pull/288#discussion_r911672852) and that this will either be something we'll be handling ourselves (similar to https://github.com/status-im/status-web/pull/288 or https://github.com/xmtp/xmtp-js/pull/128) or an upstream change to libp2p (cont of https://github.com/libp2p/js-libp2p/issues/939 / https://github.com/libp2p/js-libp2p/issues/744) Please let me know if that's a fair interpretation.

Secondly, an explanation on the following, of how this fits with the initially described scope of the PR, will also be greatly appreciated.

Need to re-groom this issue to take peer discovery in account. Some draft notes:

    // createLightNode => 2 light push, 2 filter, 1 store
    // dns discovery: emits "bootstrap" nodes
    // catch "bootstrap" node -> peer exchange? connect to it
    // How do we know it has peer exchange? need to extend peerInfo to include `waku2` ENR field or full ENR
    // Already connected to a "bootstrap" node? new node, park it.
    // random order? probably at dns discovery level, because we know when we are "done".
    // Now we receive a `peer-exchange` node
    // -> Do we need this node? What capabilities does it have?
    // -> What ndes are we currently connected to?
    // -> connect, discard or save for later (add to peer store)
    // -> Do a filter query: make it lazy, no wait for remote peer
    // Peer management module: good defaults and configurable/swappable

Other than this, appreciate that the issue has been written very descriptively and will have a great impact!

cc @fryorcraken @felicio

fryorcraken commented 1 year ago

As mentioned. this is a considerable overhaul and would need to be tackled iteratively. Each iteration can increase the functionality/complexity of the manager while taking onboard feedback from dogfooding.

I suggest to focus the first iterations on using the peer discovery protocols dns discovery and peer exchange with the following logic:

when using createLightNode, apply default values of wanting 2 light push nodes, 2 filters nodes and 1 store node (user can override values).
DNS discovery bootstrap method is used, it discovers and emit nodes that are tagged as bootstrap
Connection Manager processes the bootstrap node i. Can connect to it and use it for lightpush/store/filter (we can do that now and later remove this logic) ii. Use peer exchange to discover more nodes
Connection Manager discards/saves for later other bootstrap nodes
peer exchange discover new nodes (using a bootstrap node), tags then as peer-exchange
peer-exchange node is discovered/emitted, connect to it (that's our second light push/filter node)
connect to more peer-exchange node to fulfill requirements set in (1) if needed.

(3) introduces an issue: the first node discovery by DNS discovery will always be the same given a given enr tree. Hence we may need to introduce some pseudo random shuffling logic. this should probably be done in the connection manager.

The final result above can already be split in several tasks/PRs.

Then, we can add error resilience on top of it, I'd recommend in the following order:

Fail to dial a node: handle this and try another node (easy as all located in connection manager)
protocol failed on a node (ie, light push query failed): disconnect node and try other node: this is harder because the protocol needs to feedback the failure to the manager which can then handle the disconnect/connection. Also the protocol should know whether the error is worthy a disconnection. This is not really feasible now (no differentiations in errors) but may be with the waku store protocol overhaul).
Node disconnects: as you mentioned, we need to have feedback from libp2p for that so it may be more difficult.

danisharora099 commented 1 year ago

Connection Manager processes the bootstrap node

We do two things with the peer, is that what we mean by "process" here?

Tag the peer within the PeerStore

Dial

i. Can connect to it and use it for lightpush/store/filter (we can do that now and later remove this logic)

why do we want to “later remove this logic”?

Connection Manager discards/saves for later other bootstrap nodes

How does the ConnectionManager do that? Do we mean the PeerStore/ AddressManager?

(later, making a note: we might also have to think about the edge cases like the addresses changing for these saved peers in the interim)

peer exchange discover new nodes (using a bootstrap node), tags then as peer-exchange

peer-exchange node is discovered/emitted, connect to it (that's our second light push/filter node)

“that’s our second node” — this step is skipped if the initial node requirements are met, correct?
the nodes are anyway connected to (unless ConnectionManager's limit is reached, in which case it will auto-prune)

Fail to dial a node: handle this and try another node (easy as all located in connection manager)

how can we reproduce a non-dialable node?

in the case the dial fails, the ConnectionManager/PeerStore should automatically ignore the node (from my understanding, with autoDial:true (which is used by default), all nodes are tried to connect to)

also, a lot of the process you describe, from my understanding, takes place implicitly by libp2p for eg:

discovery, tagging, dial attempts, etc are all handled implicitly

I'm not entirely sure what you mean by manually handling the above-described process

we might not need to manually handle/connect to discovered peers as libp2p automatically dials to discovered peers and handles pruning if necessary (upper connection limit on ConnectionManager) — we should just focus on discovering the most number of peers (ref: https://github.com/waku-org/js-waku/issues/1117)

Please let me know if I'm missing something

fryorcraken commented 1 year ago

Connection Manager processes the bootstrap node

We do two things with the peer, is that what we mean by "process" here?
* Tag the peer within the PeerStore

The tagging is done by the discovery service: https://github.com/waku-org/js-waku/blob/0b083201c6f585ad1b3267c1571df5f1b81a1355/packages/peer-exchange/src/waku_peer_exchange_discovery.ts#L92

* Dial

The dialling is done by the connection manager: https://github.com/waku-org/js-waku/blob/0b083201c6f585ad1b3267c1571df5f1b81a1355/packages/core/src/lib/waku.ts#L140

We will have to make this trivial logic more complex in the future and have a proper ConnectionManager class/module.

By "process" I mean deciding whether to dial and dialing.

i. Can connect to it and use it for lightpush/store/filter (we can do that now and later remove this logic)
* why do we want to “later remove this logic”?

Because if all js-waku nodes connects to bootstrap nodes to be served (store/light push/filter), then the waku network will be centralized because it would only use the bootstrap nodes. It would also mean that the bootstrap nodes would need to be high performing node to serve all js-waku nodes.

Instead, I propose for the bootstrap node to only be used for bootstrapping, ie, getting access to the network. meaning only used for discv5 and peer-exchange.

Connection Manager discards/saves for later other bootstrap nodes
* How does the `ConnectionManager` do that? Do we mean the `PeerStore`/  `AddressManager`?

I don't know. The node will be tagged as bootstrap and store in peerStore so I guess that's good enough. Or maybe we'll have to also store the nodes in the ConnectionManager. We'll have to decide at implementation time if the peerStore gives us enough.

The ConnectManager is the new module that handles most of the logic described here.

* (later, making a note: we might also have to think about the edge cases like the addresses changing for these saved peers in the interim)
peer exchange discover new nodes (using a bootstrap node), tags then as peer-exchange

peer-exchange node is discovered/emitted, connect to it (that's our second light push/filter node)
* “that’s our second node” — this step is skipped if the initial node requirements are met, correct?

Yes. For the rest of the logic I assume the proposed default requirements are applied.

* the nodes are anyway connected to (unless ConnectionManager's limit is reached, in which case it will auto-prune)

We should automatically connect to all node we discover. Which is why we need a Connection Manager and why we need to implement this logic.

Fail to dial a node: handle this and try another node (easy as all located in connection manager)
* how can we reproduce a non-dialable node?

Pass an invalid multiaddr.

* in the case the dial fails, the ConnectionManager/PeerStore should automatically ignore the node (from my understanding, with `autoDial:true` (which is used by default), all nodes are tried to connect to)

We'll have to review in details the retry logic when it's time to implement.

also, a lot of the process you describe, from my understanding, takes place implicitly by libp2p for eg:
* discovery, tagging, dial attempts, etc are all handled implicitly

We should not dial every node we discover. This is why custom connection management is needed.

I'm not entirely sure what you mean by manually handling the above-described process

we might not need to manually handle/connect to discovered peers as libp2p automatically dials to discovered peers and handles pruning if necessary (upper connection limit on ConnectionManager) — we should just focus on discovering the most number of peers (ref: #1117)

Please let me know if I'm missing something

danisharora099 commented 1 year ago

By "process" I mean deciding whether to dial and dialing.

How are we deducing whether to dial or not? If there's a peer available other than bootstrap, we give it more preference and dial to it instead of the bootstrap peer - correct? (to increase decentralisation)

Because if all js-waku nodes connects to bootstrap nodes to be served (store/light push/filter), then the waku network will be centralized because it would only use the bootstrap nodes. It would also mean that the bootstrap nodes would need to be high performing node to serve all js-waku nodes. That makes sense! We should prioritize nodes found by discovery mechanisms other the bootstrapping. Noted.
* the nodes are anyway connected to (unless ConnectionManager's limit is reached, in which case it will auto-prune)
We should automatically connect to all node we discover. Which is why we need a Connection Manager and why we need to implement this logic.

We should not dial every node we discover. This is why custom connection management is needed.

My interpretation so far: Our connection priority should be something along the lines of:

connect to only 1 bootstrap peer (as little as possible to avoid centralisation on bootstrap nodes)
find more peers using other discovery mechanisms, via the bootstrap peer
connect to all new peers found until requirements are fulfilled

is that a fair interpretation?

danisharora099 commented 1 year ago

Also, as concluded from https://github.com/waku-org/js-waku/issues/1117#issuecomment-1376898390, libp2p by default autodials.

Considering the scope of this PR, it's best to default autoDial to false considering we don't want to connect to all discovered bootstrap nodes immediately.

fryorcraken commented 1 year ago

My interpretation so far: Our connection priority should be something along the lines of:
* connect to only 1 bootstrap peer (as little as possible to avoid centralisation on bootstrap nodes)

* find more peers using other discovery mechanisms, via the bootstrap peer

* connect to all new peers found until requirements are fulfilled
is that a fair interpretation?

Yes.

fryorcraken commented 1 year ago

As discussed with @danisharora099 the first step was actually a focus on connections to node discovered, Need to update description to record this first step.

Next, would be exploratory work to understand what information we can get when a node is disconnected.

fryorcraken commented 1 year ago

Blocked by https://github.com/waku-org/js-waku/issues/1412

fryorcraken commented 1 year ago

This milestones needs to be split. Some proposal:

First milestone is working under normal operation (10min reconneciton)
Second milestone could be investigation of report bugs/edge case and fixing them
Third milestone can contain "the rest" from this milestone
Peer exchange related work to be tracked in a peer exchange milestone

fryorcraken commented 1 year ago

edit: Please continue discussion below to https://github.com/waku-org/js-waku/issues/1463

Saving peers in local storage should be part of one of the milestone. However, it is not a trivial endeavour. Being able to connect to a peer is a good first step. However, it does not mean it's a quality peer. For example, what if this peer is not connected to the relay network and hence does not forward any message? From the moment we are moving away from the bootstrap fleet, we enter the decentralized domain and hence peers cannot be trusted, even if not malicious. I am guessing the solution is some sort of peer scoring done for light protocol.

For example, local node connects to Alice and Bob and does the same filter subscription for both:

Alice's score goes up when Alice push a message via fliter
Alice's score goes up when Alice is the first one to forward a message (ie, Bob forward the message after)
Bob's score goes down when he does not forward a message that Alice has forwarded within 5s

Then, for light push:

use Alice for light push
use Bob (and Carole for filter)

Send a message via Alice light push, message not ever received from Bob and Carole -> decrease Alice score and use David to re-send message

This needs to formalized with help from @waku-org/research and probably part of 2.1 https://github.com/waku-org/research/issues/3

danisharora099 commented 1 year ago

referring from https://github.com/waku-org/js-waku/issues/1412,

upon further investigation and taking into context previous findings, it's safe to conclude:

connection:close: monitor a permanent connection close between the local & remote node (which should usually be triggered along with peer:disconnect for cases where we only have one connection with the remote node)

peer:disconnect: monitor permanent disconnections with peers (this would imply not that no the underlying connection(s) have been permanently closed, and the only way to communicate with this peer again is to open a new connection/reconnect)

pings will fail when there are temporary network degradations or reachability issues. this does not mean that the underlying connection has been closed.

to address part of the acceptance criteria from this milestone, the principal question remains: how do we understand if the disconnection with the remote peer is deliberate, or simply due to network conditions (implying we want to initiate a reconnection)

as we concluded with https://github.com/waku-org/js-waku/issues/1403#issuecomment-1664205504, js-waku redials automatically after the 10 mins mark, which is not a conscious disconnection and requires reconnection so perhaps the above question is enforced by default already?

cc @fryorcraken

danisharora099 commented 1 year ago

Weekly Update

achieved: investigated & closed #1412
next: look into addressing deliberate vs accidental disconnections

danisharora099 commented 1 year ago

the principal question remains: how do we understand if the disconnection with the remote peer is deliberate, or simply due to network conditions (implying we want to initiate a reconnection)

for this, we want to simply attempt reconnections

For library consumers,

For Filter,

we should open a subscription, and also send recurring SUBSCRIBE_PING to the node to ensure we're maintaing active subscriptions
- if we don't have an active subscription, we should reinitiate a subscription with the node
in case of reconnection, SUBSCRIBER_PING should be used to check for active subscription on the service nodes' end
[x] write explicit tests to cover this hypothetical as much as possible https://github.com/waku-org/docs.waku.org/issues/104#issuecomment-1759113733
[x] add something like manage your filter subscriptions to advanced js-waku docs https://github.com/waku-org/docs.waku.org/issues/104#issuecomment-1759113733

danisharora099 commented 1 year ago

Weekly Update

achieved: reached a conclusion tackling deliberate vs accidental disconnections, PRs opened to handle Filter subscriptions on disconnection/reconnections, iterative fixes on addressing multiple dial attempts for same peer, fixes around keep alive pings
next: getting reviews & merging these PRs which should enable us to close this epic 🥳

danisharora099 commented 1 year ago

Weekly Update

achieved: The Connection and Disconnection Peer Management epic has been closed