Closed fryorcraken closed 1 year ago
Need to re-groom this issue to take peer discovery in account. Some draft notes:
// createLightNode => 2 light push, 2 filter, 1 store
// dns discovery: emits "bootstrap" nodes
// catch "bootstrap" node -> peer exchange? connect to it
// How do we know it has peer exchange? need to extend peerInfo to include `waku2` ENR field or full ENR
// Already connected to a "bootstrap" node? new node, park it.
// random order? probably at dns discovery level, because we know when we are "done".
// Now we receive a `peer-exchange` node
// -> Do we need this node? What capabilities does it have?
// -> What ndes are we currently connected to?
// -> connect, discard or save for later (add to peer store)
// -> Do a filter query: make it lazy, no wait for remote peer
// Peer management module: good defaults and configurable/swappable
Since this issue is a considerable overhaul, I'd like to be sure of my interpretation of the same.
I think the biggest one is:
* detect a loss of connection
For this, my interpretation is that libp2p's connection manager doesn't emit an event when a peer disconnects/streams are closed (on
peer:disconnect
) (ref: https://github.com/status-im/status-web/pull/288#discussion_r911672852) and that this will either be something we'll be handling ourselves (similar to https://github.com/status-im/status-web/pull/288 or https://github.com/xmtp/xmtp-js/pull/128) or an upstream change to libp2p (cont of https://github.com/libp2p/js-libp2p/issues/939 / https://github.com/libp2p/js-libp2p/issues/744) Please let me know if that's a fair interpretation.
Secondly, an explanation on the following, of how this fits with the initially described scope of the PR, will also be greatly appreciated.
Need to re-groom this issue to take peer discovery in account. Some draft notes:
// createLightNode => 2 light push, 2 filter, 1 store // dns discovery: emits "bootstrap" nodes // catch "bootstrap" node -> peer exchange? connect to it // How do we know it has peer exchange? need to extend peerInfo to include `waku2` ENR field or full ENR // Already connected to a "bootstrap" node? new node, park it. // random order? probably at dns discovery level, because we know when we are "done". // Now we receive a `peer-exchange` node // -> Do we need this node? What capabilities does it have? // -> What ndes are we currently connected to? // -> connect, discard or save for later (add to peer store) // -> Do a filter query: make it lazy, no wait for remote peer // Peer management module: good defaults and configurable/swappable
Other than this, appreciate that the issue has been written very descriptively and will have a great impact!
cc @fryorcraken @felicio
As mentioned. this is a considerable overhaul and would need to be tackled iteratively. Each iteration can increase the functionality/complexity of the manager while taking onboard feedback from dogfooding.
I suggest to focus the first iterations on using the peer discovery protocols dns discovery and peer exchange with the following logic:
createLightNode
, apply default values of wanting 2 light push nodes, 2 filters nodes and 1 store node (user can override values).bootstrap
bootstrap
node
i. Can connect to it and use it for lightpush/store/filter (we can do that now and later remove this logic)
ii. Use peer exchange to discover more nodesbootstrap
nodespeer-exchange
peer-exchange
node is discovered/emitted, connect to it (that's our second light push/filter node)peer-exchange
node to fulfill requirements set in (1) if needed.(3) introduces an issue: the first node discovery by DNS discovery will always be the same given a given enr tree. Hence we may need to introduce some pseudo random shuffling logic. this should probably be done in the connection manager.
The final result above can already be split in several tasks/PRs.
Then, we can add error resilience on top of it, I'd recommend in the following order:
- Connection Manager processes the
bootstrap
nodeWe do two things with the peer, is that what we mean by "process" here?
- Tag the peer within the PeerStore
- Dial
i. Can connect to it and use it for lightpush/store/filter (we can do that now and later remove this logic)
- why do we want to “later remove this logic”?
- Connection Manager discards/saves for later other
bootstrap
nodes
- How does the
ConnectionManager
do that? Do we mean thePeerStore
/AddressManager
?- (later, making a note: we might also have to think about the edge cases like the addresses changing for these saved peers in the interim)
- peer exchange discover new nodes (using a bootstrap node), tags then as
peer-exchange
peer-exchange
node is discovered/emitted, connect to it (that's our second light push/filter node)
- Fail to dial a node: handle this and try another node (easy as all located in connection manager)
- how can we reproduce a non-dialable node?
- in the case the dial fails, the ConnectionManager/PeerStore should automatically ignore the node (from my understanding, with
autoDial:true
(which is used by default), all nodes are tried to connect to)
also, a lot of the process you describe, from my understanding, takes place implicitly by libp2p for eg:
I'm not entirely sure what you mean by manually handling the above-described process
we might not need to manually handle/connect to discovered peers as libp2p automatically dials to discovered peers and handles pruning if necessary (upper connection limit on ConnectionManager) — we should just focus on discovering the most number of peers (ref: https://github.com/waku-org/js-waku/issues/1117)
Please let me know if I'm missing something
- Connection Manager processes the
bootstrap
nodeWe do two things with the peer, is that what we mean by "process" here?
* Tag the peer within the PeerStore
The tagging is done by the discovery service: https://github.com/waku-org/js-waku/blob/0b083201c6f585ad1b3267c1571df5f1b81a1355/packages/peer-exchange/src/waku_peer_exchange_discovery.ts#L92
* Dial
The dialling is done by the connection manager: https://github.com/waku-org/js-waku/blob/0b083201c6f585ad1b3267c1571df5f1b81a1355/packages/core/src/lib/waku.ts#L140
We will have to make this trivial logic more complex in the future and have a proper ConnectionManager
class/module.
By "process" I mean deciding whether to dial and dialing.
i. Can connect to it and use it for lightpush/store/filter (we can do that now and later remove this logic)
* why do we want to “later remove this logic”?
Because if all js-waku nodes connects to bootstrap
nodes to be served (store/light push/filter), then the waku network will be centralized because it would only use the bootstrap
nodes.
It would also mean that the bootstrap
nodes would need to be high performing node to serve all js-waku nodes.
Instead, I propose for the bootstrap
node to only be used for bootstrapping, ie, getting access to the network. meaning only used for discv5 and peer-exchange.
- Connection Manager discards/saves for later other
bootstrap
nodes* How does the `ConnectionManager` do that? Do we mean the `PeerStore`/ `AddressManager`?
I don't know. The node will be tagged as bootstrap
and store in peerStore
so I guess that's good enough.
Or maybe we'll have to also store the nodes in the ConnectionManager
. We'll have to decide at implementation time if the peerStore
gives us enough.
The ConnectManager
is the new module that handles most of the logic described here.
* (later, making a note: we might also have to think about the edge cases like the addresses changing for these saved peers in the interim)
- peer exchange discover new nodes (using a bootstrap node), tags then as
peer-exchange
peer-exchange
node is discovered/emitted, connect to it (that's our second light push/filter node)* “that’s our second node” — this step is skipped if the initial node requirements are met, correct?
Yes. For the rest of the logic I assume the proposed default requirements are applied.
* the nodes are anyway connected to (unless ConnectionManager's limit is reached, in which case it will auto-prune)
We should automatically connect to all node we discover. Which is why we need a Connection Manager and why we need to implement this logic.
- Fail to dial a node: handle this and try another node (easy as all located in connection manager)
* how can we reproduce a non-dialable node?
Pass an invalid multiaddr
.
* in the case the dial fails, the ConnectionManager/PeerStore should automatically ignore the node (from my understanding, with `autoDial:true` (which is used by default), all nodes are tried to connect to)
We'll have to review in details the retry logic when it's time to implement.
also, a lot of the process you describe, from my understanding, takes place implicitly by libp2p for eg:
* discovery, tagging, dial attempts, etc are all handled implicitly
We should not dial every node we discover. This is why custom connection management is needed.
I'm not entirely sure what you mean by manually handling the above-described process
we might not need to manually handle/connect to discovered peers as libp2p automatically dials to discovered peers and handles pruning if necessary (upper connection limit on ConnectionManager) — we should just focus on discovering the most number of peers (ref: #1117)
Please let me know if I'm missing something
By "process" I mean deciding whether to dial and dialing.
How are we deducing whether to dial or not? If there's a peer available other than bootstrap, we give it more preference and dial to it instead of the bootstrap peer - correct? (to increase decentralisation)
Because if all js-waku nodes connects to
bootstrap
nodes to be served (store/light push/filter), then the waku network will be centralized because it would only use thebootstrap
nodes. It would also mean that thebootstrap
nodes would need to be high performing node to serve all js-waku nodes. That makes sense! We should prioritize nodes found by discovery mechanisms other the bootstrapping. Noted.* the nodes are anyway connected to (unless ConnectionManager's limit is reached, in which case it will auto-prune)
We should automatically connect to all node we discover. Which is why we need a Connection Manager and why we need to implement this logic.
We should not dial every node we discover. This is why custom connection management is needed.
My interpretation so far: Our connection priority should be something along the lines of:
is that a fair interpretation?
Also, as concluded from https://github.com/waku-org/js-waku/issues/1117#issuecomment-1376898390, libp2p by default autodials.
Considering the scope of this PR, it's best to default autoDial
to false considering we don't want to connect to all discovered bootstrap nodes immediately.
My interpretation so far: Our connection priority should be something along the lines of:
* connect to only 1 bootstrap peer (as little as possible to avoid centralisation on bootstrap nodes) * find more peers using other discovery mechanisms, via the bootstrap peer * connect to all new peers found until requirements are fulfilled
is that a fair interpretation?
Yes.
As discussed with @danisharora099 the first step was actually a focus on connections to node discovered, Need to update description to record this first step.
Next, would be exploratory work to understand what information we can get when a node is disconnected.
This milestones needs to be split. Some proposal:
edit: Please continue discussion below to https://github.com/waku-org/js-waku/issues/1463
Saving peers in local storage should be part of one of the milestone. However, it is not a trivial endeavour. Being able to connect to a peer is a good first step. However, it does not mean it's a quality peer. For example, what if this peer is not connected to the relay network and hence does not forward any message? From the moment we are moving away from the bootstrap fleet, we enter the decentralized domain and hence peers cannot be trusted, even if not malicious. I am guessing the solution is some sort of peer scoring done for light protocol.
For example, local node connects to Alice and Bob and does the same filter subscription for both:
Then, for light push:
Send a message via Alice light push, message not ever received from Bob and Carole -> decrease Alice score and use David to re-send message
This needs to formalized with help from @waku-org/research and probably part of 2.1 https://github.com/waku-org/research/issues/3
referring from https://github.com/waku-org/js-waku/issues/1412,
upon further investigation and taking into context previous findings, it's safe to conclude:
connection:close
: monitor a permanent connection close between the local & remote node (which should usually be triggered along withpeer:disconnect
for cases where we only have one connection with the remote node)peer:disconnect
: monitor permanent disconnections with peers (this would imply not that no the underlying connection(s) have been permanently closed, and the only way to communicate with this peer again is to open a new connection/reconnect)- pings will fail when there are temporary network degradations or reachability issues. this does not mean that the underlying connection has been closed.
to address part of the acceptance criteria
from this milestone,
the principal question remains: how do we understand if the disconnection with the remote peer is deliberate, or simply due to network conditions (implying we want to initiate a reconnection)
as we concluded with https://github.com/waku-org/js-waku/issues/1403#issuecomment-1664205504, js-waku redials automatically after the 10 mins mark, which is not a conscious disconnection and requires reconnection so perhaps the above question is enforced by default already?
cc @fryorcraken
Weekly Update
the principal question remains: how do we understand if the disconnection with the remote peer is deliberate, or simply due to network conditions (implying we want to initiate a reconnection)
for this, we want to simply attempt reconnections
For library consumers,
For Filter,
we should open a subscription, and also send recurring SUBSCRIBE_PING
to the node to ensure we're maintaing active subscriptions
in case of reconnection, SUBSCRIBER_PING
should be used to check for active subscription on the service nodes' end
[x] write explicit tests to cover this hypothetical as much as possible https://github.com/waku-org/docs.waku.org/issues/104#issuecomment-1759113733
[x] add something like manage your filter subscriptions
to advanced js-waku docs https://github.com/waku-org/docs.waku.org/issues/104#issuecomment-1759113733
Weekly Update
Weekly Update
Planned start date: Due date:
Summary
In a browser environment, loss of connectivity may happen.
When it happens, this would impede the application to receive and send messages.
Strategies need to be designed to:
waitForRemotePeer
)Proposed Solutions
The result of this issue would be a mix of:
At this stage it is not sure what should go in waku core and what should be a library helper.
Acceptance Criteria
Tasks
N
attemptsdial all available bootstrap peers if no PX peers are dialable after N secondsTasks Moved to Out of Scope
Notes
https://github.com/libp2p/js-libp2p/issues/744
RAID (Risks, Assumptions, Issues and Dependencies)