Waku v2 monitoring node

jm-clius commented 1 year ago

Problem

We need a node that can be deployed to a specific Waku v2 network that can be used to gather various network metrics and monitor the overall health of the network. Note that this should not be confused with the "Waku v2 canary service" which will function more like a tool that can be used for checking the health of specific nodes on an ad-hoc basis.

Suggestion

This node could, for example,

"walk" the discv5 DHT in the network to get an idea of the number of nodes participating in the network
try to get an idea of the number of unique peer IDs seen in the network (using various discovery methods)
keep track of the number of messages per:
- protocol
- content topic
- application (first part of content topic)

Tracking:

[x] metrics on discovered and connected peers: https://github.com/status-im/nwaku/pull/1290
[x] metrics on messages (application, topic, etc): #1335
[ ] run an instance of networkmonitor exposing its data in a grafana dashboard.
- [x] https://github.com/status-im/infra-misc/issues/104#issuecomment-1315622948
- [ ] https://github.com/status-im/nwaku/pull/1401

jm-clius commented 1 year ago

@D4nte I think this node is closer to what you had in mind in this comment: https://github.com/status-im/nwaku/issues/754#issuecomment-1161443548

alrevuelta commented 1 year ago

@jm-clius This is something that interests me and I think it would be a great way of starting with both nim and libp2p, perhaps using nim-libp2p. I've done similar work in the Ethereum consensus layer with armiarma. This project:

Discovers peers using discv5, estimating the total peers the network has.
Pings each peer attempting a connection in order to fetch some metadata (mainly the UserAgent) to see:
- Which client software (prysm, nimbus, teku, lighthouse) the node it running.
- Build version and OS (usually part of the UserAgent)
- If the node is synced to the latest head
The crawler also listens to all pub-sub messages in the network (topics: beacon_block, voluntary_exit, etc)

jm-clius commented 1 year ago

@alrevuelta, indeed! With the operator trial having just launched, the value that a monitoring node will add has grown significantly. A basic monitoring node is indeed a good next step after the basic canary service to estimate at least two things as a start: estimate network size using discv5 and estimate the number of active applications (using contentTopic prefix) on the default network. Liveness checks, a way of determining client type, version, etc. would be a very valuable bonus!

alrevuelta commented 1 year ago

Related project nebula-crawler, similar to armiarma

alrevuelta commented 1 year ago

@jm-clius any opinion on:

whether to use WakuNode or directly libp2p and discoveryv5? The former option allows less flexibility so I'm leaning toward the later.

jm-clius commented 1 year ago

I don't have strong feelings about this, other than to keep it very simple as a start. Although we should imagine the types of questions we may want to ask in future, our immediate need is just to monitor two aspects:

estimate network size (discv5 based as a start, but we may supplement with other discovery methods)
count number of applications using the network

While I can imagine for (1) network size estimation with discv5 could built separately on discv5 (note the protocol ID change for Waku), once we start adding more discovery methods we may need an integration point similar to a WakuNode. For (2) I can barely think of another way to do this properly other than a (simplified) node participating in the network, subscribing to the default pubsub topic and counting the number of content topics/applications.

We should also foresee that we may want to monitor more and more application-specific network metrics as we go along, which seems to me to be easiest to trigger from a node that's an active participant in the network.

alrevuelta commented 1 year ago

@jm-clius Regarding using nwaku or directly discv5/libp2p, we both agreed that nwaku may be more suitable. However, I'm having second thoughts regarding using WakuDiscoveryV5 for peer discovery.

The main reason is that we may have to go lower level for peer discovery, using directly discv5 primitives. I guess WakuDiscoveryV5 is not optimized for finding all peers in the network, but a random subset of them.

I guess that in a network of a few hundred it won't make much of a difference, but if we reach 1k-5k I'm not sure we will discover them all. Perhaps spamming findRandomPeers is enough, but changing the k param (aka BUCKET_SIZE) might be interesting.

But perhaps I'm getting ahead of myself here.

jm-clius commented 1 year ago

Indeed, whereas most monitoring we want to do in future makes more sense from an application perspective, building a discv5 crawler of some kind may indeed require a more precise, separate service (i.e. the monitoring node could spin up this service too, but it may be a new service built directly on discv5).

However, I don't want the initial chunk of this work to be too complex - we just need a rough idea of how the current network, which is very small, is growing and we may get away by just using findRandomPeers() for now (which should eventually give us an approximate answer with some confidence).

@kaiserd wdyt? There may be a simple technical solution I'm missing.

kaiserd commented 1 year ago

To avoid confusion regarding the term WakuDiscoveryV5:

33/WAKU2-DISCV5 is our discv5 spec for Waku, which uses a different protocol-id. You would have to adhere to this spec in order to discover Waku nodes.

However, I assume, you were refering to waku_discv5.nim, which adds some functions useful for Waku on nim-eth/discv5.

For your implementation, you would not be limited to waku_discv5.nim. You are free to use nim-eth/discv5 directly. Our version in vendor/nim-eth is based on a feature branch that already supports 33/WAKU2-DISCV5.

Imo, the cleanest way to implement this would be a new type of node. This node would only mount protocols necessary for this purpose and feature the additional metrics gathering and logging. A good starting point would be copying the current node and removing everything that is not necessary for now. As @jm-clius pointed out, we do not know what we want to measure in the future, so it would be good for this node to work on the Waku v2 layer (which allows it to look into Waku protocols as well as deeper into the stack, too).

alrevuelta commented 1 year ago

@kaiserd Great info thanks! Some followup questions:

Had no idea we had a different protocol-id. Can't we open a PR to nim-eth and make it configurable so we avoid forking it? Or am I missing something?
We have wakuv2.prod and wakuv2.test fleets, but how are they segregated? Can't find any flag ./build/wakunode2 --help that allows for deployments in different networks + the ENRs look the same (apart from the keys). So how are both networks segregated? In ethereum consensus layer they use fork-id and fork-digest for this matter, which is encoded in the ENR. They use this to segregate mainnet from testnets. And actually I think that gnosis chain uses the same network with a different fork-id/digest. But to the best of my knowledge the protocol-id is the same.
Understand the needle-in-a-haystack issue and both 33/WAKU2-DISCV5 and discv5 are great writeups but would like to go deeper. I naively thought that there was some mechanism to filter out nodes that you were not interested in, so that you don't store them in your dht. If this were the case, we wouldn't have the needle-in-a-haystack right? So according to what you say, ethereum mainnet nodes are discovering testnet nodes and even gnosis nodes?

jm-clius commented 1 year ago

I can maybe help answer some of these questions:

We don't fork nim-eth. The protocol-id is set as a compile-time definition.
We do not separate the DHTs between different fleets. The different fleets can be (and should eventually be) part of the same network - there's no strong reason to keep them separated, although we could come up with ways to do it if need be.

alrevuelta commented 1 year ago

We don't fork nim-eth. The protocol-id is set as a compile-time definition.

Right, let me change "forking" to "branching". As I can see we use a branch selectable-protocol-id that has to be rebased to master (i.e. https://github.com/status-im/nwaku/pull/1276) when we want to bump nim-eth. Can't that change be part of master?

We do not separate the DHTs between different fleets. The different fleets can be (and should eventually be) part of the same network - there's no strong reason to keep them separated, although we could come up with ways to do it if need be.

Interesting, I totally assumed they were completely different networks, like EthGoerli and EthMainnet. And since with RLN we are piggybacking on ethereum, I expected wakuv2.test to be using i.e. goerli smart contracts and wakuv2.prod mainnet smart contracts (when we eventually reach that point). Seems weird to me that they are part of the same network, since they should be imho different environments.

Anyway, this goes beyond this issue, but its great input because now I know which peers I should expect to discover.

kaiserd commented 1 year ago

I naively thought that there was some mechanism to filter out nodes that you were not interested in, so that you don't store them in your dht.

Filtering would boil down to random walk searching, if Waku nodes are rare within the whole network. There is new research regarding resilient but efficient topic discovery in discv5. The blog post Waku v2 Ambient Peer Discovery, which extends on the forum post, discusses this. (Currently, I focus on the anonymity track, but will go back to that research eventually. For now, we decided to wait on new results on discv5 research and stick a separate Waku network for now.)

Can't that change be part of master

At the moment, this is still somewhat "experimental", and it is not part of the Ethereum discv5 spec which nim-eth is following.

I totally assumed they were completely different networks,

While the Ethereum forkid can be transmitted in the discovery network, it does not actually fork the discovery layer. It forks the Ethereum layer. The protocol-id forks the discovery layer.

Regarding Waku, discovery and Relay are two separate overlay networks, too. Waku protocols (like libp2p protocols) feature a protocol ID. As long as nodes of different fleets have matching (fuzzy matching is possible) protocol IDs, they are interoperable.

The discovery layer is oblivious to Waku capabilities (for now), and there is only a single Waku discovery network. So, nodes of different fleets naturally discover each other as long as there is a common node in the discovery network. For instance, if you run your Waku node, and you add both a test-fleet and a prod-fleet node as bootstrap nodes, you would "connect" these fleets.

alrevuelta commented 1 year ago

@jm-clius I made some progress on the monitoring node #1290

1) "walk" the discv5 DHT in the network to get an idea of the number of nodes participating in the network 2) try to get an idea of the number of unique peer IDs seen in the network (using various discovery methods) 3) keep track of the number of messages per: protocol content topic application (first part of content topic)

So far 1) and 2) are covered with an extra extension. Beyond discovering peers the monitoring node also tries to connect to them, identifying their user-agent (i.e. nwaku, go-waku) and supported protocols (i.e. /vac/waku/relay/2.0.0). See more in the PR. Some of the metrics:

total number of discovered nodes
number of discovered nodes supporting each protocol (as advertised by ENR): lighpush,store,...
number of nodes that we could connect to
number of nodes with each user-agent (i.e. nwaku, go-waku, etc)
number of nodes supporting each protocol, after a ok connection
some extra metrics on ips, location, etc.

More on that in the PR, but wanted to discuss 3). Some thoughts:

With the current size of the network, it's ok to monitor these messages, but as we scale perhaps will become impossible, because it will require a single node to be aware of everything that is happening on the network and the resources cpu/mem/bandwidth can really skyrocket.
Any specific topics that we want to monitor? I mean, the network may be full of topics, so which ones do we want to monitor? afaik /waku/2/default-waku/proto is common, should we monitor that just one?
Can you elaborate more on "keep track of the number of messages per protocol"? Perhaps an example?
I'm trying to differentiate between the messages that are gossiped and the node can get, and the ones that are point-to-point and we can't get.
Will split this into 2 PRs, one for 1)2) and one for 3)

jm-clius commented 1 year ago

@alrevuelta, thanks for the great progress on the monitoring node! I will review (1) and (2) also in the WIP PR.

On (3):

if acting as a "normal" relay node, the node will automatically monitor the pubsubTopics that it's subscribed to. In general nodes subscribe only to the default pubsubTopic, namely /waku/2/default-waku/proto and relays all messages on that topic. This will be the focus of the monitoring node for now (though I see no reason to limit it to only one pubsubTopic, in case we introduce sharding in future).

as we scale perhaps will become impossible, because it will require a single node to be aware of everything

this may be true, but that would imply that we need to introduce pubsubTopic sharding for all nodes in the network. Currently any given node is aware of every relay message on the network (since we share a single pubsubTopic) and just counting content topics does not seem like a major extra requirement to me?

Can you elaborate more on "keep track of the number of messages per protocol"?

Actually I'm not sure now what I meant. A monitoring node will by definition only be aware of relay messages. Ignore this requirement for now - we are far more interested in counting the number of applications on the network. 😅

I'm trying to differentiate between the messages that are gossiped and the node can get, and the ones that are point-to-point and we can't get.

Exactly, though all point-to-point protocols by approximation require messages to be relayed (even if you interact directly with a service node using a request-response protocol it will either gossip a message on your behalf, or return gossiped messages).

Will split this into 2 PRs, one for 1)2) and one for 3)

Makes sense to me!

alrevuelta commented 1 year ago

As a followup on this, the last remaining task to close this would be to run the networkmonitor in our fleets and make all the data available via a grafana dashboard (with the already discussed restrictions).

jm-clius commented 1 year ago

@alrevuelta I'm moving this issue then to the next release milestone to reflect the outstanding task (an alternative would be to create a separate issue for that task and close this one, but that may be unnecessary admin :) )

waku-org / nwaku