migalabs / armiarma

Armiarma is a Libp2p open-network crawler with a current focus on Ethereum's CL network
https://monitoreth.io
MIT License

Feature: Support multiple crawlers #12

Closed: alrevuelta closed this issue 1 year ago

alrevuelta commented 3 years ago

Opening this issue to discuss both the requirements and a possible implementation of a new feature that would allow running multiple crawlers.

Motivation:

Requirements

A naive architecture to have as a starting point: [architecture diagram]

Tasks

leobago commented 3 years ago

Thank you @alrevuelta. Great points!

At this point, we already have the dashboard showing multiple different crawler outputs. Ideally, we would also like to show the aggregate of all crawlers together, as you mention in your post, so users can choose whether they want to see the data from the crawler running in Frankfurt, the one running in Sydney, or the aggregate of all the crawlers together.

Sharing data between the crawlers to tune their interaction with the rest of the network is also a great idea; I think that can come as a second step.

alrevuelta commented 3 years ago

At this point, we already have the dashboard showing multiple different crawler outputs. Ideally, we would also like to show the aggregate of all crawlers together, as you mention in your post, so users can choose whether they want to see the data from the crawler running in Frankfurt, the one running in Sydney, or the aggregate of all the crawlers together.

I wasn't actually thinking about the possibility of accessing each crawler's metrics independently, but it's a very good idea. The main issue we need to address, imho, is that we can't just merge the metrics "blindly". I mean, we can't merge the current metrics as they are, because they don't register the peer id: if we did, the same peer could end up being counted once per crawler. So we should somehow export the whole peer information from each crawler and send it to another entity/module that processes the aggregated information.
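
To make the double-counting concern concrete, here is a minimal sketch of an aggregation step keyed by peer id (hypothetical types, not the current armiarma code): two crawlers reporting the same peer collapse into a single entry instead of being counted twice.

```go
package aggregate

import "time"

// PeerRecord is an illustrative subset of what a crawler could export.
type PeerRecord struct {
	PeerID    string
	CrawlerID string
	Client    string
	LastSeen  time.Time
}

// Merge folds records coming from several crawlers into one entry per peer.
func Merge(records []PeerRecord) map[string]PeerRecord {
	merged := make(map[string]PeerRecord)
	for _, r := range records {
		prev, seen := merged[r.PeerID]
		if !seen || r.LastSeen.After(prev.LastSeen) {
			// Keep the most recent observation; a real implementation would
			// combine fields (attempts, latencies, ...) instead of overwriting.
			merged[r.PeerID] = r
		}
	}
	return merged
}
```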

I updated the architecture above; let me know what you think. Some quick notes:

alrevuelta commented 3 years ago

Ideas for the gRPC API exposed by the Crawler Server:

We can focus first on the IdentifiedPeer function to start with the integration of the client and server.
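
As a starting point for that discussion, here is a rough sketch of what the IdentifiedPeer call could look like, written as a plain Go interface rather than the generated gRPC code; all names and fields are assumptions taken from this thread, not an actual .proto definition.

```go
package crawlerapi

import "context"

// IdentifiedPeerRequest carries the information the crawler learned from the
// libp2p identify exchange, plus the id of the crawler that reports it.
type IdentifiedPeerRequest struct {
	CrawlerID string
	PeerID    string
	UserAgent string
	Addresses []string
	Protocols []string
}

type IdentifiedPeerResponse struct {
	Accepted bool
}

// CrawlerServer is the surface the aggregating server would expose.
type CrawlerServer interface {
	IdentifiedPeer(ctx context.Context, req *IdentifiedPeerRequest) (*IdentifiedPeerResponse, error)
}
```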

Ideas:

cortze commented 3 years ago

As discussed with @alrevuelta, the current line of work starts by establishing the standard gRPC message structure that will be exchanged between the Crawler and the Server.

Notes about the proposed message structure:

  1. I would add a first crawler identification message for when the crawler has to approach the server (I'm not sure whether the crawler validation needs to be done through gRPC, but the crawler should transmit its basic information).

    • NewCrawlerIdentification: including the crawler_id, the location, the launching_time, and further crawler metadata. However, every message should include the crawler_id to link the message to a given crawler on the server.
  2. I would also change ConnectedToPeer to ConnectionAttempt and add some connection metadata, to classify the reachability of the peer.

  3. Regarding the DisconnectedPeer, I have my doubts about what happens if the connection suddenly gets interrupted and the crawler can't notify the server of the peer disconnection. Perhaps we have to rethink the strategy here; it could be interesting to add some periodic metadata requests from the server to the crawler to check whether there is anything new to update about that peer. I'm opening this as a possible discussion point.
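
For illustration, the three messages discussed above could look roughly like this, shown as Go structs for brevity (the real artifact would be a .proto file, and all field names are assumptions):

```go
package crawlerapi

import "time"

// NewCrawlerIdentification is sent once when a crawler approaches the server.
type NewCrawlerIdentification struct {
	CrawlerID  string
	Location   string
	LaunchTime time.Time
	Metadata   map[string]string
}

// ConnectionAttempt replaces ConnectedToPeer and records whether the peer was
// reachable, so the server can classify reachability.
type ConnectionAttempt struct {
	CrawlerID string // every message carries the crawler_id, as noted above
	PeerID    string
	Timestamp time.Time
	Success   bool
	Error     string
}

// DisconnectedPeer may never arrive if the connection drops abruptly, which is
// why periodic metadata refreshes are suggested above as a fallback.
type DisconnectedPeer struct {
	CrawlerID string
	PeerID    string
	Timestamp time.Time
}
```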

About the proposed ideas, I like the idea of adding the Prometheus export into the metrics module, so that the PeerMetrics struct is the baseline of the export. This way we achieve the common interface you talked about. It would also imply unifying PeerMetrics as the storage unit for each peer's info. This also means restructuring the /metrics folder, removing duplicated/outdated code and producing an agnostic module (usable by both the crawler and the server).
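
A minimal sketch of what a Prometheus export driven by PeerMetrics could look like, assuming the standard prometheus/client_golang library; PeerMetricsSnapshot and its fields are placeholders rather than the real struct:

```go
package metrics

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// PeerMetricsSnapshot is a stand-in for the shared PeerMetrics structure.
type PeerMetricsSnapshot struct {
	PeerID     string
	ClientType string
	Connected  bool
}

var connectedPeers = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "crawler_connected_peers",
		Help: "Connected peers observed by the crawler, by client type.",
	},
	[]string{"client"},
)

// Export recomputes the gauge from the current set of PeerMetrics entries, so
// the same code path can serve both the crawler and the aggregating server.
func Export(peers []PeerMetricsSnapshot) {
	connectedPeers.Reset()
	for _, p := range peers {
		if p.Connected {
			connectedPeers.WithLabelValues(p.ClientType).Inc()
		}
	}
}

// Serve registers the collector and exposes the standard /metrics endpoint.
func Serve(addr string) error {
	prometheus.MustRegister(connectedPeers)
	http.Handle("/metrics", promhttp.Handler())
	return http.ListenAndServe(addr, nil)
}
```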

alrevuelta commented 3 years ago

I would add a first crawler identification message for when the crawler has to approach the server (I'm not sure whether the crawler validation needs to be done through gRPC, but the crawler should transmit its basic information).

I agree that the crawler has to identify itself somehow, so that we know the origin of the peer information, but I think this should be done periodically as part of the data we send with each rpc call. I mean, imho we should include the crawler information (such as id, location, and public key) in every message that is sent; a minimal sketch of such a per-message envelope follows the pros/cons below. The main pro I see is:

Whereas the main con:
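
For reference, the per-message envelope mentioned above could look like this (illustrative types only, not a committed design):

```go
package crawlerapi

// CrawlerInfo is repeated in every message instead of being sent once at
// registration time, so the server stays stateless about crawler sessions.
type CrawlerInfo struct {
	ID        string
	Location  string
	PublicKey []byte
}

// IdentifiedPeerMsg shows how an individual RPC payload would embed it; the
// trade-off is repeating the same bytes on every call.
type IdentifiedPeerMsg struct {
	Crawler CrawlerInfo
	PeerID  string
}
```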

About the proposed ideas, I like the idea of adding the Prometheus export into the metrics module, so that the PeerMetrics struct is the baseline of the export. This way we achieve the common interface you talked about. It would also imply unifying PeerMetrics as the storage unit for each peer's info. This also means restructuring the /metrics folder, removing duplicated/outdated code and producing an agnostic module (usable by both the crawler and the server).

Totally agree, we are on the same page here. As you say, the metrics folder should contain the code that can be reused by the armiarma-client and the armiarma-server, sharing a common interface. The data can then be fed via rumor (in the client case) or via multiple crawlers (in the server case).
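
Continuing the hypothetical metrics package from the sketch above, the shared interface could be as small as this: the client feeds it from rumor events, the server feeds it from the gRPC stream, and the export/aggregation side only ever sees the interface.

```go
package metrics

// PeerStore is the common, source-agnostic view over per-peer information,
// regardless of whether the data was produced locally or received remotely.
type PeerStore interface {
	// Upsert stores or merges the information known about one peer.
	Upsert(peerID string, info PeerMetricsSnapshot) error
	// Snapshot returns the current view, ready to be exported or aggregated.
	Snapshot() ([]PeerMetricsSnapshot, error)
}
```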

cortze commented 1 year ago

After a while, having several crawlers located around the globe showed little difference in the network's perspective. Although the idea is excellent, it seems much more straightforward to aggregate the metrics in the PostgreSQL database whenever something more specific needs to be checked. Closing the issue, as there is no plan to work in this direction.