migalabs / armiarma

Armiarma is a Libp2p open-network crawler with a current focus on Ethereum's CL network
https://monitoreth.io
MIT License

Feature: Support multiple crawlers #12

Closed: alrevuelta closed this issue 1 year ago

alrevuelta commented 3 years ago

Opening this issue to discuss both the requirements and a possible implementation of a new feature that would allow running multiple crawlers.

Motivation:

Requirements

A naive architecture to have as a starting point: [architecture diagram]

Tasks

leobago commented 3 years ago

Thank you @alrevuelta. Great points!

At this point, we already have the dashboard showing multiple different crawler outputs. Ideally, we would also like to show the aggregate of all crawlers together, as you mention in your post, so users can choose whether they want to see the data from the crawler running in Frankfurt, the one running in Sydney, or the aggregate of all the crawlers together.

Sharing data between the crawlers to tune their interaction with the rest of the network is also a great idea; I think that can come as a second step.

alrevuelta commented 3 years ago

At this point, we already have the dashboard showing multiple different crawler outputs. Ideally, we would also like to show the aggregate of all crawlers together, as you mention in your post, so users can choose whether they want to see the data from the crawler running in Frankfurt, the one running in Sydney, or the aggregate of all the crawlers together.

I wasn't actually thinking about the possibility of accessing each crawler's metrics independently, but it's a very good idea. The main issue we need to address, imho, is that we can't just merge the metrics "blindly". I mean, we can't merge the current metrics as they are, because they don't register the peer id: if we did, the same peer could end up being counted once per crawler. So we should somehow export the whole peer information from each crawler and send it to another entity/module that processes the aggregated information.
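
To make the double-counting concern concrete, here is a minimal sketch of an aggregation step keyed by peer id (hypothetical types, not the current armiarma code): two crawlers reporting the same peer collapse into a single entry instead of being counted twice.

```go
package aggregate

import "time"

// PeerRecord is an illustrative subset of what a crawler could export.
type PeerRecord struct {
	PeerID    string
	CrawlerID string
	Client    string
	LastSeen  time.Time
}

// Merge folds records coming from several crawlers into one entry per peer.
func Merge(records []PeerRecord) map[string]PeerRecord {
	merged := make(map[string]PeerRecord)
	for _, r := range records {
		prev, seen := merged[r.PeerID]
		if !seen || r.LastSeen.After(prev.LastSeen) {
			// Keep the most recent observation; a real implementation would
			// combine fields (attempts, latencies, ...) instead of overwriting.
			merged[r.PeerID] = r
		}
	}
	return merged
}
```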

I updated the architecture above; let me know what you think. Some quick notes:

alrevuelta commented 3 years ago

Ideas for the gRPC API exposed by the Crawler Server:

We can focus first on the IdentifiedPeer function to start with the integration of the client and server.
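
As a starting point for that discussion, here is a rough sketch of what the IdentifiedPeer call could look like, written as a plain Go interface rather than the generated gRPC code; all names and fields are assumptions taken from this thread, not an actual .proto definition.

```go
package crawlerapi

import "context"

// IdentifiedPeerRequest carries the information the crawler learned from the
// libp2p identify exchange, plus the id of the crawler that reports it.
type IdentifiedPeerRequest struct {
	CrawlerID string
	PeerID    string
	UserAgent string
	Addresses []string
	Protocols []string
}

type IdentifiedPeerResponse struct {
	Accepted bool
}

// CrawlerServer is the surface the aggregating server would expose.
type CrawlerServer interface {
	IdentifiedPeer(ctx context.Context, req *IdentifiedPeerRequest) (*IdentifiedPeerResponse, error)
}
```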

Ideas:

cortze commented 3 years ago

As discussed with @alrevuelta, the current line of work starts by establishing the standard gRPC message structure that will be exchanged between the Crawler and the Server.

Notes about the proposed message structure:

  1. I would add a first crawler identification message for when the crawler has to approach the server (I'm not sure whether the crawler validation needs to be done through gRPC, but the crawler should transmit its basic information).

    • NewCrawlerIdentification: including the crawler_id, the location, the launching_time, and further crawler metadata. However, every message should include the crawler_id to link the message to a given crawler on the server.
  2. I would also change ConnectedToPeer to ConnectionAttempt and add some connection metadata, to classify the reachability of the peer.

  3. Regarding the DisconnectedPeer, I have my doubts about what happens if the connection suddenly gets interrupted and the crawler can't notify the server of the peer disconnection. Perhaps we have to rethink the strategy here; it could be interesting to add some periodic metadata requests from the server to the crawler to check whether there is anything new to update about that peer. I'm opening this as a possible discussion point.
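
For illustration, the three messages discussed above could look roughly like this, shown as Go structs for brevity (the real artifact would be a .proto file, and all field names are assumptions):

```go
package crawlerapi

import "time"

// NewCrawlerIdentification is sent once when a crawler approaches the server.
type NewCrawlerIdentification struct {
	CrawlerID  string
	Location   string
	LaunchTime time.Time
	Metadata   map[string]string
}

// ConnectionAttempt replaces ConnectedToPeer and records whether the peer was
// reachable, so the server can classify reachability.
type ConnectionAttempt struct {
	CrawlerID string // every message carries the crawler_id, as noted above
	PeerID    string
	Timestamp time.Time
	Success   bool
	Error     string
}

// DisconnectedPeer may never arrive if the connection drops abruptly, which is
// why periodic metadata refreshes are suggested above as a fallback.
type DisconnectedPeer struct {
	CrawlerID string
	PeerID    string
	Timestamp time.Time
}
```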

About the proposed ideas, I like the idea of adding the Prometheus export into the metrics module, so that the PeerMetrics struct is the baseline of the export. This way we achieve the common interface you talked about. It would also imply unifying PeerMetrics as the storage unit for each peer's info. This also means restructuring the /metrics folder, removing duplicated/outdated code and producing an agnostic module (usable by both the crawler and the server).
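
A minimal sketch of what a Prometheus export driven by PeerMetrics could look like, assuming the standard prometheus/client_golang library; PeerMetricsSnapshot and its fields are placeholders rather than the real struct:

```go
package metrics

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// PeerMetricsSnapshot is a stand-in for the shared PeerMetrics structure.
type PeerMetricsSnapshot struct {
	PeerID     string
	ClientType string
	Connected  bool
}

var connectedPeers = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "crawler_connected_peers",
		Help: "Connected peers observed by the crawler, by client type.",
	},
	[]string{"client"},
)

// Export recomputes the gauge from the current set of PeerMetrics entries, so
// the same code path can serve both the crawler and the aggregating server.
func Export(peers []PeerMetricsSnapshot) {
	connectedPeers.Reset()
	for _, p := range peers {
		if p.Connected {
			connectedPeers.WithLabelValues(p.ClientType).Inc()
		}
	}
}

// Serve registers the collector and exposes the standard /metrics endpoint.
func Serve(addr string) error {
	prometheus.MustRegister(connectedPeers)
	http.Handle("/metrics", promhttp.Handler())
	return http.ListenAndServe(addr, nil)
}
```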

alrevuelta commented 3 years ago

I would add a first crawler identification message for when the crawler has to approach the server (I'm not sure whether the crawler validation needs to be done through gRPC, but the crawler should transmit its basic information).

I agree that the crawler has to identify itself somehow, so that we know the origin of the peer information, but I think this should be done periodically as part of the data we send with each rpc call. I mean, imho we should include the crawler information (such as id, location, and public key) in every message that is sent; a minimal sketch of such a per-message envelope follows the pros/cons below. The main pro I see is:

Whereas the main con:
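
For reference, the per-message envelope mentioned above could look like this (illustrative types only, not a committed design):

```go
package crawlerapi

// CrawlerInfo is repeated in every message instead of being sent once at
// registration time, so the server stays stateless about crawler sessions.
type CrawlerInfo struct {
	ID        string
	Location  string
	PublicKey []byte
}

// IdentifiedPeerMsg shows how an individual RPC payload would embed it; the
// trade-off is repeating the same bytes on every call.
type IdentifiedPeerMsg struct {
	Crawler CrawlerInfo
	PeerID  string
}
```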

About the proposed ideas, I like the idea of adding the Prometheus export into the metrics module, so that the PeerMetrics struct is the baseline of the export. This way we achieve the common interface you talked about. It would also imply unifying PeerMetrics as the storage unit for each peer's info. This also means restructuring the /metrics folder, removing duplicated/outdated code and producing an agnostic module (usable by both the crawler and the server).

Totally agree, we are on the same page here. As you say, the metrics folder should contain the code that can be reused by the armiarma-client and the armiarma-server, sharing a common interface. The data can then be fed via rumor (in the client case) or via multiple crawlers (in the server case).
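
Continuing the hypothetical metrics package from the sketch above, the shared interface could be as small as this: the client feeds it from rumor events, the server feeds it from the gRPC stream, and the export/aggregation side only ever sees the interface.

```go
package metrics

// PeerStore is the common, source-agnostic view over per-peer information,
// regardless of whether the data was produced locally or received remotely.
type PeerStore interface {
	// Upsert stores or merges the information known about one peer.
	Upsert(peerID string, info PeerMetricsSnapshot) error
	// Snapshot returns the current view, ready to be exported or aggregated.
	Snapshot() ([]PeerMetricsSnapshot, error)
}
```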

cortze commented 1 year ago

After a while, having several crawlers located around the globe showed little difference in the network's perspective. Although the idea is excellent, it seems much more straightforward to aggregate the metrics in the PostgreSQL database whenever something more specific needs to be checked. Closing the issue, as there is no plan to work in this direction.