migalabs / armiarma

Armiarma is a Libp2p open-network crawler with a current focus on Ethereum's CL network
https://monitoreth.io
MIT License

Suggestions and future lines #10

Closed · alrevuelta closed this issue 3 years ago

alrevuelta commented 3 years ago

Not an issue per se, but I have been reading the paper and using armiarma, and I have some questions and suggestions:

Suggestions:

- Use rumor as an external dependency instead of copying its code into this repo, so there is no duplicated code and upstream updates are easier to track.
- Remove the boilerplate under armiarma/src/metrics/export/ and rethink the MetricsDataFrame struct, which seems to duplicate PeerMetrics as a list.
- Add a proper CLI with specific flags so the tool does not have to be launched through an external .sh script.

Questions:

- How do you avoid biasing the client-diversity estimation, given that the user_agent is only available after successfully connecting to a peer?
- If a client opens an inbound connection to us, do we get its user_agent?
- Could the listening port be used as a rough client-type hint, given that only Prysm deviates from the default 9000?
- Which of the fields available from discovery alone (peer id, IP, port) do you find most relevant, e.g. the peer location?
- What information do you expect to extract from each PubSub topic/message, and how would you use it?
- Do you plan to maintain this project, and will "Kumo" replace armiarma?
- What was the rationale behind building on rumor, and is there a downside to claiming to be at genesis instead of being in sync?

Thanks for sharing such a cool project!

cortze commented 3 years ago

Hey Alvaro, thank you very much for the feedback and the suggestions!

Let's go through the mentioned points:

Suggestions:

I can see that armiarma uses rumor, but the code is copied into this repo and it's difficult to differentiate what's new (i.e. which code belongs to armiarma and which belongs to rumor). On top of that, maintaining a repo like this can be challenging, as new changes made in rumor will have to be manually rebased into this repo. I would suggest using rumor as an external dependency, so that we don't have duplicated code: just build on top of rumor without modifying its code. This will help make armiarma more maintainable over time.

Maybe I'm missing something, but isn't all the code in armiarma/src/metrics/export/ boilerplate? I think we can remove most of the files and rethink the MetricsDataFrame struct a bit. Is it really needed? It contains the same information as PeerMetrics, just as a list, doesn't it?
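
Purely as a sketch, the export step could serialize a plain slice of PeerMetrics directly instead of copying it into a parallel structure first; the field names below are illustrative only and do not reproduce the real struct.

```go
package export

import (
	"encoding/json"
	"os"
)

// PeerMetrics stands in for the real struct; the fields are placeholders.
type PeerMetrics struct {
	PeerID    string `json:"peer_id"`
	UserAgent string `json:"user_agent"`
	Country   string `json:"country"`
}

// ExportPeers writes the accumulated metrics without an intermediate
// "data frame" copy of the same information.
func ExportPeers(path string, peers []PeerMetrics) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	return json.NewEncoder(f).Encode(peers)
}
```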

It would be nice to add a proper CLI with some specific flags to armiarma, so that we don't have to rely on an external .sh script to call it.
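
As an illustration of that suggestion, a minimal CLI built on Go's standard flag package could look like the sketch below; the flag names are hypothetical and would have to match whatever the current launcher script passes in.

```go
package main

import (
	"flag"
	"fmt"
)

func main() {
	// Hypothetical options; pick names that mirror the existing .sh arguments.
	outputDir := flag.String("output", "./metrics", "directory where peer metrics are exported")
	forkDigest := flag.String("fork-digest", "", "fork digest of the network to crawl")
	logLevel := flag.String("log-level", "info", "log verbosity")
	flag.Parse()

	fmt.Printf("output=%s fork-digest=%s log-level=%s\n", *outputDir, *forkDigest, *logLevel)
}
```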

The current repo of Armiarma bundles the alpha version of the tool. At the beginning of the project, all the Rumor changes were made in a fork of the official Rumor repo. Once the modifications and the modules grew, we decided to compile everything into a single repo. However, you are right: the tool could simplify the code and the dependencies by adding Rumor as an external library. We could go back to the previous fork that I mentioned to keep the modifications there. That said, Rumor has not received an update since last October, so other alternatives are still open.
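
For illustration, pulling Rumor in as a module dependency rather than keeping a modified copy in-tree could look roughly like the go.mod sketch below; the module path is taken from the upstream repository and the version is a placeholder.

```
module github.com/migalabs/armiarma

go 1.16

// Placeholder version: pin to whichever upstream (or forked) revision is needed.
require github.com/protolambda/rumor v0.0.0
```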

The first approach of the tool prioritized results: testing whether the idea could consolidate and checking the project's potential. So, from now on, we should definitely improve the code organization.

I think it would be a nice feature to be able to run multiple instances of the crawler, with all of them reporting to the same endpoint that merges the data. By doing this we won't be biased anymore by having a single crawler (e.g. due to its location). In the end, it's impossible to know the status of the network 100%, so the best estimation we can get is by randomly sampling it (i.e. having multiple crawlers with some diversity: location, ...).
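
As a rough sketch of that idea (the endpoint, record format, and field names are entirely hypothetical), each crawler instance could POST its peer observations to a small service that merges them by peer ID:

```go
package aggregator

import (
	"encoding/json"
	"net/http"
	"sync"
)

// PeerReport is a hypothetical record one crawler instance would submit.
type PeerReport struct {
	PeerID    string `json:"peer_id"`
	UserAgent string `json:"user_agent"`
	Crawler   string `json:"crawler"` // which instance/location saw the peer
}

// Aggregator merges reports coming from several crawler instances.
type Aggregator struct {
	mu    sync.Mutex
	peers map[string][]PeerReport
}

func New() *Aggregator {
	return &Aggregator{peers: make(map[string][]PeerReport)}
}

// ServeHTTP accepts a POSTed report from any crawler instance.
func (a *Aggregator) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	var rep PeerReport
	if err := json.NewDecoder(r.Body).Decode(&rep); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	a.mu.Lock()
	a.peers[rep.PeerID] = append(a.peers[rep.PeerID], rep)
	a.mu.Unlock()
	w.WriteHeader(http.StatusNoContent)
}
```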

Related to one of the questions you asked, the project is the same as the KUMO one. Through that collaboration with the ONTOCHAIN project, we managed to receive the support of the iExec distributed platform. As a result of this collaboration, we thought about developing a distributed crawling system that would provide a better representation of the network and open the door to new studies. This is something we will be working on in the coming weeks.

Questions:

Based on the paper, I understand that it's only possible to get the user_agent field (a.k.a. client type) after successfully connecting to the peer. This means that if the peer doesn't have its ports open (which is not recommended, but works), we won't be able to dial that peer and hence we can't know the client type. This massively biases the client diversity estimation.

Have you thought about a way of solving this?

Yes. The fact that the tool can't actively connect to some of the network peers doesn't imply that they can't contact us. We have seen that with the last stability updates, the number of connected peers has increased considerably. The original 20% connection ratio has grown to 50% of the peers in the Peerstore (around 8.5k connected peers in total), of which about half were incoming connections.

If we have an inbound connection (a client connects to us), will we get its user_agent information?

The user_agent field corresponds to the peer metadata of the node, which in our latest update is requested directly every time we have a new connection. So yes, even the incoming connections can be identified.
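
For reference, the go-libp2p identify exchange stores the advertised user agent in the peerstore under the "AgentVersion" key, so a hedged sketch of reading it after any connection, inbound or outbound, could look like this (error handling simplified):

```go
package metrics

import (
	"fmt"

	"github.com/libp2p/go-libp2p-core/host"
	"github.com/libp2p/go-libp2p-core/peer"
)

// userAgentOf returns the client identification string (e.g. "Prysm/...")
// once the identify exchange with the given peer has completed.
func userAgentOf(h host.Host, pid peer.ID) (string, error) {
	v, err := h.Peerstore().Get(pid, "AgentVersion")
	if err != nil {
		return "", fmt.Errorf("identify has not completed for %s yet: %w", pid, err)
	}
	ua, ok := v.(string)
	if !ok {
		return "", fmt.Errorf("unexpected AgentVersion type for %s", pid)
	}
	return ua, nil
}
```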

It's a pity that all clients (but Prysm) use the same port 9000 ref. Otherwise we could get a rough estimation of the client type from the port number.

Regarding the port selection on the client teams' side, it probably won't matter much if the tool can be kept running constantly. We have seen that the connection ratio increases proportionally with the time the tool is up and running. If the tool can successfully identify the vast majority of the peers, we can still get a close representation of the network. Even so, a different default port for each of the clients would definitely help us :)
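
Just to make the limits of that port heuristic concrete, a toy sketch (assuming the default ports commonly used at the time, with Prysm on 13000) can only ever separate Prysm from the rest:

```go
package heuristics

// guessClientByPort is a toy heuristic: it assumes default TCP ports
// (Prysm: 13000; most other clients: 9000). Peers on non-default ports
// or behind port remappings defeat it entirely.
func guessClientByPort(tcpPort int) string {
	switch tcpPort {
	case 13000:
		return "likely Prysm (its default port)"
	case 9000:
		return "ambiguous: several clients share the 9000 default"
	default:
		return "unknown (non-default port)"
	}
}
```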

From peer discovery we can get the peer id, IP, and port. It does not provide much information, but we get it without actually being connected to the peer. So if a peer can't be connected to (i.e. its ports are closed), we still have its id and IP, which we can use to derive the location. Which information do you find relevant from this? Perhaps the most important is the peer location?

You are right. Whenever a peer creates its Node Record, or ENR, it has to include the necessary networking info to help other nodes reach it (ID, IP, and ports). As you said, we could display the entire geographical distribution from the received IPs. However, peers still need to be aware of their public IP when advertising it, and the accuracy of the distribution relies on the number of peers the tool can crawl.
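
For illustration, assuming go-ethereum's enode package, the networking fields advertised in a record can be read without ever dialing the peer; the IP can then be fed to a GeoIP lookup to estimate the location:

```go
package discovery

import (
	"fmt"

	"github.com/ethereum/go-ethereum/p2p/enode"
)

// describeENR prints the reachability info contained in a node record,
// all of which is available from discovery alone.
func describeENR(enrString string) error {
	node, err := enode.Parse(enode.ValidSchemes, enrString)
	if err != nil {
		return fmt.Errorf("invalid ENR: %w", err)
	}
	fmt.Printf("id=%s ip=%s tcp=%d udp=%d\n", node.ID(), node.IP(), node.TCP(), node.UDP())
	return nil
}
```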

After connecting to a peer: here we get much more information about the peer, and we also get the PubSub messages, which we can store to analyse later. Mind summarising the relevant information that you expect to get from each topic/message and how you would use it?

Keeping track of the number of messages that each node forwards could anticipate unhealthy behaviours from peers (which doesn't necessarily mean they have dishonest intentions). The committees of each slot are predefined and known by the network. This means that we should expect one block + X number of attestations for each slot. If a peer is somehow broadcasting more than the necessary messages, it could imply that something happened to that node. Furthermore, the database would serve as a baseline to analyse message propagation in the network. Analysing the arrival time of each message could reveal when a given node is suffering unexpected behaviour that could end up in slashings of the hosted validators.
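
A rough sketch of that bookkeeping (type names and the per-slot budget are hypothetical) could count gossip messages per peer and slot and flag anything above the expected block-plus-attestations volume:

```go
package gossip

// slotKey identifies a (peer, slot) pair for message accounting.
type slotKey struct {
	peerID string
	slot   uint64
}

// MessageCounter tallies gossip messages so that peers forwarding far more
// than the expected block + attestations for a slot can be flagged.
type MessageCounter struct {
	counts     map[slotKey]int
	maxPerSlot int // hypothetical budget: 1 block + expected attestations
}

func NewMessageCounter(maxPerSlot int) *MessageCounter {
	return &MessageCounter{counts: make(map[slotKey]int), maxPerSlot: maxPerSlot}
}

// Observe records one message and reports whether the peer is now above budget.
func (m *MessageCounter) Observe(peerID string, slot uint64) (suspicious bool) {
	k := slotKey{peerID: peerID, slot: slot}
	m.counts[k]++
	return m.counts[k] > m.maxPerSlot
}
```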

Do you plan to maintain this project? I have found some references to "Kumo". Will it replace armiarma?

The project will keep going :) We have received great support and feedback from the community, and we are willing to keep on with the project. The participation of the tool in projects such as ONTOCHAIN is just a way to finance the work. In any case, the code and the obtained results will remain accessible to the community. The team has an academic purpose, and keeping everything open source is one of our policies.

Just wondering, what was the rationale behind using rumor? Before knowing of this tool, I thought about modifying the Prysm beacon chain to be able to export the peer information. I haven't done much research, but it should be possible. Do you see any advantage in running a crawler that is actually in sync and serving data to other peers? As far as I know, armiarma tricks the network by claiming to be at genesis.

The first reason we decided to use Rumor was its flexibility for developing test cases and the "simplicity" of creating them. In comparison with an entire functional client, Rumor is simpler to modify, lighter, and modular, which makes it much easier to maintain. At the moment, the tool indeed claims to be at the genesis of the chain, but we don't see a clear disadvantage in doing so. The crawler actively contributes to the message distribution, and so far that is enough to be dialed by peers.

We don't rule out adding some chain-following functionality later (we attempted to connect the tool to a local Prysm node, and it didn't significantly improve the obtained results). Still, at this point, we would prefer to focus on more interesting new network studies.

I hope this long reply answers the suggestions and questions. In any case, feel free to ask or propose anything else.

To summarize, I would like to stress that the project aims to contribute to the healthy development and maintenance of the Eth2 network. Therefore, future needs, comments, and problems (hopefully none) will determine the evolution of the tool. In any case, all feedback and help is more than welcome!

alrevuelta commented 3 years ago

Thanks for your answers :) Closing the issue since everything was resolved.