paritytech / substrate

Substrate: The platform for blockchain innovators
Apache License 2.0

Feature Request: Local multicast reserved peer discovery #12601

Closed: senseless closed this issue 2 years ago

senseless commented 2 years ago

Context: We're working on deploying public RPC services for all relay networks and common-good parachains. In an attempt to reduce our overall bandwidth consumption, we've begun deploying the --reserved-nodes flag along with a reduction in in/out peers per node. We were able to reduce our average bandwidth utilization from 10 Mbit/s to 3 Mbit/s per node, a 70% decrease.
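To illustrate the maintenance burden, today every node carries a hand-maintained list along these lines (the addresses and peer IDs here are placeholders):

  --reserved-nodes /ip4/10.0.0.2/tcp/30333/p2p/<peer-id-a> \
  --reserved-nodes /ip4/10.0.0.3/tcp/30333/p2p/<peer-id-b> \
  --in-peers 5 \
  --out-peers 5 \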

Problem: Maintaining these reserved-nodes lists on a per-node basis does not scale well, even with Ansible or similar automation handling some of the tasks. If we add a new node to the network, we have to restart every node on the network before they will start communicating with the new peer.

Solution: What would be ideal is to use multicast in a many-to-many configuration for local peer discovery. We envision a system where you would start the node with a --multicast-reserved flag. This would enable multicast and treat multicast-discovered peers as 'reserved peers' (meaning these nodes are added in addition to any peers found through the normal discovery methods and do not count against in/out peers). When a node is discovered, its chain (Kusama, Polkadot, etc.) is determined, ideally from the multicast datagram itself, and connections with matching systems are maintained. Each node should send multicast discovery packets at a regular interval (10-60 seconds). The nodes would still use the existing unicast systems to communicate and transfer data; this is purely for node discovery, not for any sort of data transmission.
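As a rough sketch of the beacon loop we have in mind (the group address, port, and payload format below are placeholders, not an existing protocol), using nothing beyond the Rust standard library:

  use std::net::{Ipv4Addr, SocketAddrV4, UdpSocket};
  use std::time::Duration;

  // Placeholder administratively-scoped group address and port.
  const GROUP: Ipv4Addr = Ipv4Addr::new(239, 255, 42, 99);
  const PORT: u16 = 30555;

  fn main() -> std::io::Result<()> {
      let socket = UdpSocket::bind(SocketAddrV4::new(Ipv4Addr::UNSPECIFIED, PORT))?;
      socket.join_multicast_v4(&GROUP, &Ipv4Addr::UNSPECIFIED)?;
      socket.set_multicast_ttl_v4(4)?; // allow the datagram to cross a few routed hops
      // The read timeout doubles as the beacon interval (10-60s as proposed).
      socket.set_read_timeout(Some(Duration::from_secs(30)))?;

      // Announce this node's chain and multiaddr (placeholder values).
      let beacon = b"kusama /ip4/10.0.0.12/tcp/30333/p2p/<peer-id>";
      loop {
          socket.send_to(beacon, SocketAddrV4::new(GROUP, PORT))?;
          let mut buf = [0u8; 1024];
          // Collect announcements until the read times out, then beacon again.
          while let Ok((len, from)) = socket.recv_from(&mut buf) {
              let msg = String::from_utf8_lossy(&buf[..len]);
              // A real implementation would ignore its own beacon, filter by
              // chain, and hand the multiaddr to the peer-set manager as a
              // reserved peer.
              println!("discovered {from}: {msg}");
          }
      }
  }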

The result is that we can reduce the number of --in-peers and --out-peers per node and rely on the multicast system for our local nodes to find and begin communicating with each other, without massive, unmaintainable reserved-nodes lists. We would have a local network that is, in aggregate, better connected than any single node with a higher in/out peer count, and just as easy to maintain. The reduction in bandwidth will lower the costs associated with our public RPC services.

See the following RFCs for background on multicast and its use in applications: https://datatracker.ietf.org/doc/html/rfc1112 and https://datatracker.ietf.org/doc/html/rfc3170

The existing mDNS system would probably work with an appropriate IGMP setup to span multiple subnets, as long as it were bidirectional and all peers considered each other reserved.

Note: This would also make it much easier for operators to launch sentinel-protected nodes, by enabling this feature and setting their in/out peers to 0. A backend 'protected' node would then only communicate with local nodes, and it would be up to those other local nodes to provide connectivity to the protected node.
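For example, a protected node might then need nothing more than the following flags, where --multicast-reserved is the proposed flag rather than an existing one:

  --multicast-reserved \
  --in-peers 0 \
  --out-peers 0 \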

senseless commented 2 years ago

As a follow up,

Adding an --allow-private-ipv4-reserved flag, which would behave like --allow-private-ipv4 except that such peers become reserved, would work. The best solution for us would be a --mdns-reserved / --multicast-reserved flag that treats all multicast-discovered peers as reserved. There are environments that use non-private IPv4 addressing where a reserved tag on multicast-discovered peers is desirable.

bkchr commented 2 years ago

TL;DR: RPC is not the future.

If you want to reduce the bandwidth, you need to reduce the number of peers you are connected to. As you already said, this can be achieved with the --in-peers and --out-peers flags. So, I don't see any need for local multicast reserved peer discovery.

senseless commented 2 years ago

TL;DR: RPC is not the future.

If you want to reduce the bandwidth, you need to reduce the number of peers you are connected to. As you already said, this can be achieved with the --in-peers and --out-peers flags. So, I don't see any need for local multicast reserved peer discovery.

I can't control the quality of external peers. If one node is only connected to dial-up peers in Antarctica, that node is effectively dead. Reducing in/out peers also opens up the possibility of a Sybil attack. Ideally, my nodes would share their peers internally, relaying any block updates they receive amongst themselves. That would let me reduce the in/out peers of an individual node without sacrificing quality, because of the aggregate number of external connections. This can only be accomplished if the nodes see each other as reserved connections, where maintaining those connections does not count against in/out peers.

If RPC is not the future, you should seriously consider who will maintain the full archive of the blockchain. Currently only RPC nodes have a need or incentive for the full archive; everyone else is incentivized to prune to reduce disk usage.

bkchr commented 2 years ago

If RPC is not the future, you should seriously consider who will maintain the full archive of the blockchain. Currently only RPC nodes have a need or incentive for the full archive; everyone else is incentivized to prune to reduce disk usage.

For sure, however that isn't the topic of this issue. You could also ask: is it really required to keep the entire history of the chain? Probably not.

I can't control the quality of external peers. If one node is only connected to dial-up peers in Antarctica, that node is effectively dead.

For sure you cannot control the "quality", but this is the reason you are connected to multiple nodes and not just one. The possibility of being connected only to dial-up (whatever dial-up means here) nodes in Antarctica should be around 0% (depending on the maximum number of peers).

If you want to build up your special networking topology, you can already do it with the provided tools, a la reserved nodes etc. I honestly don't see any problem here. If you control your topology through some scripts, adding a new node shouldn't be hard.

senseless commented 2 years ago

For sure, however that isn't the topic of this issue. You could also ask: is it really required to keep the entire history of the chain? Probably not.

There are a lot of reasons the entire history of the blockchain needs to be preserved. Taxes are one.

For sure you cannot control the "quality", but this is the reason you are connected to multiple nodes and not just one. The possibility of being connected only to dial-up (whatever dial-up means here) nodes in Antarctica should be around 0% (depending on the maximum number of peers).

If you want to build up your special networking topology, you can already do it with the provided tools, a la reserved nodes etc. I honestly don't see any problem here. If you control your topology through some scripts, adding a new node shouldn't be hard.

With 10 in/out peers or fewer I see a degradation in connectivity: my PV round election scores go from 0 missed votes to a few hundred missed votes or more. As I said in my original post, I would need to restart every single node on my network any time I add a new node; that's not scalable. I'll be running a few hundred nodes per relay network, both baked into parachain clients and as standalone relay-chain clients. I'll see if I can figure out a hack, I guess; you seem uninterested. Meanwhile, the treasury could be paying 60-70% less for infrastructure services than it pays now, without sacrificing quality.

rphmeier commented 2 years ago

Hi @senseless

It is interesting that you experience such a severe reduction in traffic.

As a matter of process and scope, I am going to transfer this issue to https://github.com/paritytech/substrate as the changes you describe are Substrate-wide and don't affect the code here at all, except perhaps the CLI.

With respect to the issue you are facing, I notice in your problem statement that you say:

If we add a new node to the network, we have to restart every node on the network before they will start communicating with the new peer.

It may be that the system_reservedPeers, system_addReservedPeer, and system_removeReservedPeer RPC calls allow you to automate these changes with more traditional devops tooling. I certainly agree that having to restart all nodes when adding to a cluster is not ideal.
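For example, here is a sketch of adding a peer through a node's local HTTP RPC endpoint, using only the Rust standard library; the multiaddr is a placeholder, and the node must accept these methods, which are in the unsafe RPC set:

  use std::io::{Read, Write};
  use std::net::TcpStream;

  fn main() -> std::io::Result<()> {
      // Placeholder multiaddr of the node being added to the reserved set.
      let peer = "/ip4/10.0.0.12/tcp/30333/p2p/<peer-id>";
      let body = format!(
          r#"{{"jsonrpc":"2.0","id":1,"method":"system_addReservedPeer","params":["{peer}"]}}"#
      );
      // 9933 is the default HTTP RPC port; by default the node only accepts
      // RPC connections from localhost.
      let mut stream = TcpStream::connect("127.0.0.1:9933")?;
      let request = format!(
          "POST / HTTP/1.1\r\nHost: 127.0.0.1\r\nContent-Type: application/json\r\nContent-Length: {}\r\nConnection: close\r\n\r\n{}",
          body.len(),
          body
      );
      stream.write_all(request.as_bytes())?;
      let mut response = String::new();
      stream.read_to_string(&mut response)?;
      println!("{response}"); // the HTTP response carries the JSON-RPC result
      Ok(())
  }

The same shape would work for system_removeReservedPeer when a node is retired, and for system_reservedPeers to audit the current set.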

That said, Substrate is already a very large and complex piece of software, and introducing specialized multicast logic would require deep changes across many areas of the stack; I don't know whether there is enough need to justify baking in this functionality. If the RPCs mentioned above are insufficient for the task, even combined with standard tools and a few custom scripts, then it would be better to find another minimal, reusable API to add to the RPC interface, as opposed to getting into the weeds of libp2p, sc-network, and CLI logic.

dcolley commented 2 years ago

For some context: this network will host public RPC servers.

The API call is documented here: https://polkadot.js.org/docs/substrate/rpc#addreservedpeerpeer-text-text. Note that addReservedPeer is not authenticated: if external RPC access is enabled, it can be spammed.

I was able to add a peer to my own server with polkadot.js (screenshot omitted).

For this node I have enabled unsafe RPC methods, as it's an internal network:

  --unsafe-rpc-external \
  --rpc-methods=Unsafe \

For Parity's RPC endpoint, wss://kusama-rpc.polkadot.io, it failed (screenshot omitted).

So, we could script addReservedPeer from localhost on each node.

senseless commented 2 years ago

It should be doable with Ansible and some simple scripting on the host. I didn't realize that RPC call was there. If it's too complex a change, don't worry about it; we can proceed with the Ansible + RPC solution.

Thanks @rphmeier @dcolley

It is interesting that you experience such a severe reduction in traffic.

My goal is to get down to 5 in + 5 out external peers per node. At those levels the peak transfer I was seeing when sending/receiving new blocks was around 3-4 Mbit/s, in contrast to 20 + 20 peers, where baseline levels were typically in the 5-6 Mbit/s range with spikes up to and over 10 Mbit/s when sending/receiving blocks. Going from 40 peers to 10 peers is a ~75% reduction in external peers, so it would make sense for the bandwidth to drop similarly. If we can figure out an easily scalable, low-maintenance way to make our nodes speak to each other, we can reduce the cost of future RPC services to the treasury pretty significantly. I don't expect it will be an issue to deploy a devops solution around the RPC, but a flag that would do this automagically would be amazing.

rphmeier commented 2 years ago

@senseless @dcolley - great to hear that the RPC approach works well enough. I will close this issue now as 'wontfix', simply because the code changes are so involved and would introduce a large maintenance burden. Please don't hesitate to open a new issue if there are improvements to the RPC layer that could make peer-set management easier from external processes.

ehsanhajian commented 2 years ago

I think you're talking about a single datacenter, because multicast support requires IGMP to be enabled on all network devices. In addition, the security concerns around IGMP are serious: if someone spoofs your multicast IP or sends an IGMP flood, that would be terrible to manage. IGMP may be fine for a managed network, but it's very risky.