Stop sending periodic updates

LeeSmet commented 5 months ago

Currently the protocol is a (subset) of the babel routing protocol. The only noteworthy deviation is the fact that router ID's are 40 bytes, and contain a public key (which is used to verify updates). We also include periodic sending of updates as per the babel rfc.

The problem

Since periodic updates for selected routes get advertised to every peer, and most peers in the network currently connect to all public routes, the relationship between the amount of updates received by public nodes and the amount of nodes in the network is generally quadratic(*). With a 1 minute update interval, 1000 nodes equate to 1 million updates per minute, or roughly 16.6K per second which have to be processed. This puts a lot of strain on the process. At the same time, if the network is stable, these updates are actually pretty redundant. It should be noted that we anticipated this problem when starting the project, as it is obvious that periodically announcing updates increases protocol network traffic as the network grows. Regardless, we did implement this, as the periodic updates do ease development. If a trigger for an update is not implemented, the potential issue causes is minor, since a periodic update will announce the route in a reasonable timely fashion.

(*): There is an optimization to not send updates to peers if the peer is the next hop of the selected route for a subnet. Regardless, since only 1 node is omitted per subnet per peer, and there are multiple public peers, most nodes still receive the updates.

Reasons for periodic updates

While the RFC does not really explain why periodic updates are important to the protocol, there are a few things which need to be considered. First of all, the original use for babel is to communicate primarily over multicast UDP, and sometimes unicast UDP. Both of these are inherently protocols which make no guarantees about delivery of messages. While the concept of an "acknowledgment request" exists, this does not guarantee delivery of a message. Conversely, mycelium currently implements connections to peer as unicast, on top of ordered protocols. As such, we are guaranteed that messages will arrive at peers, and there is a form of integrity protection in the protocols. We aren't guaranteed that the peer will process the message, but it is generally feasible to assume that peers who receive messages will also process them, and if they end up discarding them the peer misbehaves (it could happen that a peer discards a message it doesn't understand, for instance if a new protocol message is defined and the peer is on an earlier version. If the message is integral to the protocol, this constitutes a breaking change, in which case it is allowed for peers on different major version to not interoperate). It should be noted that section 3.3 of the RFC explicitly mentions that nodes could rely on periodic updates as a means to not send acknowledgement requests. Thus, this section implies that periodic updates are mainly a fallback to unreliable transports. Note that we only talk about protocol traffic here, not actual data. It is still allowed to use unreliable protocols for data, since loss of data is handled by the overlay if needed. In this regard, #144 is feasible, for instance quic connections could be reworked to send data over udp datagrams while using quic streams for protocol traffic.

Routes also have a hold time, which is calculated from the update interval in the update message. This allows us to clear dead routes from the route table, since those will not be announced anymore. Once again, the route hold time seems to be related to unreliable transports. Since we have detection mechanisms for peers which have died, we can reasonably clean up dead routes from the network (since every node retracts routes which have the dead peer as next hop). The feasiblity condition should prevent loops from forming here, allowing cleaup.

On top of this, periodic updates do convey minor metric fluctuations. Triggered updates should only happen on large metric fluctuations, but a continues stream of small fluctuations would not be caused by this. Additionally, care needs to be taken, since peer link cost is implemented to start higher, and slowly converge to the actual value.

Changes needed

To avoid periodic updates, we need to also disable route hold times. To prevent accidental leaks, we can maintain a "last updated" field on route entries. If a route hasn't been updated in some time (order of hours), we can send a seqno request to refresh the route. This will only happen in fully stable networks. Additionally, we need to send updates for routes as peers along the route converge. This means we would still have quite some protocol traffic initially, but this will subside over time.

Other todo's

We should also check how other routing protocols handle this, such as OSPF

iwanbk commented 5 months ago

We also include periodic sending of updates as per the babel rfc.

ooh, i also noticed this one when reading Babel RFC. Because periodic work like this might become an issue on Phone's battery life.

But Jan said something that i concluded that mycelium doesn't implement it.

@LeeSmet how frequent and how big the periodic updates in mobile phone? Is it linear to the number of public peers it specified when starting mycelium?

LeeSmet commented 5 months ago

Currently it's enabled, but I did mention to Jan that I would remove it as specified here (which might cause the confusion). The amount of updates sent is linear in both thr amount of connected peers, and size of the network. Updates are sent every minute. Note that the issue is mostly in the public nodes (for now).

threefoldtech / mycelium