Optimize butterfly rumor traffic

raskchanky commented 5 years ago

Right now when a new supervisor joins a butterfly network, there's a storm of network traffic as existing supervisors share their rumors with the new member and the new member shares its newly received rumors back to everyone else.

@baumanj @christophermaier and I have had some discussions on this topic and written up thoughts in a separate document. I will quote some of that discussion here.

In our current implementation, we use a form of rumor mongering, but instead of considering individual rumors hot from the perspective of a specific member, all members keep a record of all rumors' hotness with respect to all other members. (It’s not clear why the implementation deviates from the published algorithm in this way.) This has the benefit of ensuring completeness, but a the cost of O(N²) complexity in space, time and network traffic. To make matters worse, despite transmitting rumors over a reliable protocol (ZMQ-based TCP), rumors transition from hot to not based on a deterministic transmission count (as opposed to probabilistic "coin") and this transition is blind to feedback from the recipient about whether the rumor was already known. This results in every rumor being sent to every other member, by every other member at least 3 times. This implies that at least 66% of all our rumor-based network traffic is unnecessary, and possibly considerably more.

This particular piece has gotten somewhat better with the introduction of the HAB_RUMOR_SHARE_LIMIT environment variable that allows users to tune how many times a rumor will be shared with a given member.

Our system contains several different kinds of rumors whose scope of interest is either the entire supervisor network or a particular service group: Service: network Departure: network ServiceConfig/ServiceFile: service group Election/ElectionUpdate: service group

Despite this, all rumors are currently distributed to all members of the supervisor network. Since this is generally wasteful, a natural optimization would be to distribute rumors to their minimal relevant group. However, while ServiceConfig and ServiceFile rumors are only of interest to service group members, since they are currently disseminated to the entire network, configuration rumors can be injected by non service group members before the service itself is loaded by any of the members. This sort of pre-seeded configuration is a use case in practice, but is somewhat counter-intuitive and likely the result of the fact that Service rumors advertising binds are distributed before services are completely up. We should investigate addressing this use case in a better way which allows more efficient distribution of service group only rumors.

It would be terrific if we had a way of optimizing network traffic so that only the relevant bits are sent. One easy win would be to only send rumors relevant to a service group to that specific group, instead of to the network as a whole.

Additionally, optimizations are possible to reduce network traffic in general. In the Xerox paper that discusses rumor mongering, the authors explain how when they switched to an active anti-entropy mechanism, they introduced the notion of exchanging hashes to cut down on network traffic. The idea here (as it relates to butterfly) would be for each peer to generate a hash that represents the sum total of all of the rumor stores (effectively butterfly's state) and exchange that hash with peers. If two peers exchange hashes and they match, then there's no need for any rumor exchange between those peers at all. Only if the hashes differ would there be a need for a rumor exchange.

There are probably other optimizations available as well. This issue should serve as a place to have those discussions before deciding on an overall direction.

baumanj commented 5 years ago

Another optimization that I've been thinking about is a mechanism for a new member to request a bulk update of all rumors (basically the .rst file of another member) when joining, rather than relying on the gossip mechanism to bring them up to speed.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. We value your input and contribution. Please leave a comment if this issue still affects you.

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. We value your input and contribution. Please leave a comment if this issue still affects you.

habitat-sh / habitat

Optimize butterfly rumor traffic #6675