hashicorp / memberlist

Golang package for gossip based membership and failure detection
Mozilla Public License 2.0
3.61k stars 435 forks source link

Fix loss of message after invalidating previous message in queue with single broadcast message #263

Open pstibrany opened 2 years ago

pstibrany commented 2 years ago

We have found a condition where message can get overwritten in the TransmitLimitedQueue. This PR fixes it.

Message can be lost in scenario where queue has a single message in it, and then another message, which invalidates previous message, is added to the queue. During invalidation the old message is removed and the queue is temporarily empty. At that point internal idGen value was reset to 0. However new message is then added to the queue with id computed from previous value of idGen.

If later more messages are added to the queue, new message can overwrite existing message in the queue, because of conflicting id and exact same length. See the unit test for exact scenario we have observed in our system.

Simplest fix seems to be to avoid resetting the idGen when queue is empty. Alternative fix may adjust id of inserted message if idGen was reset during invalidation of other messages in (*TransmitLimitedQueue).queueBroadcast, or pass a flag to (*TransmitLimitedQueue).deleteItem to indicate if reset of idGen should be done or not.

hashicorp-cla commented 2 years ago

CLA assistant check
All committers have signed the CLA.

pracucci commented 2 years ago

A clarification: the id is used for comparsion by limitedBroadcast.Less(). As the comment says:

// If !a.Less(b) && !b.Less(a), we treat this to mean a == b (i.e. we can only
// hold one of either a or b in the tree).

Given two items, if none of the two is less than the other, then they're considered equal by the btree implementation, and the btree.ReplaceOrInsert() (called by TransmitLimitedQueue.addItem()) will replace the previous one (because considered equal).