hashicorp / memberlist

Golang package for gossip based membership and failure detection
Mozilla Public License 2.0

`Join` with context cancelation #291

Open dimitarvdimitrov opened 11 months ago

dimitarvdimitrov commented 11 months ago

Description

The existing (*Memberlist).Join method can take a long time to complete for large clusters. The problem is exacerbated when some of the addresses to join are non-existent IPs and we end up waiting the TCPTimeout duration on each of them.

For example, we've observed in grafana/mimir that a full join initiated while most of the cluster members are restarting and changing IPs can take as long as 25 minutes. A node that is in the middle of a (*Memberlist).Join cannot be gracefully shut down until Join returns.

Proposal

Add a context.Context argument to (*Memberlist).Join and check it between the push/pull exchanges with each node, so that a canceled context stops the join before the next node is contacted.

Alternatively, if you don't want to break existing clients, we can add a new method, JoinContext, which does the above.

I'm creating this issue to get feedback on the idea. After discussion I am happy to open a PR.