The existing (*Memberlist).Join method can take a long time to complete for large clusters. The problem is exacerbated when some of the addresses to join are non-existent IPs, because we end up waiting for the full TCPTimeout on each of them.
For example, we've observed in grafana/mimir that a full join initiated while most of the cluster members are restarting and changing IPs may take as long as 25 minutes. Nodes that are in the middle of a (*Memberlist).Join cannot be gracefully shut down until Join returns.
Proposal
Add a context.Context argument to (*Memberlist).Join and check it between the pushPull calls made to each node.
Alternatively, if you don't want to break existing clients, we can create a new method, JoinContext, that does the above.
I'm creating this issue to get feedback on the idea. After discussion I am happy to open a PR.