nats-io / nats-server

High-Performance server for NATS.io, the cloud and edge native messaging system.
https://nats.io

Meta Leader Placement #3721

Open · ColinSullivan1 opened this issue 1 year ago

ColinSullivan1 commented 1 year ago

Feature Request

It'd be great to have the ability to place (or prefer) the meta-leader on a specific node, or to constrain it to a set of nodes.

Use Case:

This would be an advanced use case and not recommended for typical deployments.

Proposed Change:

Allow specifying a tag for meta-leader placement, along with the associated CLI change. This may be difficult with Raft.
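
A rough sketch of what this could look like. The server_tags config option exists today (it is used for stream placement); the --tag flag on step-down below is purely hypothetical and only illustrates the proposal:

# nats-server.conf on the preferred node (server_tags is an existing option)
server_name: nats-0
server_tags: ["meta-preferred"]

# hypothetical CLI addition proposed here, not an existing flag
nats server raft step-down --tag=meta-preferred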

Who Benefits From The Change(s)?

See use case.

Alternative Approaches

The current workaround is to step the meta leader down repeatedly until the desired node ends up as leader; see the sketch below.
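
As a concrete illustration of that workaround, a loop along these lines can be used; the server name nats-0 and the "*" leader marker in the report output are assumptions about the deployment:

# keep stepping the meta leader down until the desired server is elected
until nats server report jetstream | grep 'nats-0' | grep -q '\*'; do
  nats server raft step-down
  sleep 2
done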

derekcollison commented 1 year ago

We do allow cluster placement today.

nats server raft step-down -h                                                                                                                                                               
usage: nats server raft step-down [<flags>]

Force a new leader election by standing down the current meta leader

Flags:
  --cluster=CLUSTER  Request placement of the leader in a specific cluster
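
For example, to request that the meta leader land in a particular cluster (the cluster name east is just a placeholder):

nats server raft step-down --cluster=east
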
ColinSullivan1 commented 1 year ago

Yep... was thinking of a particular server or set of servers.

jleni commented 1 year ago

Let's say I have a cluster with three nodes and they cannot yet agree on a meta leader. How can I force nats-0 to be the leader instead of waiting indefinitely?

Example:

[97] 2023/06/13 21:14:03.053224 [WRN] Healthcheck failed: "JetStream has not established contact with a meta leader"
[97] 2023/06/13 21:14:13.052716 [WRN] Healthcheck failed: "JetStream has not established contact with a meta leader"
[97] 2023/06/13 21:14:23.053264 [WRN] Healthcheck failed: "JetStream has not established contact with a meta leader"
[97] 2023/06/13 21:14:33.052999 [WRN] Healthcheck failed: "JetStream has not established contact with a meta leader"
[97] 2023/06/13 21:14:43.053062 [WRN] Healthcheck failed: "JetStream has not established contact with a meta leader"
[97] 2023/06/13 21:14:46.979552 [INF] JetStream cluster no metadata leader
[97] 2023/06/13 21:14:53.053818 [WRN] Healthcheck failed: "JetStream has not established contact with a meta leader"
[97] 2023/06/13 21:15:03.053050 [WRN] Healthcheck failed: "JetStream has not established contact with a meta leader"
[97] 2023/06/13 21:15:03.470356 [WRN] JetStream has not established contact with a meta leader
[97] 2023/06/13 21:15:13.053336 [WRN] Healthcheck failed: "JetStream has not established contact with a meta leader"
[97] 2023/06/13 21:15:22.335674 [INF] JetStream cluster no metadata leader
[97] 2023/06/13 21:15:23.053171 [WRN] Healthcheck failed: "JetStream has not established contact with a meta leader"
[97] 2023/06/13 21:15:33.053000 [WRN] Healthcheck failed: "JetStream has not established contact with a meta leader"
[97] 2023/06/13 21:15:43.052466 [WRN] Healthcheck failed: "JetStream has not established contact with a meta leader"
[97] 2023/06/13 21:15:49.582172 [INF] JetStream cluster no metadata leader
[97] 2023/06/13 21:15:53.052286 [WRN] Healthcheck failed: "JetStream has not established contact with a meta leader"
[97] 2023/06/13 21:16:03.052552 [WRN] Healthcheck failed: "JetStream has not established contact with a meta leader"
[97] 2023/06/13 21:16:13.052606 [WRN] Healthcheck failed: "JetStream has not established contact with a meta leader"
[97] 2023/06/13 21:16:23.052654 [WRN] Healthcheck failed: "JetStream has not established contact with a meta leader"
[97] 2023/06/13 21:16:33.052723 [WRN] Healthcheck failed: "JetStream has not established contact with a meta leader"
derekcollison commented 1 year ago

If they cannot agree, it means either the cluster is malformed or misconfigured, or the peer set is actually larger than the cluster size.

We see this when folks accidentally add other clusters or peers, then turn them off but do not remove them from the JetStream cluster itself through peer-remove.
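
If a decommissioned server was left in the peer set, it can be dropped with the CLI; the exact invocation below (subcommand path and the server name old-nats-4) is an assumption and should be checked against nats server raft --help for your CLI version:

# remove a stale peer from the JetStream meta group (names are placeholders)
nats server raft peer-remove old-nats-4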

Once a cluster is healthy and /healthz returns ok, the upgrade process in the latest Helm charts makes sure not to move on to the next peer until the last upgraded one is back up, operational, and reporting ok from /healthz.
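
For reference, /healthz is served on the server's HTTP monitoring port (8222 by default, when monitoring is enabled), so the same check can be run by hand; the host name below is a placeholder:

# check JetStream / meta-leader readiness via the monitoring endpoint
curl -s http://nats-0.nats.svc:8222/healthz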