Open notnoop opened 3 years ago
I'd be very surprised if anyone relied on this behavior
Our systems rely on this exact behavior, which seems to break once TLS is enabled (that is a different story, but I found this issue while searching for bug reports or solutions).
When starting a new cluster, Nomad servers can rely on Consul for service discovery so that operators do not have to set IP addresses. If Consul is federating multiple clusters, Nomad queries the local DC first; if no servers are found there, it queries the other Consul DCs in a random order. This behavior can be very surprising and can cause otherwise isolated clusters to join each other unexpectedly.
Note: A Consul datacenter typically represents a single cloud region, and maps more closely to Nomad's region concept.
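The lookup order described above can be sketched roughly as follows. This is an illustrative Python model, not Nomad's actual code: `discover_servers`, the `lookup` callback, the `local_only` switch, the datacenter names, and the addresses are all hypothetical. `local_only=True` models restricting the query to the local DC.

```python
import random
from typing import Callable, List, Optional, Tuple

# Hypothetical lookup: maps a Consul datacenter name to the Nomad server
# addresses registered there. A real agent would query the Consul catalog.
Lookup = Callable[[str], List[str]]


def discover_servers(
    local_dc: str,
    all_dcs: List[str],
    lookup: Lookup,
    local_only: bool = False,
    rng: Optional[random.Random] = None,
) -> Tuple[Optional[str], List[str]]:
    """Return (datacenter, servers) for the first datacenter that has servers."""
    rng = rng or random.Random()

    # Step 1: always try the local datacenter first.
    servers = lookup(local_dc)
    if servers:
        return local_dc, servers

    # Hypothetical switch: stop here instead of falling back to remote DCs.
    if local_only:
        return None, []

    # Step 2: fall back to the remaining datacenters in a random order.
    remote = [dc for dc in all_dcs if dc != local_dc]
    rng.shuffle(remote)
    for dc in remote:
        servers = lookup(dc)
        if servers:
            return dc, servers
    return None, []


# Made-up catalog: dc1 already runs Nomad servers, dc2 is brand new.
catalog = {"dc1": ["10.0.1.10", "10.0.1.11"], "dc2": []}


def lookup_catalog(dc: str) -> List[str]:
    return catalog.get(dc, [])


# A fresh server in dc2 finds nothing locally and ends up with dc1's servers.
found_dc, found = discover_servers("dc2", ["dc1", "dc2"], lookup_catalog)

# With the local-only switch, the same server finds no peers to join.
isolated_dc, isolated = discover_servers(
    "dc2", ["dc1", "dc2"], lookup_catalog, local_only=True
)
```

In this model the first server of a new datacenter always falls through to a remote datacenter, because its own catalog entry is still empty.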
Consider a user who runs a Consul and Nomad installation in Ohio, `us-east-2`, plans to expand to Mumbai, `ap-south-1`, and sets up Consul federation. When Mumbai's first Nomad server starts up, it discovers the Ohio cluster and joins it! The other Mumbai servers will in turn discover their peer and likewise join the Ohio cluster. The newly created Ohio-Mumbai Raft membership will have an expanded quorum and suffer long latencies. Splitting the resulting cluster in half may result in loss of quorum and service disruption.
This is extremely surprising! The logic dates back to the original PR, PR 1276, which provides no context on the choice.
The behavior is less pronounced in "real" life because of several mitigating factors, though none is perfect in preventing the issue: Nomad uses local/private network addresses to communicate, so cross-region packets get dropped; customers have network/firewall isolation blocking Nomad admin ports; and Nomad clusters in different regions use different TLS certs.
Next Steps
Nomad should consider changing the logic so that it only queries the local DC. The change is not backward compatible, though I'd be very surprised if anyone relied on this behavior. We would welcome user input here.
Nomad users should consider specifying unique Nomad region names (e.g. instead of the `global` default), or specifying a unique Consul `server_service_name` for each Nomad cluster.
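As a rough illustration of that workaround, the agent configuration for the Mumbai servers might look like the fragment below. The region name `mumbai` and the service name `nomad-mumbai` are made-up values chosen for this example. The sketch applies both suggestions at once, although either one alone is what is proposed above.

```hcl
# Hypothetical Nomad agent configuration for the Mumbai servers.
region     = "mumbai"     # made-up name, replaces the "global" default
datacenter = "ap-south-1"

consul {
  # Made-up per-cluster service name, so these servers do not register
  # under the same Consul service as the Ohio cluster.
  server_service_name = "nomad-mumbai"
}
```

The Ohio servers would get their own distinct values in the same way.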