hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Nomad Manual Clustering does not work without consul #19174

Closed gbarton closed 10 months ago

gbarton commented 10 months ago

Nomad version

Nomad v1.6.3 BuildDate 2023-10-30T12:58:10Z Revision e0497bff14378d68cad76a801cc0eba93ce05039

Operating system and Environment details

cat /etc/*-release
PRETTY_NAME="Debian GNU/Linux 11 (bullseye)"
NAME="Debian GNU/Linux"
VERSION_ID="11"
VERSION="11 (bullseye)"
VERSION_CODENAME=bullseye
ID=debian

Issue

I have been trying to follow the manual steps here to set up a cluster: https://developer.hashicorp.com/nomad/tutorials/manage-clusters/clustering

I have host .1 set up as both a server and a client; that works — it comes up and does its thing. I have host .2 set up as a client only: it is configured with retry_join pointing at .1 and server enabled = false, yet it always overrides the retry_join and makes itself a server.

The one unusual thing about the environment is that it uses netmaker for a meshed WireGuard network across 2 different sites. I have tested and confirmed full connectivity; that is what the explicit host_network block, advertise, and bind_addr settings are for.

The consul settings came from looking at other issues about Nomad falling back to Consul on a failed connect. They had no effect.

The eventual goal is to link many mobile sites up with nomad for local access to content/capabilities via a wireguard mesh network. 3 server/clients will be hosted at a main site, and clients will run in several others.

Any help is greatly appreciated!

Reproduction steps

Start a Nomad agent with the following configuration on host 1:

server {
  enabled = true
  bootstrap_expect = 1
  #server_join {
  #  retry_join = ["10.232.232.1:4648"]
  #}
}
data_dir = "/nomad/data/"
bind_addr = "10.232.232.1"
advertise {
  http = "10.232.232.1"
  rpc = "10.232.232.1"
  serf = "10.232.232.1"
}
client {
  enabled = true
  host_network "netmaker" {
    interface = "netmaker"
  }
  server_join {
    retry_join = ["10.232.232.1:4647"]
  }
}
consul {
  client_auto_join = false
}

Start a second agent with the following configuration on host 2:

server {
  enabled = false
}
datacenter = "dc2"
data_dir = "/nomad/data/"
bind_addr = "10.232.232.3"
advertise {
  http = "10.232.232.3"
  rpc = "10.232.232.3"
  serf = "10.232.232.3"
}
client {
  enabled = true
  host_network "netmaker" {
    interface = "netmaker"
  }
  server_join {
    retry_join = ["10.232.232.1:4647"]
  }
}
consul {
  auto_advertise = false
  client_auto_join = false
  server_auto_join = false
}

Expected Result

Host 2 client joins host 1 server.

Actual Result

Host 2 immediately ignores the retry_join, starts itself as a server, and elects itself leader.

Job file (if appropriate)

Nomad Server logs (if appropriate)

Nothing relevant appears in them.

Nomad Client logs (if appropriate)

Host 2:

==> Loaded configuration from /etc/nomad/local.json
==> Starting Nomad agent...
==> Nomad agent configuration:

       Advertise Addrs: HTTP: 10.232.232.3:4646; RPC: 10.232.232.3:4647; Serf: 10.232.232.3:4648
            Bind Addrs: HTTP: [10.232.232.3:4646]; RPC: 10.232.232.3:4647; Serf: 10.232.232.3:4648
                Client: true
             Log Level: DEBUG
               Node Id: 96651f93-aabc-8038-a47c-0175c86e75b2
                Region: global (DC: dc2)
                Server: true
               Version: 1.6.3

==> Nomad agent started! Log data will stream in below:

    2023-11-25T20:34:40.965Z [INFO]  nomad.raft: initial configuration: index=1 servers="[{Suffrage:Voter ID:cedd12ed-f3cc-bbe3-c7b8-7462b2328b0a Address:10.232.232.3:4647}]"
    2023-11-25T20:34:40.965Z [INFO]  nomad.raft: entering follower state: follower="Node at 10.232.232.3:4647 [Follower]" leader-address= leader-id=
    2023-11-25T20:34:40.967Z [INFO]  nomad: serf: EventMemberJoin: rpi3-trailer.global 10.232.232.3
    2023-11-25T20:34:40.968Z [INFO]  nomad: starting scheduling worker(s): num_workers=4 schedulers=["service", "batch", "system", "sysbatch", "_core"]
    2023-11-25T20:34:40.968Z [DEBUG] nomad: started scheduling worker: id=5cced61e-8c3e-9207-57fd-24f7794d15a0 index=1 of=4
    2023-11-25T20:34:40.968Z [DEBUG] nomad: started scheduling worker: id=536389d8-d507-f461-70d4-d5cd8954a230 index=2 of=4
    2023-11-25T20:34:40.968Z [DEBUG] worker: running: worker_id=5cced61e-8c3e-9207-57fd-24f7794d15a0
    2023-11-25T20:34:40.969Z [DEBUG] nomad: started scheduling worker: id=991cd61e-bb62-70f1-8358-89a4642d22ba index=3 of=4
    2023-11-25T20:34:40.969Z [DEBUG] nomad: started scheduling worker: id=ec9cc2c5-b153-8783-ea8a-237749778ad3 index=4 of=4
    2023-11-25T20:34:40.969Z [INFO]  nomad: started scheduling worker(s): num_workers=4 schedulers=["service", "batch", "system", "sysbatch", "_core"]
    2023-11-25T20:34:40.969Z [DEBUG] worker: running: worker_id=536389d8-d507-f461-70d4-d5cd8954a230
    2023-11-25T20:34:40.969Z [DEBUG] worker: running: worker_id=991cd61e-bb62-70f1-8358-89a4642d22ba
    2023-11-25T20:34:40.969Z [WARN]  agent.plugin_loader: skipping external plugins since plugin_dir doesn't exist: plugin_dir=/nomad/data/plugins
    2023-11-25T20:34:40.969Z [DEBUG] worker: running: worker_id=ec9cc2c5-b153-8783-ea8a-237749778ad3
    2023-11-25T20:34:40.970Z [INFO]  nomad: adding server: server="rpi3-trailer.global (Addr: 10.232.232.3:4647) (DC: dc2)"
    2023-11-25T20:34:40.970Z [DEBUG] nomad.keyring.replicator: starting encryption key replication
    2023-11-25T20:34:40.979Z [DEBUG] agent.plugin_loader.docker: using client connection initialized from environment: plugin_dir=/nomad/data/plugins
    2023-11-25T20:34:40.981Z [INFO]  agent: detected plugin: name=docker type=driver plugin_version=0.1.0
    2023-11-25T20:34:40.981Z [INFO]  agent: detected plugin: name=raw_exec type=driver plugin_version=0.1.0
    2023-11-25T20:34:40.981Z [INFO]  agent: detected plugin: name=exec type=driver plugin_version=0.1.0
    2023-11-25T20:34:40.981Z [INFO]  agent: detected plugin: name=qemu type=driver plugin_version=0.1.0
    2023-11-25T20:34:40.981Z [INFO]  agent: detected plugin: name=java type=driver plugin_version=0.1.0
    2023-11-25T20:34:40.982Z [ERROR] client.cpuset.v2: failed to enabled minimum set of cgroup controllers; disabling cpuset management: error="write /sys/fs/cgroup/cgroup.subtree_control: device or resource busy"
    2023-11-25T20:34:40.983Z [INFO]  client: using state directory: state_dir=/nomad/data/client
    2023-11-25T20:34:40.983Z [INFO]  client: using alloc directory: alloc_dir=/nomad/data/alloc
    2023-11-25T20:34:40.983Z [INFO]  client: using dynamic ports: min=20000 max=32000 reserved=""
    2023-11-25T20:34:40.990Z [DEBUG] client.fingerprint_mgr: built-in fingerprints: fingerprinters=["arch", "bridge", "cgroup", "cni", "consul", "cpu", "host", "landlock", "memory", "network", "nomad", "plugins_cni", "signal", "storage", "vault", "env_aws", "env_gce", "env_azure", "env_digitalocean"]
    2023-11-25T20:34:40.990Z [INFO]  client.fingerprint_mgr.cgroup: cgroups are available
    2023-11-25T20:34:40.991Z [DEBUG] client.fingerprint_mgr: fingerprinting periodically: fingerprinter=cgroup initial_period=15s
    2023-11-25T20:34:40.991Z [DEBUG] client.fingerprint_mgr: CNI config dir is not set or does not exist, skipping: cni_config_dir=/opt/cni/config
    2023-11-25T20:34:41.000Z [DEBUG] client.fingerprint_mgr: fingerprinting periodically: fingerprinter=consul initial_period=15s
    2023-11-25T20:34:41.002Z [DEBUG] client.fingerprint_mgr.cpu: detected CPU model: name=Cortex-A53
    2023-11-25T20:34:41.002Z [DEBUG] client.fingerprint_mgr.cpu: detected CPU frequency: mhz=1400
    2023-11-25T20:34:41.002Z [DEBUG] client.fingerprint_mgr.cpu: detected CPU core count: EXTRA_VALUE_AT_END=4
    2023-11-25T20:34:41.002Z [WARN]  client.fingerprint_mgr.cpu: failed to detect set of reservable cores: error="openat2 /sys/fs/cgroup/nomad.slice/cpuset.cpus.effective: no such file or directory"
    2023-11-25T20:34:41.004Z [WARN]  client.fingerprint_mgr.landlock: failed to fingerprint kernel landlock feature: error="function not implemented"
    2023-11-25T20:34:41.006Z [DEBUG] client.fingerprint_mgr.network: unable to read link speed: path=/sys/class/net/lo/speed device=lo
    2023-11-25T20:34:41.007Z [DEBUG] client.fingerprint_mgr.network: link speed could not be detected and no speed specified by user, falling back to default speed: interface=lo mbits=1000
    2023-11-25T20:34:41.007Z [DEBUG] client.fingerprint_mgr.network: detected interface IP: interface=lo IP=127.0.0.1
    2023-11-25T20:34:41.007Z [DEBUG] client.fingerprint_mgr.network: detected interface IP: interface=lo IP=::1
    2023-11-25T20:34:41.008Z [DEBUG] client.fingerprint_mgr.network: unable to read link speed: path=/sys/class/net/lo/speed device=lo
    2023-11-25T20:34:41.009Z [DEBUG] client.fingerprint_mgr.network: link speed could not be detected, falling back to default speed: interface=lo mbits=1000
    2023-11-25T20:34:41.016Z [DEBUG] client.fingerprint_mgr.network: unable to parse link speed: path=/sys/class/net/eth0/speed device=eth0
    2023-11-25T20:34:41.016Z [DEBUG] client.fingerprint_mgr.network: link speed could not be detected, falling back to default speed: interface=eth0 mbits=1000
    2023-11-25T20:34:41.017Z [DEBUG] client.fingerprint_mgr.network: unable to read link speed: path=/sys/class/net/wlan0/speed device=wlan0
    2023-11-25T20:34:41.017Z [DEBUG] client.fingerprint_mgr.network: link speed could not be detected, falling back to default speed: interface=wlan0 mbits=1000
    2023-11-25T20:34:41.030Z [DEBUG] client.fingerprint_mgr.network: unable to parse link speed: path=/sys/class/net/docker0/speed device=docker0
    2023-11-25T20:34:41.030Z [DEBUG] client.fingerprint_mgr.network: link speed could not be detected, falling back to default speed: interface=docker0 mbits=1000
    2023-11-25T20:34:41.050Z [DEBUG] client.fingerprint_mgr.network: unable to read link speed: path=/sys/class/net/netmaker/speed device=netmaker
    2023-11-25T20:34:41.050Z [DEBUG] client.fingerprint_mgr.network: link speed could not be detected, falling back to default speed: interface=netmaker mbits=1000
    2023-11-25T20:34:41.054Z [WARN]  client.fingerprint_mgr.cni_plugins: failed to read CNI plugins directory: cni_path=/opt/cni/bin error="open /opt/cni/bin: no such file or directory"
    2023-11-25T20:34:41.062Z [DEBUG] client.fingerprint_mgr: fingerprinting periodically: fingerprinter=vault initial_period=15s
    2023-11-25T20:34:41.068Z [DEBUG] client.fingerprint_mgr.env_gce: could not read value for attribute: attribute=machine-type error="Get \"http://169.254.169.254/computeMetadata/v1/instance/machine-type\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
    2023-11-25T20:34:41.068Z [DEBUG] client.fingerprint_mgr.env_gce: error querying GCE Metadata URL, skipping
    2023-11-25T20:34:41.070Z [DEBUG] client.fingerprint_mgr.env_azure: could not read value for attribute: attribute=compute/azEnvironment error="Get \"http://169.254.169.254/metadata/instance/compute/azEnvironment?api-version=2019-06-04&format=text\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
    2023-11-25T20:34:41.072Z [DEBUG] client.fingerprint_mgr.env_digitalocean: failed to request metadata: attribute=region error="Get \"http://169.254.169.254/metadata/v1/region\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
    2023-11-25T20:34:41.072Z [DEBUG] client.fingerprint_mgr: detected fingerprints: node_attrs=["arch", "bridge", "cgroup", "cpu", "host", "network", "nomad", "signal", "storage"]
    2023-11-25T20:34:41.072Z [INFO]  client.plugin: starting plugin manager: plugin-type=csi
    2023-11-25T20:34:41.072Z [INFO]  client.plugin: starting plugin manager: plugin-type=driver
    2023-11-25T20:34:41.073Z [INFO]  client.plugin: starting plugin manager: plugin-type=device
    2023-11-25T20:34:41.073Z [DEBUG] client.device_mgr: exiting since there are no device plugins
    2023-11-25T20:34:41.075Z [DEBUG] client.driver_mgr.docker: using client connection initialized from environment: driver=docker
    2023-11-25T20:34:41.077Z [DEBUG] client.driver_mgr: initial driver fingerprint: driver=java health=undetected description=""
    2023-11-25T20:34:41.079Z [DEBUG] client.driver_mgr: initial driver fingerprint: driver=qemu health=undetected description=""
    2023-11-25T20:34:41.080Z [DEBUG] client.driver_mgr: initial driver fingerprint: driver=raw_exec health=healthy description=Healthy
    2023-11-25T20:34:41.080Z [DEBUG] client.driver_mgr: initial driver fingerprint: driver=exec health=healthy description=Healthy
    2023-11-25T20:34:41.081Z [DEBUG] client.plugin: waiting on plugin manager initial fingerprint: plugin-type=driver
    2023-11-25T20:34:41.081Z [DEBUG] client.plugin: waiting on plugin manager initial fingerprint: plugin-type=device
    2023-11-25T20:34:41.081Z [DEBUG] client.plugin: finished plugin manager initial fingerprint: plugin-type=device
    2023-11-25T20:34:41.084Z [DEBUG] client.server_mgr: new server list: new_servers=[10.232.232.3:4647] old_servers=[]
    2023-11-25T20:34:41.188Z [DEBUG] client.driver_mgr: initial driver fingerprint: driver=docker health=healthy description=Healthy
    2023-11-25T20:34:41.188Z [DEBUG] client.driver_mgr: detected drivers: drivers="map[healthy:[raw_exec exec docker] undetected:[java qemu]]"
    2023-11-25T20:34:41.189Z [DEBUG] client.plugin: finished plugin manager initial fingerprint: plugin-type=driver
    2023-11-25T20:34:41.189Z [INFO]  client: started client: node_id=8e3aed98-cd12-b353-ceeb-61f59f6de1b6
    2023-11-25T20:34:41.190Z [DEBUG] http: UI is enabled
    2023-11-25T20:34:41.190Z [DEBUG] http: UI is enabled
    2023-11-25T20:34:41.202Z [INFO]  agent.joiner: starting retry join: servers=10.232.232.1:4647
    2023-11-25T20:34:41.220Z [DEBUG] client.server_mgr: new server list: new_servers=[10.232.232.1:4647] old_servers=[10.232.232.3:4647]
    2023-11-25T20:34:41.220Z [INFO]  agent.joiner: retry join completed: initial_servers=1 agent_mode=client
    2023-11-25T20:34:42.817Z [WARN]  nomad.raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
    2023-11-25T20:34:42.817Z [INFO]  nomad.raft: entering candidate state: node="Node at 10.232.232.3:4647 [Candidate]" term=2
    2023-11-25T20:34:42.818Z [DEBUG] nomad.raft: voting for self: term=2 id=cedd12ed-f3cc-bbe3-c7b8-7462b2328b0a
    2023-11-25T20:34:42.818Z [DEBUG] nomad.raft: calculated votes needed: needed=1 term=2
    2023-11-25T20:34:42.818Z [DEBUG] nomad.raft: vote granted: from=cedd12ed-f3cc-bbe3-c7b8-7462b2328b0a term=2 tally=1
    2023-11-25T20:34:42.818Z [INFO]  nomad.raft: election won: term=2 tally=1
    2023-11-25T20:34:42.818Z [INFO]  nomad.raft: entering leader state: leader="Node at 10.232.232.3:4647 [Leader]"
    2023-11-25T20:34:42.820Z [INFO]  nomad: cluster leadership acquired
    2023-11-25T20:34:42.831Z [DEBUG] nomad.autopilot: autopilot is now running
    2023-11-25T20:34:42.831Z [DEBUG] nomad.autopilot: state update routine is now running
    2023-11-25T20:34:42.834Z [INFO]  nomad.core: established cluster id: cluster_id=a8b0c4de-8a86-58a5-8e44-50f202459dbc create_time=1700944482833432011
    2023-11-25T20:34:42.834Z [INFO]  nomad: eval broker status modified: paused=false
    2023-11-25T20:34:42.835Z [INFO]  nomad: blocked evals status modified: paused=false
    2023-11-25T20:34:42.853Z [INFO]  nomad.keyring: initialized keyring: id=6cd6c66c-91b3-9f61-0c7e-2734cf884d3f
    2023-11-25T20:34:42.895Z [DEBUG] client.server_mgr: new server list: new_servers=[10.232.232.3:4647] old_servers=[10.232.232.1:4647]
    2023-11-25T20:34:42.895Z [INFO]  client: node registration complete
    2023-11-25T20:34:42.898Z [DEBUG] client: updated allocations: index=1 total=0 pulled=0 filtered=0
    2023-11-25T20:34:42.899Z [DEBUG] client: allocation updates: added=0 removed=0 updated=0 ignored=0
    2023-11-25T20:34:42.899Z [DEBUG] client: allocation updates applied: added=0 removed=0 updated=0 ignored=0 errors=0
    2023-11-25T20:34:42.903Z [DEBUG] client: state updated: node_status=ready
    2023-11-25T20:34:43.897Z [DEBUG] client: state changed, updating node and re-registering
    2023-11-25T20:34:43.902Z [INFO]  client: node registration complete
jrasell commented 10 months ago

Hi @gbarton and thanks for raising this issue.

I have not been able to reproduce this locally. Looking at the logs you provided, the agent is starting with server mode enabled (Server: true). It therefore seems the configuration file actually being loaded differs from the config you pasted into the issue. There are other items, such as the log level and datacenter name, which are not listed in the host 2 config but show up as non-default in the logs. Could you please double-check the config file being loaded by the agent, and share the full file if possible?
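One quick way to cross-check this: the agent's /v1/agent/self HTTP endpoint reports the configuration the running agent actually loaded, independent of what you think the file says. The sketch below demonstrates the lookup on a mocked fragment of that response — the field names are assumptions based on Nomad's config structure, so verify them against your version:

```shell
# Against a live agent you would query the real endpoint, e.g.:
#   curl -s http://127.0.0.1:4646/v1/agent/self
# Here we use a mocked fragment of that response to show what to
# look for (field names are assumptions; check your Nomad version).
resp='{"config":{"Server":{"Enabled":true},"Datacenter":"dc2"}}'

# Extract the Server.Enabled value from the JSON fragment.
enabled=$(printf '%s' "$resp" | grep -o '"Enabled":[a-z]*' | cut -d: -f2)
echo "server enabled: $enabled"
```

If this prints true on a node you configured with server { enabled = false }, the agent is not loading the config you expect.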

gbarton commented 10 months ago

Thank you for your reply! Your hint about the config not being quite right was exactly what I needed. I was passing in HCL as a .json file, and for some strange reason it seemed to partially work. As soon as I renamed the file to .hcl, the config worked as expected.
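For anyone landing here with the same symptom: Nomad selects its configuration parser by file extension, so HCL syntax saved under a *.json name goes through the JSON parser and can partially load while silently dropping settings like server.enabled. A minimal sketch of the dispatch and the fix (the path is illustrative, and the *.json/*.hcl rule matches the behavior observed in this issue — verify against your Nomad version):

```shell
# Nomad parses *.json config files as JSON and *.hcl files as HCL,
# so HCL content saved in a .json file can partially load with some
# settings silently dropped. (Path below is illustrative.)
config=/etc/nomad/local.json
case "$config" in
  *.json) parser="JSON" ;;
  *.hcl)  parser="HCL"  ;;
  *)      parser="unknown" ;;
esac
echo "$config will be parsed as $parser"
# Fix: rename the file so the HCL parser is used, e.g.
#   mv /etc/nomad/local.json /etc/nomad/local.hcl
```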

Closing this as a non-issue, thank you!