hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

nomad does not register HTTP tag for server in Consul #23384

Open BrianHicks opened 3 months ago

BrianHicks commented 3 months ago

Nomad version

Nomad v1.8.0

Operating system and Environment details

NixOS 24.05 running on Hetzner cloud VMs.

Issue

When advertise.http is set, Nomad does not register an http tag with Consul. The rpc and serf tags are registered, though.

(This is blocking me from scraping job metrics with Prometheus.)

Reproduction steps

Run Nomad using this config:

{
  "acl": {
    "enabled": true
  },
  "advertise": {
    "http": "{{ GetInterfaceIP \"enp7s0\" }}",
    "rpc": "{{ GetInterfaceIP \"enp7s0\" }}",
    "serf": "{{ GetInterfaceIP \"enp7s0\" }}"
  },
  "consul": {
    "address": "127.0.0.1:8501",
    "ssl": true
  },
  "data_dir": "/var/lib/nomad",
  "datacenter": "us-east",
  "log_level": "TRACE",
  "ports": {
    "http": 4646,
    "rpc": 4647,
    "serf": 4648
  },
  "server": {
    "bootstrap_expect": 1,
    "enabled": true
  },
  "telemetry": {
    "collection_interval": "1s",
    "disable_hostname": true,
    "prometheus_metrics": true,
    "publish_allocation_metrics": true,
    "publish_node_metrics": true
  },
  "tls": {
    "ca_file": "[SNIP]",
    "cert_file": "[SNIP]",
    "http": true,
    "key_file": "[SNIP]",
    "rpc": true,
    "verify_https_client": false,
    "verify_server_hostname": true
  },
  "ui": {
    "enabled": true
  }
}

(Plus a side config I have not shared that sets consul.token.)
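
(For completeness, the agent is started with roughly the following invocation; the actual command line comes from the NixOS module, so treat this as a sketch. The second -config path is the side file mentioned above.)

nomad agent -config=/etc/nomad.json -config=/etc/nomad.d/consul-token.json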

Expected Result

Nomad registers a nomad service with http, rpc, and serf tags.

Actual Result

Nomad only registers rpc and serf tags.
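
(One way to confirm what actually landed in Consul is to query the catalog directly. This is just a sketch, assuming the Consul HTTPS API from the config above, a placeholder CA path, and CONSUL_HTTP_TOKEN exported in the environment:)

curl --silent --cacert /path/to/consul-ca.pem \
  --header "X-Consul-Token: $CONSUL_HTTP_TOKEN" \
  https://127.0.0.1:8501/v1/catalog/service/nomad | jq '.[].ServiceTags'

That comes back with only ["rpc", "serf"] here, where ["http", "rpc", "serf"] would be expected.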

Nomad Server logs (if appropriate)

==> WARNING: Bootstrap mode enabled! Potentially unsafe operation.
==> Loaded configuration from /etc/nomad.json, /etc/nomad.d/consul-token.json
==> Starting Nomad agent...
==> Nomad agent configuration:
       Advertise Addrs: HTTP: 10.0.1.0:4646; RPC: 10.0.1.0:4647; Serf: 10.0.1.0:4648
            Bind Addrs: HTTP: [0.0.0.0:4646]; RPC: 0.0.0.0:4647; Serf: 0.0.0.0:4648
                Client: false
             Log Level: INFO
               Node Id: 60100119-2101-5fe3-1fc7-887d6a5dab36
                Region: global (DC: us-east)
                Server: true
               Version: 1.8.0
==> Nomad agent started! Log data will stream in below:
    2024-06-19T05:42:49.734Z [INFO]  nomad: setting up raft bolt store: no_freelist_sync=false
    2024-06-19T05:42:49.736Z [INFO]  nomad.raft: starting restore from snapshot: id=15-23927-1718755211021 last-index=23927 last-term=15 size-in-bytes=298159
    2024-06-19T05:42:49.760Z [INFO]  nomad.raft: snapshot restore progress: id=15-23927-1718755211021 last-index=23927 last-term=15 size-in-bytes=298159 read-bytes=298159 percent-complete="100.00%"
    2024-06-19T05:42:49.760Z [INFO]  nomad.raft: restored from snapshot: id=15-23927-1718755211021 last-index=23927 last-term=15 size-in-bytes=298159
    2024-06-19T05:42:49.770Z [INFO]  nomad.raft: initial configuration: index=1 servers="[{Suffrage:Voter ID:35e115c2-34da-f3ba-8579-e8e122ba3dfd Address:10.0.1.0:4647}]"
    2024-06-19T05:42:49.771Z [INFO]  nomad: serf: EventMemberJoin: leader-red.global 10.0.1.0
    2024-06-19T05:42:49.771Z [INFO]  nomad: starting scheduling worker(s): num_workers=2 schedulers=["service", "batch", "system", "sysbatch", "_core"]
    2024-06-19T05:42:49.771Z [INFO]  nomad: started scheduling worker(s): num_workers=2 schedulers=["service", "batch", "system", "sysbatch", "_core"]
    2024-06-19T05:42:49.773Z [INFO]  nomad.raft: entering follower state: follower="Node at 10.0.1.0:4647 [Follower]" leader-address= leader-id=
    2024-06-19T05:42:49.774Z [WARN]  nomad: serf: Failed to re-join any previously known node
    2024-06-19T05:42:49.774Z [INFO]  nomad: adding server: server="leader-red.global (Addr: 10.0.1.0:4647) (DC: us-east)"
    2024-06-19T05:42:51.014Z [WARN]  nomad.raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
    2024-06-19T05:42:51.014Z [INFO]  nomad.raft: entering candidate state: node="Node at 10.0.1.0:4647 [Candidate]" term=32
    2024-06-19T05:42:51.016Z [INFO]  nomad.raft: election won: term=32 tally=1
    2024-06-19T05:42:51.016Z [INFO]  nomad.raft: entering leader state: leader="Node at 10.0.1.0:4647 [Leader]"
    2024-06-19T05:42:51.016Z [INFO]  nomad: cluster leadership acquired
    2024-06-19T05:42:51.055Z [INFO]  nomad: eval broker status modified: paused=false
    2024-06-19T05:42:51.055Z [INFO]  nomad: blocked evals status modified: paused=false
    2024-06-19T05:42:51.055Z [INFO]  nomad: revoking consul accessors after becoming leader: accessors=14
tgross commented 3 months ago

Hi @BrianHicks! I wasn't able to reproduce what you're seeing on either 1.8.0 or the current tip of main. I also played around with HCL vs JSON configuration and wasn't able to see a difference there either. The weird thing about this is that we create and register those services all at the same time: agent.go#L961-L1009

If you were to run the server with log_level = "debug", you'd see a message during startup about syncing to Consul like the one below. What does that look like?

2024-06-21T15:47:51.388-0400 [DEBUG] consul.sync: sync complete: registered_services=3 deregistered_services=0 registered_checks=3 deregistered_checks=0

Also, if you run the following command against one of the servers, what does the response body look like?

nomad operator api '/v1/agent/self' | jq '.config.Consuls'
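
(If the sync line is easy to miss in the startup output, something like this should surface it; this assumes Nomad is running as a systemd unit named nomad.service:)

journalctl -u nomad.service | grep 'consul.sync'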
BrianHicks commented 3 months ago

How interesting! I don't see any such message when running in debug; here's the output:

==> WARNING: Bootstrap mode enabled! Potentially unsafe operation.
==> Loaded configuration from /etc/nomad.json, /run/agenix/nomad-consul-token.json
==> Starting Nomad agent...
==> Nomad agent configuration:
       Advertise Addrs: HTTP: 10.0.1.0:4646; RPC: 10.0.1.0:4647; Serf: 10.0.1.0:4648
            Bind Addrs: HTTP: [0.0.0.0:4646]; RPC: 0.0.0.0:4647; Serf: 0.0.0.0:4648
                Client: false
             Log Level: DEBUG
               Node Id: 2f248988-a9b9-265f-6f60-ff48eeb337d7
                Region: global (DC: us-east)
                Server: true
               Version: 1.8.0
==> Nomad agent started! Log data will stream in below:
    2024-06-22T00:25:24.879Z [DEBUG] nomad: issuer not set; OIDC Discovery endpoint for workload identities disabled
    2024-06-22T00:25:24.884Z [INFO]  nomad: setting up raft bolt store: no_freelist_sync=false
    2024-06-22T00:25:24.886Z [INFO]  nomad.raft: starting restore from snapshot: id=35-28895-1719014431549 last-index=28895 last-term=35 size-in-bytes=276574
    2024-06-22T00:25:24.910Z [INFO]  nomad.raft: snapshot restore progress: id=35-28895-1719014431549 last-index=28895 last-term=35 size-in-bytes=276574 read-bytes=276574 percent-complete="100.00%"
    2024-06-22T00:25:24.911Z [INFO]  nomad.raft: restored from snapshot: id=35-28895-1719014431549 last-index=28895 last-term=35 size-in-bytes=276574
    2024-06-22T00:25:24.911Z [INFO]  nomad.raft: initial configuration: index=1 servers="[{Suffrage:Voter ID:35e115c2-34da-f3ba-8579-e8e122ba3dfd Address:10.0.1.0:4647}]"
    2024-06-22T00:25:24.912Z [INFO]  nomad: serf: EventMemberJoin: leader-red.global 10.0.1.0
    2024-06-22T00:25:24.912Z [INFO]  nomad: starting scheduling worker(s): num_workers=2 schedulers=["service", "batch", "system", "sysbatch", "_core"]
    2024-06-22T00:25:24.912Z [DEBUG] nomad: started scheduling worker: id=4b6681a1-493b-d115-a9df-076f99145c65 index=1 of=2
    2024-06-22T00:25:24.912Z [DEBUG] nomad: started scheduling worker: id=22ce07f8-09a0-8b34-913b-af8585da9491 index=2 of=2
    2024-06-22T00:25:24.912Z [INFO]  nomad: started scheduling worker(s): num_workers=2 schedulers=["service", "batch", "system", "sysbatch", "_core"]
    2024-06-22T00:25:24.912Z [DEBUG] http: UI is enabled
    2024-06-22T00:25:24.913Z [INFO]  nomad.raft: entering follower state: follower="Node at 10.0.1.0:4647 [Follower]" leader-address= leader-id=
    2024-06-22T00:25:24.914Z [WARN]  nomad: serf: Failed to re-join any previously known node
    2024-06-22T00:25:24.914Z [DEBUG] worker: running: worker_id=4b6681a1-493b-d115-a9df-076f99145c65
    2024-06-22T00:25:24.914Z [DEBUG] worker: running: worker_id=22ce07f8-09a0-8b34-913b-af8585da9491
    2024-06-22T00:25:24.914Z [INFO]  nomad: adding server: server="leader-red.global (Addr: 10.0.1.0:4647) (DC: us-east)"
    2024-06-22T00:25:24.914Z [DEBUG] nomad.keyring.replicator: starting encryption key replication
    2024-06-22T00:25:26.507Z [WARN]  nomad.raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
    2024-06-22T00:25:26.507Z [INFO]  nomad.raft: entering candidate state: node="Node at 10.0.1.0:4647 [Candidate]" term=36
    2024-06-22T00:25:26.508Z [DEBUG] nomad.raft: voting for self: term=36 id=35e115c2-34da-f3ba-8579-e8e122ba3dfd
    2024-06-22T00:25:26.510Z [DEBUG] nomad.raft: calculated votes needed: needed=1 term=36
    2024-06-22T00:25:26.510Z [DEBUG] nomad.raft: vote granted: from=35e115c2-34da-f3ba-8579-e8e122ba3dfd term=36 tally=1
    2024-06-22T00:25:26.510Z [INFO]  nomad.raft: election won: term=36 tally=1
    2024-06-22T00:25:26.510Z [INFO]  nomad.raft: entering leader state: leader="Node at 10.0.1.0:4647 [Leader]"
    2024-06-22T00:25:26.510Z [INFO]  nomad: cluster leadership acquired
    2024-06-22T00:25:26.518Z [INFO]  nomad: eval broker status modified: paused=false
    2024-06-22T00:25:26.518Z [INFO]  nomad: blocked evals status modified: paused=false
    2024-06-22T00:25:26.518Z [DEBUG] nomad.autopilot: autopilot is now running
    2024-06-22T00:25:26.518Z [DEBUG] nomad.autopilot: state update routine is now running
    2024-06-22T00:25:26.518Z [INFO]  nomad: revoking consul accessors after becoming leader: accessors=14

And here's the output of the command:

[
  {
    "Addr": "127.0.0.1:8501",
    "AllowUnauthenticated": true,
    "Auth": "",
    "AutoAdvertise": true,
    "CAFile": "",
    "CertFile": "",
    "ChecksUseAdvertise": false,
    "ClientAutoJoin": true,
    "ClientFailuresBeforeCritical": 0,
    "ClientFailuresBeforeWarning": 0,
    "ClientHTTPCheckName": "Nomad Client HTTP Check",
    "ClientServiceName": "nomad-client",
    "EnableSSL": true,
    "GRPCAddr": "",
    "GRPCCAFile": "",
    "KeyFile": "",
    "Name": "default",
    "Namespace": "",
    "ServerAutoJoin": true,
    "ServerFailuresBeforeCritical": 0,
    "ServerFailuresBeforeWarning": 0,
    "ServerHTTPCheckName": "Nomad Server HTTP Check",
    "ServerRPCCheckName": "Nomad Server RPC Check",
    "ServerSerfCheckName": "Nomad Server Serf Check",
    "ServerServiceName": "nomad",
    "ServiceIdentity": null,
    "ServiceIdentityAuthMethod": "nomad-workloads",
    "ShareSSL": null,
    "Tags": null,
    "TaskIdentity": null,
    "TaskIdentityAuthMethod": "nomad-workloads",
    "Timeout": 5000000000,
    "Token": "<redacted>",
    "VerifySSL": true
  }
]
tgross commented 3 months ago

Thanks @BrianHicks. I see that your Consul configuration doesn't have a CAFile or CertFile, but you're connecting to Consul on port 8501, which is Consul's default HTTPS port. Is there any chance it's just the wrong port, so Nomad can't find Consul at all?

I wouldn't expect to see any tags in that case, of course, but maybe the local agent has a cached version floating around from an earlier config?
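
(A quick way to test that theory, sketched here with a placeholder CA path, would be to hit the Consul agent on that port directly and then ask it what it thinks it has registered:)

# is anything answering on 8501 at all?
curl --silent --cacert /path/to/consul-ca.pem https://127.0.0.1:8501/v1/status/leader

# which services does the local agent currently have registered?
curl --silent --cacert /path/to/consul-ca.pem \
  --header "X-Consul-Token: $CONSUL_HTTP_TOKEN" \
  https://127.0.0.1:8501/v1/agent/services | jq 'keys'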