hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

Nomad registered service check fails without any output. #18071

Open blmhemu opened 1 year ago

blmhemu commented 1 year ago

Overview of the Issue

I set up the Nomad / Consul integration as per https://developer.hashicorp.com/nomad/docs/integrations/consul-integration. The service appears to be registered, but the health check is not working (for both the nomad server and client).
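For context, a minimal `consul` stanza in the Nomad agent configuration for this kind of setup might look roughly like the sketch below (the address and token value are placeholders, not my exact config):

```hcl
# Sketch of the consul block in the Nomad agent config (values are placeholders).
consul {
  address = "127.0.0.1:8500"              # local Consul agent
  token   = "<nomad-agent-consul-token>"  # token created from the ACL policy below

  # Let Nomad register its own server/client services and join via Consul.
  auto_advertise   = true
  server_auto_join = true
  client_auto_join = true
}
```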

(Screenshot, 2023-07-11: failing health check in the Consul UI)

There also seems to be a stale check that is not being deregistered (this is a single-server cluster, so it is not coming from another server).

(Screenshot: the stale check)
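As a stopgap, a stale check like that can usually be removed by hand through the local agent API; the check ID below is a placeholder, and the request needs a token allowed to write the corresponding service:

```shell
# Manually deregister a leftover check on the local agent (check ID is a placeholder).
curl -X PUT \
  -H "X-Consul-Token: $CONSUL_HTTP_TOKEN" \
  http://127.0.0.1:8500/v1/agent/check/deregister/<stale-check-id>
```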

Surprisingly, I cannot see any `[Error] http check failed` style logs, only the service registration logs.
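Even without log lines, the agent's own view of the checks (including the last check output) can be inspected directly; the `jq` filter here is only for readability and is not required:

```shell
# Dump the status and last output of every check registered with the local agent.
curl -s -H "X-Consul-Token: $CONSUL_HTTP_TOKEN" \
  http://127.0.0.1:8500/v1/agent/checks | jq 'map_values({Status, Output})'
```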


Reproduction Steps

Consul ACL policy for Nomad

    agent_prefix "" {
      policy = "read"
    }

    node_prefix "" {
      policy = "read"
    }

    service_prefix "" {
      policy = "write"
    }

    acl = "write" # Only for server not for client

Use the CONSUL_HTTP_TOKEN env var for Nomad and enable the Consul integration.
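For reference, a policy and token like this can be created with the standard `consul acl` commands; the policy name and rules file path are placeholders:

```shell
# Create the policy from the rules above and mint a token for the Nomad agent.
consul acl policy create -name "nomad-agent" -rules @nomad-agent-policy.hcl
consul acl token create -description "Nomad agent token" -policy-name "nomad-agent"

# Export the SecretID from the token output for the Nomad agent to pick up.
export CONSUL_HTTP_TOKEN="<secret-id-from-token-create>"
```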

I tried changing the advertise address, etc. in Nomad, but could not get it working.

Consul info for both Client and Server

Server info:

```
agent:
	check_monitors = 0
	check_ttls = 0
	checks = 3
	services = 3
build:
	prerelease =
	revision = 192df66a
	version = 1.16.0
	version_metadata =
consul:
	acl = enabled
	bootstrap = true
	known_datacenters = 1
	leader = true
	leader_addr = 10.1.1.1:8300
	server = true
raft:
	applied_index = 681355
	commit_index = 681355
	fsm_pending = 0
	last_contact = 0
	last_log_index = 681355
	last_log_term = 218
	last_snapshot_index = 671815
	last_snapshot_term = 207
	latest_configuration = [{Suffrage:Voter ID:31172a93-71b5-9e17-83b5-7bc8e550e51c Address:10.1.1.1:8300}]
	latest_configuration_index = 0
	num_peers = 0
	protocol_version = 3
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Leader
	term = 218
runtime:
	arch = arm64
	cpu_count = 4
	goroutines = 183
	max_procs = 4
	os = linux
	version = go1.20.4
serf_lan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 218
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 530
	members = 2
	query_queue = 0
	query_time = 1
serf_wan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 1
	members = 1
```

Operating system and Environment details

Consul 1.16 / Ubuntu 22.04

Log Fragments

relaxXiaoLuo commented 1 year ago

I encountered the same issue as you did. However, when I disabled ACL token authentication, the problem no longer occurred. My token was also set to the "Global Management" policy. Finally, I tried downgrading the Consul version to below 1.15, and the issue was resolved: I was able to register, perform health checks, and deregister services without any problem.

danieleturani commented 1 year ago

We seem to have solved the issue by doing a consul leave and consul join on each Consul server (one at a time). What we noticed is that from version 1.16.x the Consul agent on Consul servers keeps restarting itself, causing a repeated leader re-election. Nomad is then unable to talk successfully with its local Consul agent, which leads Traefik to consider all servers of a service unhealthy.

```
Aug 30 04:59:19 knomadc4200 nomad[547]:     2023-08-30T04:59:19.242+0200 [INFO]  client.fingerprint_mgr.consul: consul agent is unavailable
Aug 30 04:59:24 knomadc4200 nomad[547]:     2023-08-30T04:59:24.233+0200 [WARN]  consul.sync: failed to update services in Consul: error="failed to query Consul services: Get \"http://127.0.0.1:8500/v1/agent/services\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
Aug 30 04:59:32 knomadc4200 nomad[547]:     2023-08-30T04:59:32.988+0200 [INFO]  consul.sync: successfully updated services in Consul
Aug 30 04:59:34 knomadc4200 nomad[547]:     2023-08-30T04:59:34.245+0200 [INFO]  client.fingerprint_mgr.consul: consul agent is available
```

This might not be exactly the same case as the OP's, but maybe it helps?
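Roughly, such a per-server leave/join looks like the sketch below; the join address is a placeholder, and restarting via systemd is an assumption about how Consul is run, not exact instructions:

```shell
# One server at a time: leave, restart the agent, rejoin, verify.
consul leave                      # gracefully leave the cluster; the agent shuts down
sudo systemctl restart consul     # assumption: Consul runs as a systemd service
consul join 10.1.1.1              # rejoin via any live server (placeholder address)
consul operator raft list-peers   # confirm the server is back as a raft peer
```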

relaxXiaoLuo commented 1 year ago

Regarding the solution you mentioned, executing `consul leave` and `consul join` on each Consul server one by one: do you have a specific set of instructions for this? From what I observed, I did not notice the repeated restarting in my Consul logs.