Open blmhemu opened 1 year ago
I encountered the same issue as you did. However, when I disabled ACL token authentication, the problem no longer occurred. My token was also set to "Global Management" policy. Finally, I tried downgrading the Consul version to below 1.15, and the issue was resolved. I was able to register, perform health checks, and deregister services without any problem.
We seem to have solved the issue by doing a `consul leave` and `consul join` on each Consul server (one at a time). What we noticed is that from version 1.16.x the Consul agent on Consul servers keeps restarting itself, causing repeated leader re-elections. Nomad is then unable to talk successfully with its local Consul agent, which leads Traefik to consider all servers of a service unhealthy.
Aug 30 04:59:19 knomadc4200 nomad[547]: 2023-08-30T04:59:19.242+0200 [INFO] client.fingerprint_mgr.consul: consul agent is unavailable
Aug 30 04:59:24 knomadc4200 nomad[547]: 2023-08-30T04:59:24.233+0200 [WARN] consul.sync: failed to update services in Consul: error="failed to query Consul services: Get \"http://127.0.0.1:8500/v1/agent/services\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
Aug 30 04:59:32 knomadc4200 nomad[547]: 2023-08-30T04:59:32.988+0200 [INFO] consul.sync: successfully updated services in Consul
Aug 30 04:59:34 knomadc4200 nomad[547]: 2023-08-30T04:59:34.245+0200 [INFO] client.fingerprint_mgr.consul: consul agent is available
Might not be the exact case as the OP, but maybe helps?
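To check whether a cluster shows the same pattern, one option is to measure how long Nomad reports its local Consul agent as unavailable. The following is a minimal sketch, assuming Nomad log lines in the format quoted above; the function name `unavailable_windows` and the regular expression are illustrative and not part of Nomad or Consul.

```python
import re
from datetime import datetime

# Matches the Nomad timestamp and the fingerprint message in lines like the
# ones quoted above. The pattern is an assumption based on those samples.
LINE_RE = re.compile(
    r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}[+-]\d{4}) "
    r"\[\w+\]\s+client\.fingerprint_mgr\.consul: "
    r"consul agent is (unavailable|available)"
)


def unavailable_windows(lines):
    """Return (start, end, seconds) for each unavailable -> available pair."""
    windows = []
    start = None
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue
        ts = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S.%f%z")
        if match.group(2) == "unavailable":
            if start is None:
                start = ts  # keep the first of several repeated reports
        elif start is not None:
            windows.append((start, ts, (ts - start).total_seconds()))
            start = None
    return windows


# Sample lines copied from the log excerpt above.
SAMPLE = [
    "Aug 30 04:59:19 knomadc4200 nomad[547]: 2023-08-30T04:59:19.242+0200 [INFO] client.fingerprint_mgr.consul: consul agent is unavailable",
    "Aug 30 04:59:32 knomadc4200 nomad[547]: 2023-08-30T04:59:32.988+0200 [INFO] consul.sync: successfully updated services in Consul",
    "Aug 30 04:59:34 knomadc4200 nomad[547]: 2023-08-30T04:59:34.245+0200 [INFO] client.fingerprint_mgr.consul: consul agent is available",
]

if __name__ == "__main__":
    for start, end, seconds in unavailable_windows(SAMPLE):
        print(f"{start.isoformat()} -> {end.isoformat()}: {seconds:.3f}s")
```

For the excerpt above this yields a single window of roughly 15 seconds. Frequent windows of this kind would be consistent with the restart and re-election loop described here, although they do not prove it.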
About the solution you mentioned, running "consul leave" and "consul join" on each Consul server one at a time: do you have a specific set of instructions for this? In my case, I did not see any repeated restarts in the Consul logs.
Overview of the Issue
I set up the Nomad / Consul integration as per https://developer.hashicorp.com/nomad/docs/integrations/consul-integration. The service seems to be registered, but the health check is not working (for both nomad-server and client).
There also seems to be a stale check that is not being deregistered. (This is a single-server cluster, so it is not from another server.)
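One way to inspect such checks is to ask the local agent directly through the Consul HTTP API. The sketch below assumes the agent listens on `http://127.0.0.1:8500` and that a token is available in `CONSUL_HTTP_TOKEN`; the helper names (`consul_request`, `non_passing_checks`, `deregister_check`) are hypothetical. It lists every check that is not passing and leaves deregistration as a manual step.

```python
import json
import os
import urllib.parse
import urllib.request

# Assumption: the local agent address; adjust to your setup.
CONSUL_ADDR = "http://127.0.0.1:8500"


def consul_request(path, method="GET"):
    """Call the local Consul agent, sending the ACL token if one is set."""
    req = urllib.request.Request(CONSUL_ADDR + path, method=method)
    token = os.environ.get("CONSUL_HTTP_TOKEN")
    if token:
        req.add_header("X-Consul-Token", token)
    with urllib.request.urlopen(req, timeout=5) as resp:
        body = resp.read()
    return json.loads(body) if body else None


def non_passing_checks(checks):
    """checks: the CheckID -> check mapping returned by /v1/agent/checks."""
    return sorted(
        (check_id, check.get("ServiceID", ""), check.get("Status", ""))
        for check_id, check in checks.items()
        if check.get("Status") != "passing"
    )


def deregister_check(check_id):
    """Remove one check from the local agent."""
    quoted = urllib.parse.quote(check_id, safe="")
    consul_request("/v1/agent/check/deregister/" + quoted, method="PUT")


if __name__ == "__main__":
    checks = consul_request("/v1/agent/checks")
    for check_id, service_id, status in non_passing_checks(checks):
        print(check_id, service_id, status)
    # Only after confirming that a check really is stale:
    # deregister_check("<check id printed above>")
```

If Nomad still believes it owns the check, its sync loop may register it again, so this is a diagnostic aid and not a fix.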
Surprisingly, I cannot see any `[Error] http check failed` kind of logs, only the service registration logs.
Reproduction Steps
Consul ACL for nomad
Use CONSUL_HTTP_TOKEN env var in nomad and enable consul integration.
Tried changing advertising address / etc. in nomad, could not get it working.
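Because the steps above hand Nomad its token through `CONSUL_HTTP_TOKEN`, it can help to confirm which policies that token actually carries before changing anything else. The following is a minimal sketch using Consul's "read self token" endpoint, assuming the agent is reachable at `http://127.0.0.1:8500`; `read_self_token` and `policy_names` are illustrative helper names.

```python
import json
import os
import urllib.request

# Assumption: the local agent address; adjust to your setup.
CONSUL_ADDR = "http://127.0.0.1:8500"


def read_self_token(secret):
    """Fetch the ACL token that corresponds to the given secret ID."""
    req = urllib.request.Request(CONSUL_ADDR + "/v1/acl/token/self")
    req.add_header("X-Consul-Token", secret)
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.load(resp)


def policy_names(token):
    """Return the sorted policy names attached to a token response."""
    return sorted(policy["Name"] for policy in token.get("Policies") or [])


if __name__ == "__main__":
    token = read_self_token(os.environ["CONSUL_HTTP_TOKEN"])
    print("accessor:", token.get("AccessorID"))
    print("policies:", policy_names(token))
```

Running this in the same environment as the Nomad agent shows whether the token Nomad uses is the one you expect.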
Consul info for both Client and Server
Server info
```
agent:
	check_monitors = 0
	check_ttls = 0
	checks = 3
	services = 3
build:
	prerelease =
	revision = 192df66a
	version = 1.16.0
	version_metadata =
consul:
	acl = enabled
	bootstrap = true
	known_datacenters = 1
	leader = true
	leader_addr = 10.1.1.1:8300
	server = true
raft:
	applied_index = 681355
	commit_index = 681355
	fsm_pending = 0
	last_contact = 0
	last_log_index = 681355
	last_log_term = 218
	last_snapshot_index = 671815
	last_snapshot_term = 207
	latest_configuration = [{Suffrage:Voter ID:31172a93-71b5-9e17-83b5-7bc8e550e51c Address:10.1.1.1:8300}]
	latest_configuration_index = 0
	num_peers = 0
	protocol_version = 3
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Leader
	term = 218
runtime:
	arch = arm64
	cpu_count = 4
	goroutines = 183
	max_procs = 4
	os = linux
	version = go1.20.4
serf_lan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 218
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 530
	members = 2
	query_queue = 0
	query_time = 1
serf_wan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 1
	members = 1
	query_queue = 0
	query_time = 1
```
Operating system and Environment details
Consul 1.16 / Ubuntu 22.04
Log Fragments