Closed: dani closed this issue 4 months ago.
Another way to repro is to run
systemctl restart nomad
on the client(s). When I run this, the service templates get messed up.
@blmhemu can you clarify this? What's "messed up" mean here? ~(Note that https://github.com/hashicorp/nomad/issues/19542 is unrelated, so if the API output is fine, you should take that over to #19542 and that'll get resolved over there.)~ Nevermind, I see what you mean in https://github.com/hashicorp/nomad/issues/18203.
Thanks @linuxoid69! So that looks like it's the Consul agent token configuration. The allowed configuration for that changed a while back, so that reinforces what was described in https://github.com/hashicorp/nomad/issues/16616#issuecomment-1494368489.
Hi folks, just a quick update... I still don't have a reproduction for this issue. But by revisiting the code, I see a few places where it's possible to drop deregistrations and places where we could be ensuring data integrity between allocations and their service registrations.
When a node is marked down because it missed heartbeats, we remove Vault and Consul tokens (for the pre-Workload Identity workflows) after we've written the node update to Raft. This is unavoidably non-transactional because the Consul and Vault servers aren't in the same Raft cluster as Nomad itself. But we've unnecessarily mirrored this same behavior to deregister Nomad services (ref node_endpoint.go#L704-L718). This makes it possible for the leader to successfully write the node update to Raft without removing services.
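As a sketch of the fix direction, the deregistration can happen inside the same state-store write that records the node going down, so the two can't diverge. Types and method names here are illustrative stand-ins, not Nomad's actual state store:

```go
package main

import "fmt"

// ServiceRegistration is a hypothetical stand-in for Nomad's service
// registration records, keyed by the node they live on.
type ServiceRegistration struct {
	ID     string
	NodeID string
}

// StateStore is an illustrative in-memory store; the real one is backed
// by go-memdb and applied through Raft.
type StateStore struct {
	services map[string]*ServiceRegistration
}

// UpsertNodeStatus records the node status and, when the node goes down,
// deletes its service registrations in the same logical write, rather
// than issuing a separate best-effort deregistration after the Raft apply.
func (s *StateStore) UpsertNodeStatus(nodeID, status string) error {
	// ... write the node status update here ...
	if status == "down" {
		for id, reg := range s.services {
			if reg.NodeID == nodeID {
				delete(s.services, id) // atomic with the node update
			}
		}
	}
	return nil
}

func main() {
	s := &StateStore{services: map[string]*ServiceRegistration{
		"svc1": {ID: "svc1", NodeID: "node-a"},
	}}
	_ = s.UpsertNodeStatus("node-a", "down")
	fmt.Println(len(s.services)) // 0: no leaked registrations
}
```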
When the allocation is stopped, we deregister the service in the alloc runner's PreKill hook. This ensures we delete the service registration and wait for the shutdown delay before shutting down the tasks, so that workloads can drain their connections. However, the call to remove the workload ignores any errors, and the RemoveWorkload method in the service client for NSD doesn't retry if there's an issue either; it only logs. The PreKill hook interface doesn't allow errors to block the shutdown of an allocation, but what we can do is give RemoveWorkload a chance to retry recoverable errors from transient network issues.
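A minimal sketch of that retry, assuming a hypothetical deregister callback and recoverable-error check (the real NSD client internals differ):

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var errTransient = errors.New("transient network error")

// isRecoverable is a placeholder: in practice this would inspect the RPC
// error for timeouts, connection resets, and similar transient failures.
func isRecoverable(err error) bool {
	return errors.Is(err, errTransient)
}

// removeWorkloadWithRetry gives deregistration a chance to ride out
// transient errors instead of logging once and giving up.
func removeWorkloadWithRetry(deregister func() error, attempts int, backoff time.Duration) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = deregister(); err == nil {
			return nil
		}
		if !isRecoverable(err) {
			return err // permanent errors still fail fast
		}
		time.Sleep(backoff)
		backoff *= 2 // simple exponential backoff between attempts
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, err)
}

func main() {
	calls := 0
	err := removeWorkloadWithRetry(func() error {
		calls++
		if calls < 3 {
			return errTransient
		}
		return nil // succeeds on the third try
	}, 5, 10*time.Millisecond)
	fmt.Println(calls, err) // 3 <nil>
}
```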
When an allocation is marked terminal from the client in UpdateAllocsFromClient, we have an opportunity to ensure data integrity by deregistering services for that allocation. This doesn't quite ensure correct behavior on its own, as we're supposed to be deregistering before the allocation is terminal. But it at least prevents the services from sitting around.
When an allocation is deleted (during eval garbage collection, see state_store.go#L3617-L3632), we have the same opportunity to ensure data integrity.
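Both of these amount to the same sweep: when the state store records an allocation as terminal or deletes it outright, drop any service registrations still keyed to that allocation in the same write. A minimal sketch with illustrative types (not Nomad's actual state store API):

```go
package main

import "fmt"

// ServiceRegistration is a hypothetical stand-in for Nomad's service
// registration records, keyed by the allocation that created them.
type ServiceRegistration struct {
	ID      string
	AllocID string
}

// deleteServicesForAlloc sweeps registrations belonging to an allocation.
// In the real fix this would run inside the same state-store transaction
// that marks the alloc terminal (UpdateAllocsFromClient) or deletes it
// (eval garbage collection).
func deleteServicesForAlloc(services map[string]*ServiceRegistration, allocID string) int {
	removed := 0
	for id, reg := range services {
		if reg.AllocID == allocID {
			delete(services, id)
			removed++
		}
	}
	return removed
}

func main() {
	services := map[string]*ServiceRegistration{
		"_nomad-task-abc-web": {ID: "_nomad-task-abc-web", AllocID: "abc"},
		"_nomad-task-def-db":  {ID: "_nomad-task-def-db", AllocID: "def"},
	}
	fmt.Println(deleteServicesForAlloc(services, "abc")) // 1 removed
	fmt.Println(len(services))                           // 1 remaining
}
```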
Without a reliable reproduction that covers everyone's reports I can't guarantee that fixing the above problems will close out this issue for Nomad services forever. But these are at least plausibly involved. I'm going to work up a patch or patch series for these, and should have those up for review in the next few days assuming all goes well.
Thanks again everyone for your patience with this issue!
I've broken out the data integrity fixes to https://github.com/hashicorp/nomad/pull/20590, and I'll do the client-side work as a separate PR.
Client-side work is up in https://github.com/hashicorp/nomad/pull/20596
Ok folks, #20596 and #20590 have both been merged and will ship in the upcoming Nomad 1.8.0 (and supported backport versions). I'm going to close this issue.
"But I saw it again!"
If you see services left behind after an allocation has stopped on Nomad 1.8.0 or later, please let us know. We may move reports to a new issue in order to properly triage. Make sure to include the following:
- Which service provider you're seeing the problem with ("consul" or "nomad")
- Which version of Nomad you're using
- Which version of Consul you're using (if using the "consul" service provider)
- Your Nomad agent's consul config block and Consul agent's config (if using the "consul" service provider)
- The sequence of events as much as you know (ex. "the Nomad agent was restarted, then I submitted a new version of the job and the allocation was replaced")
- Client logs for the allocation as it's shutting down
- Any server logs that include errors around the deregistration process or other triggering event

@tgross Looks like we have indication of this again in Nomad v1.8.1.
Your Nomad agent's consul config block and Consul agent's config (if using the "consul" service provider):
Nomad Agent Consul Block
consul {
address = "127.0.0.1:8500"
token = "<CONSUL BOOTSTRAP TOKEN>"
}
Consul Agent Config
ui = true
log_level = "INFO"
data_dir = "/opt/hashicorp/consul/data"
bind_addr = "0.0.0.0"
client_addr = "0.0.0.0"
advertise_addr = "x.x.x.18"
retry_join = ["project_name=dev-nomad provider=gce tag_value=dev-dev1-dc1-nomad-auto-join"]
datacenter = "dc1"
## set up logging
log_file = "/var/log/consul/" # service creates a log file under this dir
log_rotate_bytes = 1073741824 # 1GB
log_rotate_max_files = 1 # keep 1 additional file jic for troubleshooting
log_json = true
log_level = "INFO"
enable_syslog = false
acl {
enabled = true
default_policy = "deny"
down_policy = "extend-cache"
}
connect {
enabled = true
}
ports {
grpc = 8502
}
Will work on getting a reliable reproduction on the side to deliver if needed; just let me know in Slack. Wanted to post here to keep track.
Hi @tgross,
We also still have this issue in our CI environment (1 Nomad server + Consul + Vault + 3 clients). We started to observe an increase in this after we enabled Workload Identity (WI) for Consul.
Which service provider you're seeing the problem with ("consul" or "nomad"): Consul
Which version of Nomad you're using: 1.8.1
Which version of Consul you're using (if using the "consul" service provider): 1.18.1
Your Nomad agent's consul config block and Consul agent's config (if using the "consul" service provider):
{
"consul": {
"address": "127.0.0.1:8501",
"grpc_address": "127.0.0.1:8503",
"auto_advertise": true,
"server_auto_join": true,
"client_auto_join": true,
"token": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
"ssl": true,
"ca_file": "/opt/consul/certs/consul-ca-cert.pem",
"cert_file": "/opt/consul/certs/consul-server-cert.pem",
"key_file": "/opt/consul/certs/consul-server-key.pem",
"grpc_ca_file": "/opt/consul/certs/consul-ca-cert.pem",
"verify_ssl": false
}
}
"node_name": "<node_name>",
"addresses":{
"http": "0.0.0.0",
"https": "0.0.0.0",
"dns": "0.0.0.0",
"grpc_tls": "0.0.0.0"
},
"data_dir": "/opt/consul/data",
"log_level": "INFO",
"datacenter": "dc1",
"bind_addr": "<ip_addr>",
"advertise_addr": "<ip_addr>",
"ports": {
"server": 8300,
"serf_lan": 8301,
"serf_wan": 8302,
"http": 8500,
"https": 8501,
"grpc_tls": 8503,
"dns": 8600,
"sidecar_min_port": 21000,
"sidecar_max_port": 21255,
"expose_min_port": 21500,
"expose_max_port": 21755
},
"server": false,
"retry_join": ["<consul_server>"],
"rejoin_after_leave": true,
"connect": {
"enabled": true
},
"encrypt": "<some_secret>",
"tls": {
"defaults": {
"ca_file": "/opt/consul/certs/consul-ca-cert.pem",
"cert_file": "/opt/consul/certs/consul-server-cert.pem",
"key_file": "/opt/consul/certs/consul-server-key.pem",
"verify_incoming": true,
"verify_outgoing": true
},
"https": {
"verify_incoming": true
},
"internal_rpc": {
"verify_server_hostname": true,
"verify_incoming": true
},
"grpc": {
"verify_incoming": false
}
},
"auto_encrypt": {
"tls": true
},
"limits": {
"http_max_conns_per_client": 8000,
"https_handshake_timeout": "15s",
"rpc_max_conns_per_client": 400,
"rpc_handshake_timeout": "15s"
},
"acl": {
"enabled": true,
"tokens": {
"agent": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
},
"default_policy": "deny",
"down_policy": "extend-cache",
"enable_token_persistence": true
}
The sequence of events as much as you know (ex. "the Nomad agent was restarted, then I submitted a new version of the job and the allocation was replaced")
Client logs for the allocation as it's shutting down (in this case, a new deploy for an existing job is submitted, with 3 allowed restarts):
Nomad:
2024-07-04T08:13:06.785415+00:00 <node_name> nomad[4008547]: 2024-07-04T08:13:06.785Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=30f0b098-79ae-c813-9e38-4957fd2f336e task=backoffice-parking-consumer type=Received msg="Task received by client" failed=false
2024-07-04T08:13:06.820716+00:00 <node_name> nomad[4008547]: 2024-07-04T08:13:06.820Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=30f0b098-79ae-c813-9e38-4957fd2f336e task=backoffice-parking-consumer type="Task Setup" msg="Building Task Directory" failed=false
2024-07-04T08:13:06.888908+00:00 <node_name> nomad[4008547]: 2024-07-04T08:13:06.888Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=30f0b098-79ae-c813-9e38-4957fd2f336e task=backoffice-parking-consumer type="Downloading Artifacts" msg="Client is downloading artifacts" failed=false
2024-07-04T08:13:09.227594+00:00 <node_name> nomad[4008547]: 2024-07-04T08:13:09.227Z [INFO] agent: (runner) creating new runner (dry: false, once: false)
2024-07-04T08:13:09.228156+00:00 <node_name> nomad[4008547]: 2024-07-04T08:13:09.228Z [WARN] agent: (clients) disabling consul SSL verification
2024-07-04T08:13:09.228795+00:00 <node_name> nomad[4008547]: 2024-07-04T08:13:09.228Z [INFO] agent: (runner) creating watcher
2024-07-04T08:13:09.229282+00:00 <node_name> nomad[4008547]: 2024-07-04T08:13:09.229Z [INFO] agent: (runner) starting
2024-07-04T08:13:09.250898+00:00 <node_name> nomad[4008547]: 2024-07-04T08:13:09.250Z [WARN] agent: (runner) watching 186 dependencies - watching this many dependencies could DDoS your servers
2024-07-04T08:13:09.291654+00:00 <node_name> nomad[4008547]: 2024-07-04T08:13:09.291Z [INFO] agent: (runner) rendered "(dynamic)" => "/opt/nomad/data/alloc/30f0b098-79ae-c813-9e38-4957fd2f336e/backoffice-parking-consumer/local/logback.xml"
2024-07-04T08:13:09.316109+00:00 <node_name> nomad[4008547]: 2024-07-04T08:13:09.315Z [INFO] agent: (runner) rendered "(dynamic)" => "/opt/nomad/data/alloc/30f0b098-79ae-c813-9e38-4957fd2f336e/backoffice-parking-consumer/local/logrotate"
2024-07-04T08:13:09.344981+00:00 <node_name> nomad[4008547]: 2024-07-04T08:13:09.344Z [INFO] agent: (runner) rendered "(dynamic)" => "/opt/nomad/data/alloc/30f0b098-79ae-c813-9e38-4957fd2f336e/backoffice-parking-consumer/local/envs"
2024-07-04T08:13:09.351655+00:00 <node_name> nomad[4008547]: 2024-07-04T08:13:09.351Z [WARN] agent: (runner) watching 187 dependencies - watching this many dependencies could DDoS your servers
2024-07-04T08:13:09.380496+00:00 <node_name> nomad[4008547]: 2024-07-04T08:13:09.380Z [INFO] agent: (runner) rendered "(dynamic)" => "/opt/nomad/data/alloc/30f0b098-79ae-c813-9e38-4957fd2f336e/backoffice-parking-consumer/local/backoffice-parking-consumer.conf"
2024-07-04T08:13:09.494995+00:00 <node_name> nomad[4008547]: 2024-07-04T08:13:09.494Z [INFO] agent: (runner) rendered "(dynamic)" => "/opt/nomad/data/alloc/30f0b098-79ae-c813-9e38-4957fd2f336e/backoffice-parking-consumer/local/main.properties"
2024-07-04T08:13:09.674300+00:00 <node_name> nomad[4008547]: 2024-07-04T08:13:09.674Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=30f0b098-79ae-c813-9e38-4957fd2f336e task=backoffice-parking-consumer type=Driver msg="Downloading image" failed=false
2024-07-04T08:13:16.995273+00:00 <node_name> nomad[4008547]: 2024-07-04T08:13:16.995Z [INFO] client.driver_mgr.docker: created container: driver=docker container_id=70f26f726a384f5be291ccd03e6cc5a73d48fd40f640ae622d168821e30415fe
2024-07-04T08:13:17.440496+00:00 <node_name> nomad[4008547]: 2024-07-04T08:13:17.439Z [INFO] client.driver_mgr.docker: started container: driver=docker container_id=70f26f726a384f5be291ccd03e6cc5a73d48fd40f640ae622d168821e30415fe
2024-07-04T08:13:17.493295+00:00 <node_name> nomad[4008547]: 2024-07-04T08:13:17.493Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=30f0b098-79ae-c813-9e38-4957fd2f336e task=backoffice-parking-consumer type=Started msg="Task started by client" failed=false
2024-07-04T08:13:36.986712+00:00 <node_name> nomad[4008547]: 2024-07-04T08:13:36.986Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=30f0b098-79ae-c813-9e38-4957fd2f336e task=backoffice-parking-consumer type=Terminated msg="Exit Code: 0" failed=false
2024-07-04T08:13:36.993466+00:00 <node_name> nomad[4008547]: 2024-07-04T08:13:36.993Z [INFO] client.driver_mgr.docker.docker_logger: plugin process exited: driver=docker plugin=/opt/nomad/bin/nomad id=2596806
2024-07-04T08:13:37.006807+00:00 <node_name> nomad[4008547]: 2024-07-04T08:13:37.006Z [INFO] client.alloc_runner.task_runner: restarting task: alloc_id=30f0b098-79ae-c813-9e38-4957fd2f336e task=backoffice-parking-consumer reason="Restart within policy" delay=35.272860626s
2024-07-04T08:13:37.007018+00:00 <node_name> nomad[4008547]: 2024-07-04T08:13:37.006Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=30f0b098-79ae-c813-9e38-4957fd2f336e task=backoffice-parking-consumer type=Restarting msg="Task restarting in 35.272860626s" failed=false
2024-07-04T08:14:12.446927+00:00 <node_name> nomad[4008547]: 2024-07-04T08:14:12.446Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=30f0b098-79ae-c813-9e38-4957fd2f336e task=backoffice-parking-consumer type=Driver msg="Downloading image" failed=false
2024-07-04T08:14:12.907417+00:00 <node_name> nomad[4008547]: 2024-07-04T08:14:12.907Z [INFO] client.driver_mgr.docker: created container: driver=docker container_id=dbbab1efe68e172a7e0f56319fad010db22b626ce93f730dd11487e963ed6c8a
2024-07-04T08:14:13.405087+00:00 <node_name> nomad[4008547]: 2024-07-04T08:14:13.404Z [INFO] client.driver_mgr.docker: started container: driver=docker container_id=dbbab1efe68e172a7e0f56319fad010db22b626ce93f730dd11487e963ed6c8a
2024-07-04T08:14:13.463133+00:00 <node_name> nomad[4008547]: 2024-07-04T08:14:13.463Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=30f0b098-79ae-c813-9e38-4957fd2f336e task=backoffice-parking-consumer type=Started msg="Task started by client" failed=false
2024-07-04T08:14:32.409339+00:00 <node_name> nomad[4008547]: 2024-07-04T08:14:32.408Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=30f0b098-79ae-c813-9e38-4957fd2f336e task=backoffice-parking-consumer type=Terminated msg="Exit Code: 0" failed=false
2024-07-04T08:14:32.416647+00:00 <node_name> nomad[4008547]: 2024-07-04T08:14:32.416Z [INFO] client.driver_mgr.docker.docker_logger: plugin process exited: driver=docker plugin=/opt/nomad/bin/nomad id=2604407
2024-07-04T08:14:32.428570+00:00 <node_name> nomad[4008547]: 2024-07-04T08:14:32.428Z [INFO] client.alloc_runner.task_runner: restarting task: alloc_id=30f0b098-79ae-c813-9e38-4957fd2f336e task=backoffice-parking-consumer reason="Restart within policy" delay=35.718148757s
2024-07-04T08:14:32.428710+00:00 <node_name> nomad[4008547]: 2024-07-04T08:14:32.428Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=30f0b098-79ae-c813-9e38-4957fd2f336e task=backoffice-parking-consumer type=Restarting msg="Task restarting in 35.718148757s" failed=false
2024-07-04T08:15:08.255694+00:00 <node_name> nomad[4008547]: 2024-07-04T08:15:08.254Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=30f0b098-79ae-c813-9e38-4957fd2f336e task=backoffice-parking-consumer type=Driver msg="Downloading image" failed=false
2024-07-04T08:15:08.761774+00:00 <node_name> nomad[4008547]: 2024-07-04T08:15:08.761Z [INFO] client.driver_mgr.docker: created container: driver=docker container_id=68ba996f95abdf00048c15a0d06cb327341915ed2217cfac09684309772d5cc2
2024-07-04T08:15:09.206826+00:00 <node_name> nomad[4008547]: 2024-07-04T08:15:09.206Z [INFO] client.driver_mgr.docker: started container: driver=docker container_id=68ba996f95abdf00048c15a0d06cb327341915ed2217cfac09684309772d5cc2
2024-07-04T08:15:09.257945+00:00 <node_name> nomad[4008547]: 2024-07-04T08:15:09.257Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=30f0b098-79ae-c813-9e38-4957fd2f336e task=backoffice-parking-consumer type=Started msg="Task started by client" failed=false
2024-07-04T08:15:27.751406+00:00 <node_name> nomad[4008547]: 2024-07-04T08:15:27.751Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=30f0b098-79ae-c813-9e38-4957fd2f336e task=backoffice-parking-consumer type=Terminated msg="Exit Code: 0" failed=false
2024-07-04T08:15:27.758280+00:00 <node_name> nomad[4008547]: 2024-07-04T08:15:27.758Z [INFO] client.driver_mgr.docker.docker_logger: plugin process exited: driver=docker plugin=/opt/nomad/bin/nomad id=2611952
2024-07-04T08:15:27.769543+00:00 <node_name> nomad[4008547]: 2024-07-04T08:15:27.769Z [INFO] client.alloc_runner.task_runner: restarting task: alloc_id=30f0b098-79ae-c813-9e38-4957fd2f336e task=backoffice-parking-consumer reason="Restart within policy" delay=32.650043057s
2024-07-04T08:15:27.769613+00:00 <node_name> nomad[4008547]: 2024-07-04T08:15:27.769Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=30f0b098-79ae-c813-9e38-4957fd2f336e task=backoffice-parking-consumer type=Restarting msg="Task restarting in 32.650043057s" failed=false
2024-07-04T08:16:00.454924+00:00 <node_name> nomad[4008547]: 2024-07-04T08:16:00.454Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=30f0b098-79ae-c813-9e38-4957fd2f336e task=backoffice-parking-consumer type=Driver msg="Downloading image" failed=false
2024-07-04T08:16:00.873403+00:00 <node_name> nomad[4008547]: 2024-07-04T08:16:00.872Z [INFO] client.driver_mgr.docker: created container: driver=docker container_id=895827c73be9faa3e76bf68ca615c445307ca3cf8b6f0db1d94baeee20fdd920
2024-07-04T08:16:01.344882+00:00 <node_name> nomad[4008547]: 2024-07-04T08:16:01.344Z [INFO] client.driver_mgr.docker: started container: driver=docker container_id=895827c73be9faa3e76bf68ca615c445307ca3cf8b6f0db1d94baeee20fdd920
2024-07-04T08:16:01.406204+00:00 <node_name> nomad[4008547]: 2024-07-04T08:16:01.406Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=30f0b098-79ae-c813-9e38-4957fd2f336e task=backoffice-parking-consumer type=Started msg="Task started by client" failed=false
2024-07-04T08:16:19.587339+00:00 <node_name> nomad[4008547]: 2024-07-04T08:16:19.587Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=30f0b098-79ae-c813-9e38-4957fd2f336e task=backoffice-parking-consumer type=Terminated msg="Exit Code: 0" failed=false
2024-07-04T08:16:19.593708+00:00 <node_name> nomad[4008547]: 2024-07-04T08:16:19.593Z [INFO] client.driver_mgr.docker.docker_logger: plugin process exited: driver=docker plugin=/opt/nomad/bin/nomad id=2618891
2024-07-04T08:16:19.604943+00:00 <node_name> nomad[4008547]: 2024-07-04T08:16:19.604Z [INFO] client.alloc_runner.task_runner: not restarting task: alloc_id=30f0b098-79ae-c813-9e38-4957fd2f336e task=backoffice-parking-consumer reason="Exceeded allowed attempts 3 in interval 30m0s and mode is \"fail\""
2024-07-04T08:16:19.605031+00:00 <node_name> nomad[4008547]: 2024-07-04T08:16:19.604Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=30f0b098-79ae-c813-9e38-4957fd2f336e task=backoffice-parking-consumer type="Not Restarting" msg="Exceeded allowed attempts 3 in interval 30m0s and mode is \"fail\"" failed=true
2024-07-04T08:16:19.614604+00:00 <node_name> nomad[4008547]: 2024-07-04T08:16:19.614Z [INFO] client.alloc_runner.task_runner.task_hook.logmon: plugin process exited: alloc_id=30f0b098-79ae-c813-9e38-4957fd2f336e task=backoffice-parking-consumer plugin=/opt/nomad/bin/nomad id=2595171
2024-07-04T08:16:19.614846+00:00 <node_name> nomad[4008547]: 2024-07-04T08:16:19.614Z [INFO] agent: (runner) stopping
2024-07-04T08:16:19.616711+00:00 <node_name> nomad[4008547]: 2024-07-04T08:16:19.616Z [INFO] client.gc: marking allocation for GC: alloc_id=30f0b098-79ae-c813-9e38-4957fd2f336e
2024-07-04T08:16:19.616802+00:00 <node_name> nomad[4008547]: 2024-07-04T08:16:19.616Z [INFO] agent: (runner) received finish
2024-07-04T08:16:19.639835+00:00 <node_name> nomad[4008547]: 2024-07-04T08:16:19.637Z [WARN] consul.sync: failed to update services in Consul: error="Unexpected response code: 403 (ACL not found)"
Consul:
2024-07-04T08:13:06.832454+00:00 <node_name> consul[4075309]: 2024-07-04T08:13:06.832Z [INFO] agent: Synced service: service=_nomad-task-30f0b098-79ae-c813-9e38-4957fd2f336e-group-backoffice-parking-consumer-backoffice-parking-consumer-
2024-07-04T08:16:19.637477+00:00 <node_name> consul[4075309]: 2024-07-04T08:16:19.637Z [ERROR] agent.client: RPC failed to server: method=ACL.TokenRead server=<server_ip>:8300 error="rpc error making call: token does not exist: ACL not found"
2024-07-04T08:16:19.639781+00:00 <node_name> consul[4075309]: 2024-07-04T08:16:19.637Z [ERROR] agent.http: Request error: method=PUT url=/v1/agent/service/deregister/_nomad-task-30f0b098-79ae-c813-9e38-4957fd2f336e-group-backoffice-parking-consumer-backoffice-parking-consumer- from=127.0.0.1:35434 error="ACL not found"
2024-07-04T08:16:20.643069+00:00 <node_name> consul[4075309]: 2024-07-04T08:16:20.642Z [ERROR] agent.client: RPC failed to server: method=ACL.TokenRead server=<server_ip>:8300 error="rpc error making call: token does not exist: ACL not found"
2024-07-04T08:16:20.643185+00:00 <node_name> consul[4075309]: 2024-07-04T08:16:20.642Z [ERROR] agent.http: Request error: method=PUT url=/v1/agent/service/deregister/_nomad-task-30f0b098-79ae-c813-9e38-4957fd2f336e-group-backoffice-parking-consumer-backoffice-parking-consumer- from=127.0.0.1:35434 error="ACL not found"
2024-07-04T08:16:22.649259+00:00 <node_name> consul[4075309]: 2024-07-04T08:16:22.648Z [ERROR] agent.client: RPC failed to server: method=ACL.TokenRead server=<server_ip>:8300 error="rpc error making call: token does not exist: ACL not found"
2024-07-04T08:16:22.649380+00:00 <node_name> consul[4075309]: 2024-07-04T08:16:22.648Z [ERROR] agent.http: Request error: method=PUT url=/v1/agent/service/deregister/_nomad-task-30f0b098-79ae-c813-9e38-4957fd2f336e-group-backoffice-parking-consumer-backoffice-parking-consumer- from=127.0.0.1:35434 error="ACL not found"
2024-07-04T08:16:25.657196+00:00 <node_name> consul[4075309]: 2024-07-04T08:16:25.656Z [ERROR] agent.client: RPC failed to server: method=ACL.TokenRead server=<server_ip>:8300 error="rpc error making call: token does not exist: ACL not found"
2024-07-04T08:16:25.657348+00:00 <node_name> consul[4075309]: 2024-07-04T08:16:25.656Z [ERROR] agent.http: Request error: method=PUT url=/v1/agent/service/deregister/_nomad-task-30f0b098-79ae-c813-9e38-4957fd2f336e-group-backoffice-parking-consumer-backoffice-parking-consumer- from=127.0.0.1:35434 error="ACL not found"
2024-07-04T08:16:29.664423+00:00 <node_name> consul[4075309]: 2024-07-04T08:16:29.664Z [ERROR] agent.client: RPC failed to server: method=ACL.TokenRead server=<server_ip>:8300 error="rpc error making call: token does not exist: ACL not found"
2024-07-04T08:16:29.664523+00:00 <node_name> consul[4075309]: 2024-07-04T08:16:29.664Z [ERROR] agent.http: Request error: method=PUT url=/v1/agent/service/deregister/_nomad-task-30f0b098-79ae-c813-9e38-4957fd2f336e-group-backoffice-parking-consumer-backoffice-parking-consumer- from=127.0.0.1:35434 error="ACL not found"
2024-07-04T08:16:34.670396+00:00 <node_name> consul[4075309]: 2024-07-04T08:16:34.669Z [ERROR] agent.client: RPC failed to server: method=ACL.TokenRead server=<server_ip>:8300 error="rpc error making call: token does not exist: ACL not found"
2024-07-04T08:16:34.670535+00:00 <node_name> consul[4075309]: 2024-07-04T08:16:34.669Z [ERROR] agent.http: Request error: method=PUT url=/v1/agent/service/deregister/_nomad-task-30f0b098-79ae-c813-9e38-4957fd2f336e-group-backoffice-parking-consumer-backoffice-parking-consumer- from=127.0.0.1:35434 error="ACL not found"
2024-07-04T08:16:40.676407+00:00 <node_name> consul[4075309]: 2024-07-04T08:16:40.675Z [ERROR] agent.client: RPC failed to server: method=ACL.TokenRead server=<server_ip>:8300 error="rpc error making call: token does not exist: ACL not found"
2024-07-04T08:16:40.676589+00:00 <node_name> consul[4075309]: 2024-07-04T08:16:40.675Z [ERROR] agent.http: Request error: method=PUT url=/v1/agent/service/deregister/_nomad-task-30f0b098-79ae-c813-9e38-4957fd2f336e-group-backoffice-parking-consumer-backoffice-parking-consumer- from=127.0.0.1:35434 error="ACL not found"
2024-07-04T08:16:47.682542+00:00 <node_name> consul[4075309]: 2024-07-04T08:16:47.681Z [ERROR] agent.client: RPC failed to server: method=ACL.TokenRead server=<server_ip>:8300 error="rpc error making call: token does not exist: ACL not found"
2024-07-04T08:16:47.682691+00:00 <node_name> consul[4075309]: 2024-07-04T08:16:47.682Z [ERROR] agent.http: Request error: method=PUT url=/v1/agent/service/deregister/_nomad-task-30f0b098-79ae-c813-9e38-4957fd2f336e-group-backoffice-parking-consumer-backoffice-parking-consumer- from=127.0.0.1:35434 error="ACL not found"
2024-07-04T08:16:55.689246+00:00 <node_name> consul[4075309]: 2024-07-04T08:16:55.688Z [ERROR] agent.client: RPC failed to server: method=ACL.TokenRead server=<server_ip>:8300 error="rpc error making call: token does not exist: ACL not found"
2024-07-04T08:16:55.689333+00:00 <node_name> consul[4075309]: 2024-07-04T08:16:55.688Z [ERROR] agent.http: Request error: method=PUT url=/v1/agent/service/deregister/_nomad-task-30f0b098-79ae-c813-9e38-4957fd2f336e-group-backoffice-parking-consumer-backoffice-parking-consumer- from=127.0.0.1:35434 error="ACL not found"
Any server logs that include errors around the deregistration process or other triggering event (ex. if the node was marked "down", provide any logs you have around that event too)
Thanks
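In the meantime, orphaned registrations like the ones in the logs above can be swept manually. Here's a minimal sketch using the official Consul API client (github.com/hashicorp/consul/api); it assumes a token with permission to deregister services, and in real use you'd first confirm the allocation IDs embedded in the service IDs are actually gone:

```go
package main

import (
	"fmt"
	"log"
	"strings"

	"github.com/hashicorp/consul/api"
)

func main() {
	// DefaultConfig honors CONSUL_HTTP_ADDR / CONSUL_HTTP_TOKEN; the token
	// must be allowed to deregister services (e.g. a management token).
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// List services registered on the local agent.
	services, err := client.Agent().Services()
	if err != nil {
		log.Fatal(err)
	}

	for id := range services {
		// Nomad-registered services use the _nomad-task-<alloc-id>- prefix,
		// as seen in the deregistration errors in the logs above.
		if !strings.HasPrefix(id, "_nomad-task-") {
			continue
		}
		// Caution: verify the embedded alloc is gone (e.g. with
		// `nomad alloc status <alloc-id>`) before deregistering for real.
		if err := client.Agent().ServiceDeregister(id); err != nil {
			log.Printf("failed to deregister %s: %v", id, err)
			continue
		}
		fmt.Println("deregistered", id)
	}
}
```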
@ngcmac I'm going to continue the investigation for the case where we're getting "ACL not found" with Workload Identities in https://github.com/hashicorp/nomad/issues/23494. @natemollica-nm is going to do some follow-up on his report and we may or may not adjust #23494 to cover that case as well depending on the outcome.
Just upgraded to Nomad 1.5.2. Since then, services are not always deregistered from the Consul service catalog when they are shut down / upgraded, so old service versions appear as failed, e.g.
Environment:
Haven't yet found a pattern to reproduce it 100% of the time.