Open bernardoVale opened 3 months ago
Hi @bernardoVale and thanks for raising this issue. It sounds from your description like a race condition in the client when stopping an allocation and this process is interrupted at some point before state is committed by a Nomad agent process interruption.
If possible, it would be really useful to get some additional information about your cluster state via the API to piece together the information. The allocation information can be gathered via this API and the node allocations via this endpoint.
The Nomad client is getting the list of allocations and what state it should move them to from the Nomad servers. I suspect that due to the interruption mentioned, subsequent calls to stop/kill the allocation are failing as the client is having problems finding the allocation runner state or similar.
I attached the logs from the client, the allocation id is 7e7d1649-3ecf-a1b4-b258-157944a59831
I do not see these logs available on this issue; they would be useful to look at.
Hi @jrasell thanks for the quick response, sorry I thought I had attached it, it's there now.
If possible, it would be really useful to get some additional information about your cluster state via the API to piece together the information. The allocation information can be gathered via this API and the node allocations via this endpoint.
It's gone now, but I'll try next time.
It happened again, I attached the result of both API calls. The bad alloc is 2bca42e6-0500-c650-1fc5-28eae8e6b7d6
node_allocations.json bad-alloc.json
Relevant nomad-client logs:
Jul 29 23:21:53 d4218.iad2.kentik.com nomad-client[2229034]: 2024-07-29T23:21:53.963Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=e99c2a22-e114-890f-ed43-2a626cbb7700 task=cd-functest type=Received msg="Task received by client" failed=false
Jul 29 23:21:53 d4218.iad2.kentik.com nomad-client[2229034]: 2024-07-29T23:21:53.966Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=2bca42e6-0500-c650-1fc5-28eae8e6b7d6 task=cd-functest type=Killing msg="Sent interrupt. Waiting 30s before force killing" failed=false
Jul 29 23:21:54 d4218.iad2.kentik.com nomad-client[2229034]: 2024-07-29T23:21:54.032Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=2bca42e6-0500-c650-1fc5-28eae8e6b7d6 task=cd-functest type=Terminated msg="Exit Code: 0" failed=false
Jul 29 23:21:54 d4218.iad2.kentik.com nomad-client[2229034]: 2024-07-29T23:21:54.034Z [DEBUG] plugin.stdio: received EOF, stopping recv loop: err="rpc error: code = Unavailable desc = error reading from server: EOF"
Jul 29 23:21:54 d4218.iad2.kentik.com nomad-client[2229034]: 2024-07-29T23:21:54.046Z [WARN] client.driver_mgr.docker: RemoveImage on non-referenced counted image id: driver=docker image_id=sha256:9c36ac7b4d79a7840d1a81eac2aa622afcfc1649dcfd64fa9fb4563594c657d4
Jul 29 23:21:54 d4218.iad2.kentik.com nomad-client[2229034]: 2024-07-29T23:21:54.046Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=2bca42e6-0500-c650-1fc5-28eae8e6b7d6 task=cd-functest type=Killed msg="Task successfully killed" failed=false
Jul 29 23:21:54 d4218.iad2.kentik.com nomad-client[2229034]: 2024-07-29T23:21:54.750Z [DEBUG] plugin: reattached plugin process exited
Jul 29 23:21:54 d4218.iad2.kentik.com nomad-client[2229034]: 2024-07-29T23:21:54.750Z [DEBUG] plugin: plugin exited
Jul 29 23:21:54 d4218.iad2.kentik.com nomad-client[2229034]: 2024-07-29T23:21:54.872Z [INFO] agent: (runner) stopping
Jul 29 23:21:54 d4218.iad2.kentik.com nomad-client[2229034]: 2024-07-29T23:21:54.872Z [INFO] client.gc: marking allocation for GC: alloc_id=2bca42e6-0500-c650-1fc5-28eae8e6b7d6
Jul 29 23:21:54 d4218.iad2.kentik.com nomad-client[2229034]: 2024-07-29T23:21:54.872Z [INFO] agent: (runner) received finish
Jul 29 23:21:54 d4218.iad2.kentik.com nomad-client[2229034]: 2024-07-29T23:21:54.881Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=e99c2a22-e114-890f-ed43-2a626cbb7700 task=cd-functest type="Task Setup" msg="Building Task Directory" failed=false
Jul 29 23:21:56 d4218.iad2.kentik.com systemd[1]: Stopping nomad-client.service - Nomad Client Service...
Jul 29 23:21:56 d4218.iad2.kentik.com nomad-client[2229034]: ==> Caught signal: interrupt
Jul 29 23:21:56 d4218.iad2.kentik.com nomad-client[2229034]: 2024-07-29T23:21:56.576Z [INFO] agent: requesting shutdown
Jul 29 23:21:56 d4218.iad2.kentik.com nomad-client[2229034]: 2024-07-29T23:21:56.576Z [INFO] client: shutting down
Jul 29 23:21:56 d4218.iad2.kentik.com nomad-client[2229034]: 2024-07-29T23:21:56.680Z [INFO] agent: (runner) creating new runner (dry: false, once: false)
Jul 29 23:21:56 d4218.iad2.kentik.com nomad-client[2229034]: 2024-07-29T23:21:56.680Z [INFO] agent: (runner) creating watcher
Jul 29 23:21:56 d4218.iad2.kentik.com nomad-client[2229034]: 2024-07-29T23:21:56.681Z [INFO] agent: (runner) starting
Jul 29 23:21:56 d4218.iad2.kentik.com nomad-client[2229034]: 2024-07-29T23:21:56.683Z [INFO] agent: (runner) rendered "(dynamic)" => "/data/service/nomad-client/data/alloc/e99c2a22-e114-890f-ed43-2a626cbb7700/cd-functest/conf/lightstep.env"
Jul 29 23:21:56 d4218.iad2.kentik.com nomad-client[2229034]: 2024-07-29T23:21:56.772Z [INFO] client.plugin: shutting down plugin manager: plugin-type=device
Jul 29 23:21:56 d4218.iad2.kentik.com nomad-client[2229034]: 2024-07-29T23:21:56.772Z [INFO] client.plugin: plugin manager finished: plugin-type=device
Jul 29 23:21:56 d4218.iad2.kentik.com nomad-client[2229034]: 2024-07-29T23:21:56.772Z [INFO] client.plugin: shutting down plugin manager: plugin-type=driver
Jul 29 23:21:56 d4218.iad2.kentik.com nomad-client[2229034]: 2024-07-29T23:21:56.799Z [INFO] client.plugin: plugin manager finished: plugin-type=driver
Jul 29 23:21:56 d4218.iad2.kentik.com nomad-client[2229034]: 2024-07-29T23:21:56.799Z [INFO] client.plugin: shutting down plugin manager: plugin-type=csi
Jul 29 23:21:56 d4218.iad2.kentik.com nomad-client[2229034]: 2024-07-29T23:21:56.799Z [INFO] client.plugin: plugin manager finished: plugin-type=csi
Jul 29 23:21:56 d4218.iad2.kentik.com nomad-client[2229034]: 2024-07-29T23:21:56.860Z [INFO] agent: shutdown complete
Jul 29 23:21:56 d4218.iad2.kentik.com systemd[1]: nomad-client.service: Main process exited, code=exited, status=1/FAILURE
@bernardoVale what did you do to recover the client node?
@bernardoVale what did you do to recover the client node?
@mhehle stop the agent, then delete agent's internal db
sudo systemctl stop nomad-client.service
sudo rm /data/service/nomad-client/data/client/state.db.backup
sudo rm /data/service/nomad-client/data/client/state.db
sudo systemctl start nomad-client
@bernardoVale what did you do to recover the client node?
@mhehle stop the agent, then delete agent's internal db
sudo systemctl stop nomad-client.service sudo rm /data/service/nomad-client/data/client/state.db.backup sudo rm /data/service/nomad-client/data/client/state.db sudo systemctl start nomad-client
Thank you @bernardoVale!
Nomad version
Operating system and Environment details
Issue
We've seen this issue a couple of times. It manifests as placement errors during deployments of services that must run on specific hosts. The deploy can't proceed because it must place the allocation in one host and that host still thinks one old allocation exists.
If an operator tries to kill the alloc in the UI (or in the cli), we see it briefly transition from
RUNNING
toCOMPLETE
and then back toRUNNING
again.Restarting the client doesn't help as well. The only fix we've found is by deleting nomad-client state:
On the last two occasions, I noticed a pattern. A nomad-client restart log roughly at the same time as the kill alloc log. So it seems that client fails to persist the state that the alloc is dead and gc-ed and then it starts the server (or client, I don't know how it works) to think the allocation still exists.
I attached the logs from the client, the allocation id is
7e7d1649-3ecf-a1b4-b258-157944a59831
Reproduction steps
It's hard to reproduce, I couldn't, but my best guess would be to:
Expected Result
Alloc state is
COMPLETE
Actual Result
Alloc still exists and can't be killed
Nomad Server logs (if appropriate)
Let me know if it's relevant
Nomad Client logs (if appropriate)
logs