hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Zombie alloc after client restart #23684

Open bernardoVale opened 1 month ago

bernardoVale commented 1 month ago

Nomad version

Nomad v1.6.1
BuildDate 2023-07-21T13:49:42Z
Revision 515895c7690cdc72278018dc5dc58aca41204ccc

Operating system and Environment details

PRETTY_NAME="Debian GNU/Linux 11 (bullseye)"
NAME="Debian GNU/Linux"
VERSION_ID="11"
VERSION="11 (bullseye)"
VERSION_CODENAME=bullseye
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"

Issue

We've seen this issue a couple of times. It manifests as placement errors during deployments of services that must run on specific hosts. The deployment can't proceed because it has to place the allocation on one specific host, and that host still thinks an old allocation exists.

If an operator tries to kill the alloc in the UI (or via the CLI), we see it briefly transition from RUNNING to COMPLETE and then back to RUNNING.
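
For reference, the kill and the follow-up check from the CLI look roughly like this (the alloc ID is a placeholder; `nomad alloc stop` and `nomad alloc status` are standard subcommands, though exact flags may vary by Nomad version):

# ask the servers to stop the allocation, then watch its status flip back
nomad alloc stop <alloc-id>
nomad alloc status <alloc-id>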

Restarting the client doesn't help either. The only fix we've found is deleting the nomad-client state:

sudo systemctl stop nomad-client.service
sudo mkdir /data/service/nomad-client/data/client/bkp
sudo mv /data/service/nomad-client/data/client/state.db.backup /data/service/nomad-client/data/client/bkp/
sudo mv /data/service/nomad-client/data/client/state.db /data/service/nomad-client/data/client/bkp/
sudo systemctl start nomad-client
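
After the client comes back up and re-registers, the node's allocation list can be checked again to confirm the stale allocation is gone; a minimal sketch, assuming a placeholder node ID:

nomad node status -verbose <node-id>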

On the last two occasions, I noticed a pattern: a nomad-client restart logged at roughly the same time as the kill-alloc log. So it seems the client fails to persist that the alloc is dead and GC'd, and after the restart the server (or the client, I don't know how it works) still thinks the allocation exists.

I attached the logs from the client, the allocation id is 7e7d1649-3ecf-a1b4-b258-157944a59831

Reproduction steps

It's hard to reproduce (I couldn't), but my best guess would be to (see the sketch after this list):

  1. send a signal to restart/stop alloc
  2. immediately try to restart nomad client
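
A rough sketch of that sequence, with a placeholder alloc ID; the idea is that the client restart lands in the short window before the alloc's final state is persisted:

# stop the alloc in the background and restart the client right away
nomad alloc stop <alloc-id> &
sudo systemctl restart nomad-client.service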

Expected Result

Alloc state is COMPLETE

Actual Result

Alloc still exists and can't be killed

Nomad Server logs (if appropriate)

Let me know if it's relevant

Nomad Client logs (if appropriate)

logs

jrasell commented 1 month ago

Hi @bernardoVale and thanks for raising this issue. From your description, it sounds like a race condition in the client: an allocation is being stopped, and that process is interrupted by a Nomad agent shutdown before the state is committed.

If possible, it would be really useful to get some additional information about your cluster state via the API so we can piece the picture together. The allocation information can be gathered via this API and the node allocations via this endpoint.
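
For reference, both can be read from the agent's HTTP API, roughly like this (address and IDs are placeholders; add an ACL token header if your cluster uses ACLs):

curl -s "http://127.0.0.1:4646/v1/allocation/<alloc-id>" > bad-alloc.json
curl -s "http://127.0.0.1:4646/v1/node/<node-id>/allocations" > node_allocations.json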

The Nomad client gets the list of allocations, and the state it should move them to, from the Nomad servers. I suspect that, due to the interruption mentioned, subsequent calls to stop/kill the allocation are failing because the client is having problems finding the allocation runner state or similar.

I attached the logs from the client, the allocation id is 7e7d1649-3ecf-a1b4-b258-157944a59831

I do not see these logs available on this issue; they would be useful to look at.

bernardoVale commented 1 month ago

Hi @jrasell, thanks for the quick response. Sorry, I thought I had attached it; it's there now.

If possible, it would be really useful to get some additional information about your cluster state via the API so we can piece the picture together. The allocation information can be gathered via this API and the node allocations via this endpoint.

It's gone now, but I'll try next time.

bernardoVale commented 1 month ago

It happened again; I've attached the results of both API calls. The bad alloc is 2bca42e6-0500-c650-1fc5-28eae8e6b7d6.

node_allocations.json bad-alloc.json

Relevant nomad-client logs:

Jul 29 23:21:53 d4218.iad2.kentik.com nomad-client[2229034]:     2024-07-29T23:21:53.963Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=e99c2a22-e114-890f-ed43-2a626cbb7700 task=cd-functest type=Received msg="Task received by client" failed=false
Jul 29 23:21:53 d4218.iad2.kentik.com nomad-client[2229034]:     2024-07-29T23:21:53.966Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=2bca42e6-0500-c650-1fc5-28eae8e6b7d6 task=cd-functest type=Killing msg="Sent interrupt. Waiting 30s before force killing" failed=false
Jul 29 23:21:54 d4218.iad2.kentik.com nomad-client[2229034]:     2024-07-29T23:21:54.032Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=2bca42e6-0500-c650-1fc5-28eae8e6b7d6 task=cd-functest type=Terminated msg="Exit Code: 0" failed=false
Jul 29 23:21:54 d4218.iad2.kentik.com nomad-client[2229034]: 2024-07-29T23:21:54.034Z [DEBUG] plugin.stdio: received EOF, stopping recv loop: err="rpc error: code = Unavailable desc = error reading from server: EOF"
Jul 29 23:21:54 d4218.iad2.kentik.com nomad-client[2229034]:     2024-07-29T23:21:54.046Z [WARN]  client.driver_mgr.docker: RemoveImage on non-referenced counted image id: driver=docker image_id=sha256:9c36ac7b4d79a7840d1a81eac2aa622afcfc1649dcfd64fa9fb4563594c657d4
Jul 29 23:21:54 d4218.iad2.kentik.com nomad-client[2229034]:     2024-07-29T23:21:54.046Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=2bca42e6-0500-c650-1fc5-28eae8e6b7d6 task=cd-functest type=Killed msg="Task successfully killed" failed=false
Jul 29 23:21:54 d4218.iad2.kentik.com nomad-client[2229034]: 2024-07-29T23:21:54.750Z [DEBUG] plugin: reattached plugin process exited
Jul 29 23:21:54 d4218.iad2.kentik.com nomad-client[2229034]: 2024-07-29T23:21:54.750Z [DEBUG] plugin: plugin exited
Jul 29 23:21:54 d4218.iad2.kentik.com nomad-client[2229034]:     2024-07-29T23:21:54.872Z [INFO]  agent: (runner) stopping
Jul 29 23:21:54 d4218.iad2.kentik.com nomad-client[2229034]:     2024-07-29T23:21:54.872Z [INFO]  client.gc: marking allocation for GC: alloc_id=2bca42e6-0500-c650-1fc5-28eae8e6b7d6
Jul 29 23:21:54 d4218.iad2.kentik.com nomad-client[2229034]:     2024-07-29T23:21:54.872Z [INFO]  agent: (runner) received finish
Jul 29 23:21:54 d4218.iad2.kentik.com nomad-client[2229034]:     2024-07-29T23:21:54.881Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=e99c2a22-e114-890f-ed43-2a626cbb7700 task=cd-functest type="Task Setup" msg="Building Task Directory" failed=false
Jul 29 23:21:56 d4218.iad2.kentik.com systemd[1]: Stopping nomad-client.service - Nomad Client Service...
Jul 29 23:21:56 d4218.iad2.kentik.com nomad-client[2229034]: ==> Caught signal: interrupt
Jul 29 23:21:56 d4218.iad2.kentik.com nomad-client[2229034]:     2024-07-29T23:21:56.576Z [INFO]  agent: requesting shutdown
Jul 29 23:21:56 d4218.iad2.kentik.com nomad-client[2229034]:     2024-07-29T23:21:56.576Z [INFO]  client: shutting down
Jul 29 23:21:56 d4218.iad2.kentik.com nomad-client[2229034]:     2024-07-29T23:21:56.680Z [INFO]  agent: (runner) creating new runner (dry: false, once: false)
Jul 29 23:21:56 d4218.iad2.kentik.com nomad-client[2229034]:     2024-07-29T23:21:56.680Z [INFO]  agent: (runner) creating watcher
Jul 29 23:21:56 d4218.iad2.kentik.com nomad-client[2229034]:     2024-07-29T23:21:56.681Z [INFO]  agent: (runner) starting
Jul 29 23:21:56 d4218.iad2.kentik.com nomad-client[2229034]:     2024-07-29T23:21:56.683Z [INFO]  agent: (runner) rendered "(dynamic)" => "/data/service/nomad-client/data/alloc/e99c2a22-e114-890f-ed43-2a626cbb7700/cd-functest/conf/lightstep.env"
Jul 29 23:21:56 d4218.iad2.kentik.com nomad-client[2229034]:     2024-07-29T23:21:56.772Z [INFO]  client.plugin: shutting down plugin manager: plugin-type=device
Jul 29 23:21:56 d4218.iad2.kentik.com nomad-client[2229034]:     2024-07-29T23:21:56.772Z [INFO]  client.plugin: plugin manager finished: plugin-type=device
Jul 29 23:21:56 d4218.iad2.kentik.com nomad-client[2229034]:     2024-07-29T23:21:56.772Z [INFO]  client.plugin: shutting down plugin manager: plugin-type=driver
Jul 29 23:21:56 d4218.iad2.kentik.com nomad-client[2229034]:     2024-07-29T23:21:56.799Z [INFO]  client.plugin: plugin manager finished: plugin-type=driver
Jul 29 23:21:56 d4218.iad2.kentik.com nomad-client[2229034]:     2024-07-29T23:21:56.799Z [INFO]  client.plugin: shutting down plugin manager: plugin-type=csi
Jul 29 23:21:56 d4218.iad2.kentik.com nomad-client[2229034]:     2024-07-29T23:21:56.799Z [INFO]  client.plugin: plugin manager finished: plugin-type=csi
Jul 29 23:21:56 d4218.iad2.kentik.com nomad-client[2229034]:     2024-07-29T23:21:56.860Z [INFO]  agent: shutdown complete
Jul 29 23:21:56 d4218.iad2.kentik.com systemd[1]: nomad-client.service: Main process exited, code=exited, status=1/FAILURE
mhehle commented 1 week ago

@bernardoVale what did you do to recover the client node?

bernardoVale commented 1 week ago

@bernardoVale what did you do to recover the client node?

@mhehle stop the agent, then delete the agent's internal state DB:

sudo systemctl stop nomad-client.service
sudo rm /data/service/nomad-client/data/client/state.db.backup
sudo rm /data/service/nomad-client/data/client/state.db
sudo systemctl start nomad-client
mhehle commented 1 week ago

@bernardoVale what did you do to recover the client node?

@mhehle stop the agent, then delete the agent's internal state DB:

sudo systemctl stop nomad-client.service
sudo rm /data/service/nomad-client/data/client/state.db.backup
sudo rm /data/service/nomad-client/data/client/state.db
sudo systemctl start nomad-client

Thank you @bernardoVale!