hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Nomad Client stuck initializing #18541

Open bernardoVale opened 9 months ago

bernardoVale commented 9 months ago

Nomad version

Nomad v1.6.1
BuildDate 2023-07-21T13:49:42Z
Revision 515895c7690cdc72278018dc5dc58aca41204ccc

Operating system and Environment details

Debian GNU/Linux 11 (bullseye)

Issue

After a client restart, the node gets stuck in the initializing state. Once it gets stuck, only another restart makes it recover.

nomad node status -verbose

ID                                    Node Pool  DC   Name                 Class    Address     Version  Drain  Eligibility  Status
3de7fa5a-336f-d976-1c14-7a6caafd564e  default    acme  s505.acme.com  storage  10.5.1.210  1.6.1    false  eligible     ready
fa18d47e-c16e-947a-958b-6486a862516a  default    acme  s402.acme.com  storage  10.5.1.203  1.6.1    false  eligible     ready
7d0e570c-0615-34a4-227b-568e8af91ca5  default    acme  s403.acme.com  storage  10.5.1.205  1.6.1    false  eligible     ready
625b55d4-b763-ee41-8b5c-bd93ee15cfec  default    acme  s502.acme.com  storage  10.5.1.204  1.6.1    false  eligible     ready
23523791-f4d1-499c-834c-3af34d0d6ffb  default    acme  s501.acme.com  storage  10.5.1.202  1.6.1    false  eligible     ready
34612ebb-4418-3d2d-bf70-0e188451429e  default    acme  s404.acme.com  storage  10.5.1.207  1.6.1    false  eligible     ready
a6fee44f-5398-dae6-049d-e65548f5d076  default    acme  s506.acme.com  storage  10.5.1.212  1.6.1    false  eligible     ready
7adfb092-6b29-6dab-12bf-afe5a6b6c900  default    acme  s405.acme.com  storage  10.5.1.209  1.6.1    false  eligible     ready
eadfd1af-91d8-860c-3a81-022bb94c9db5  default    acme  s503.acme.com  storage  10.5.1.206  1.6.1    false  eligible     initializing
nomad node status -verbose eadfd1af-91d8-860c-3a81-022bb94c9db5

ID              = eadfd1af-91d8-860c-3a81-022bb94c9db5
Name            = s503.acme.com
Node Pool       = default
Class           = storage
DC              = acme
Drain           = false
Eligibility     = eligible
Status          = initializing
CSI Controllers = <none>
CSI Drivers     = <none>
Uptime          = 4840h58m33s

Host Volumes
Name          ReadOnly  Source
blob-storage  false     /data/blob-storage

Drivers
Driver    Detected  Healthy  Message   Time
docker    true      true     Healthy   2023-09-20T13:35:08Z
exec      true      true     Healthy   2023-09-20T13:35:08Z
java      false     false    <none>    2023-09-20T13:35:08Z
qemu      false     false    <none>    2023-09-20T13:35:08Z
raw_exec  false     false    disabled  2023-09-20T13:35:08Z

Node Events
Time                  Subsystem  Message                Details
2023-09-20T13:35:08Z  Cluster    Node heartbeat missed  <none>
2023-05-30T12:03:16Z  Cluster    Node re-registered     <none>
2023-05-30T11:48:28Z  Cluster    Node heartbeat missed  <none>
2023-03-03T07:35:38Z  Cluster    Node re-registered     <none>
2023-03-03T00:56:25Z  Cluster    Node heartbeat missed  <none>
2022-11-22T14:25:24Z  Cluster    Node registered        <none>

So far, every time we experience this problem, the node contains a Node heartbeat missed event.

Reproduction steps

We don't know how to reproduce it yet.

Expected Result

The client restarts and eventually transitions to the ready state.

Actual Result

The client restarts and might get stuck in the initializing state.

Nomad Server logs (if appropriate)

Sep 20 01:17:40 c206.acme.com nomad-server[3895982]:     2023-09-20T01:17:40.782Z [INFO]  nomad.raft: starting snapshot up to: index=1589250
Sep 20 01:17:40 c206.acme.com nomad-server[3895982]:     2023-09-20T01:17:40.782Z [INFO]  snapshot: creating new snapshot: path=/data/service/nomad-server/data/server/raft/snapshots/589114-1589250-1695172660782.tmp
Sep 20 01:17:40 c206.acme.com nomad-server[3895982]:     2023-09-20T01:17:40.905Z [INFO]  snapshot: reaping snapshot: path=/data/service/nomad-server/data/server/raft/snapshots/589114-1572852-1694931874411
Sep 20 01:17:40 c206.acme.com nomad-server[3895982]:     2023-09-20T01:17:40.906Z [INFO]  nomad.raft: compacting logs: from=1570815 to=1579010
Sep 20 01:17:40 c206.acme.com nomad-server[3895982]:     2023-09-20T01:17:40.958Z [INFO]  nomad.raft: snapshot complete up to: index=1589250
Sep 20 13:33:33 c206.acme.com nomad-server[3895982]:     2023-09-20T13:33:33.748Z [WARN]  nomad.heartbeat: node TTL expired: node_id=3c6f37a6-d45d-925d-90de-0ddc7321d679
Sep 20 13:34:26 c206.acme.com nomad-server[3895982]:     2023-09-20T13:34:26.565Z [WARN]  nomad.heartbeat: node TTL expired: node_id=f6b4703e-bdca-a959-90b8-60637a165592
Sep 20 13:35:08 c206.acme.com nomad-server[3895982]:     2023-09-20T13:35:08.316Z [WARN]  nomad.heartbeat: node TTL expired: node_id=eadfd1af-91d8-860c-3a81-022bb94c9db5

Nomad Client logs (if appropriate)

Sep 20 13:35:08 s503.acme.com nomad-client[2331106]:     2023-09-20T13:35:08.435Z [INFO]  client.plugin: starting plugin manager: plugin-type=driver
Sep 20 13:35:08 s503.acme.com nomad-client[2331106]:     2023-09-20T13:35:08.435Z [INFO]  client.plugin: starting plugin manager: plugin-type=device
Sep 20 13:35:08 s503.acme.com nomad-client[2331106]:     2023-09-20T13:35:08.469Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=102852d4-8b75-e5b4-fdb3-03319ee6c8bc task=apigw-envoy-be type=Received msg="Task received by client" failed=false
Sep 20 13:35:08 s503.acme.com nomad-client[2331106]:     2023-09-20T13:35:08.470Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=102852d4-8b75-e5b4-fdb3-03319ee6c8bc task=proxy-init type=Received msg="Task received by client" failed=false
Sep 20 13:35:08 s503.acme.com nomad-client[2331106]:     2023-09-20T13:35:08.483Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=491e3dfa-68d2-61d3-2971-3f459c98fe91 task=tag-sync type=Received msg="Task received by client" failed=false
Sep 20 13:35:08 s503.acme.com nomad-client[2331106]:     2023-09-20T13:35:08.490Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=61647409-6e33-b6d8-5275-cc1aac95da1b task=kde-planner type=Received msg="Task received by client" failed=false
Sep 20 13:35:08 s503.acme.com nomad-client[2331106]:     2023-09-20T13:35:08.498Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=a1e0d299-5b7d-e551-d885-7589b6472986 task=kde-cache type=Received msg="Task received by client" failed=false
Sep 20 13:35:08 s503.acme.com nomad-client[2331106]:     2023-09-20T13:35:08.503Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=ed544457-3a35-6b86-5219-1b1b205365b6 task=controlplane-agent type=Received msg="Task received by client" failed=false
Sep 20 13:35:08 s503.acme.com nomad-client[2331106]:     2023-09-20T13:35:08.517Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=fcacf1a5-10f3-29cb-618e-e51b937fa6da task=kde-worker-b type=Received msg="Task received by client" failed=false
Sep 20 13:35:08 s503.acme.com nomad-client[2331106]:     2023-09-20T13:35:08.523Z [INFO]  client: started client: node_id=eadfd1af-91d8-860c-3a81-022bb94c9db5
Sep 20 13:35:08 s503.acme.com nomad-client[2331106]:     2023-09-20T13:35:08.523Z [INFO]  client: node registration complete
Sep 20 13:35:08 s503.acme.com nomad-client[2331106]:     2023-09-20T13:35:08.570Z [WARN]  client.alloc_runner.task_runner.task_hook.api: error creating task api socket: alloc_id=ed544457-3a35-6b86-5219-1b1b205365b6 task=controlplane-agent path=/data/service/nomad-client/data/alloc/ed544457-3a35-6b86-5219-1b1b205365b6/controlplane-agent/secrets/api.sock error="listen>
Sep 20 13:35:08 s503.acme.com nomad-client[2331106]:     2023-09-20T13:35:08.612Z [INFO]  agent: (runner) creating new runner (dry: false, once: false)
Sep 20 13:35:08 s503.acme.com nomad-client[2331106]:     2023-09-20T13:35:08.614Z [INFO]  agent: (runner) creating watcher
Sep 20 13:35:08 s503.acme.com nomad-client[2331106]:     2023-09-20T13:35:08.614Z [INFO]  agent: (runner) starting
Sep 20 13:35:08 s503.acme.com nomad-client[2331106]:     2023-09-20T13:35:08.652Z [INFO]  agent: (runner) creating new runner (dry: false, once: false)
Sep 20 13:35:08 s503.acme.com nomad-client[2331106]:     2023-09-20T13:35:08.653Z [INFO]  agent: (runner) creating watcher
Sep 20 13:35:08 s503.acme.com nomad-client[2331106]:     2023-09-20T13:35:08.653Z [INFO]  agent: (runner) starting
Sep 20 13:35:08 s503.acme.com nomad-client[2331106]:     2023-09-20T13:35:08.680Z [INFO]  agent: (runner) creating new runner (dry: false, once: false)
Sep 20 13:35:08 s503.acme.com nomad-client[2331106]:     2023-09-20T13:35:08.682Z [INFO]  agent: (runner) creating watcher
Sep 20 13:35:08 s503.acme.com nomad-client[2331106]:     2023-09-20T13:35:08.682Z [INFO]  agent: (runner) starting
Sep 20 13:35:08 s503.acme.com nomad-client[2331106]:     2023-09-20T13:35:08.710Z [INFO]  agent: (runner) creating new runner (dry: false, once: false)
Sep 20 13:35:08 s503.acme.com nomad-client[2331106]:     2023-09-20T13:35:08.711Z [INFO]  agent: (runner) creating watcher
Sep 20 13:35:08 s503.acme.com nomad-client[2331106]:     2023-09-20T13:35:08.712Z [INFO]  agent: (runner) starting
Sep 20 13:35:08 s503.acme.com nomad-client[2331106]:     2023-09-20T13:35:08.741Z [INFO]  agent: (runner) creating new runner (dry: false, once: false)
Sep 20 13:35:08 s503.acme.com nomad-client[2331106]:     2023-09-20T13:35:08.742Z [INFO]  agent: (runner) creating watcher
Sep 20 13:35:08 s503.acme.com nomad-client[2331106]:     2023-09-20T13:35:08.743Z [INFO]  agent: (runner) starting
Sep 20 13:35:08 s503.acme.com nomad-client[2331106]:     2023-09-20T13:35:08.783Z [INFO]  agent: (runner) creating new runner (dry: false, once: false)
Sep 20 13:35:08 s503.acme.com nomad-client[2331106]:     2023-09-20T13:35:08.784Z [INFO]  agent: (runner) creating watcher
Sep 20 13:35:08 s503.acme.com nomad-client[2331106]:     2023-09-20T13:35:08.784Z [INFO]  agent: (runner) starting
Sep 20 13:35:18 s503.acme.com nomad-client[2331106]:     2023-09-20T13:35:18.070Z [INFO]  client: node registration complete
tgross commented 9 months ago

Hi @bernardoVale! If the node is unable to heartbeat, that's the behavior I'd expect. The server leaves clients in the initializing state until they successfully heartbeat once. This prevents a race condition where the node comes up, says hi, and immediately has its work stopped because the server doesn't have a valid heartbeat.

The logs of most interest here would be those following the client: node registration complete and what's happening with the client at that point.

bernardoVale commented 9 months ago

The server leaves clients in the initializing state until they successfully heartbeat once

Maybe the problem is that the client thinks it's initialized but the server is still waiting for a successful heartbeat, so the client never retries

The logs of most interest here would be those following the client: node registration complete and what's happening with the client at that point.

That's the problem. That's the last line. I haven't recovered the client yet; there's still a window for debug.

Last two lines:

Sep 20 13:35:18 s503.acme.com nomad-client[2331106]:     2023-09-20T13:35:18.070Z [INFO]  client: node registration complete
Sep 20 13:35:26 s503.acme.com nomad-client[2331106]: ==> Newer Nomad version available: 1.6.2 (currently running: 1.6.1)
bernardoVale commented 9 months ago

@tgross here's an interesting thing. I manually stopped a container in that client to see what happens and it transitioned to ready automatically. Here are the new logs:

Sep 20 13:35:18 s503.acme.com nomad-client[2331106]:     2023-09-20T13:35:18.070Z [INFO]  client: node registration complete
Sep 20 13:35:26 s503.acme.com nomad-client[2331106]: ==> Newer Nomad version available: 1.6.2 (currently running: 1.6.1)
Sep 20 19:14:51 s503.acme.com nomad-client[2331106]:     2023-09-20T19:14:51.891Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=ed544457-3a35-6b86-5219-1b1b205365b6 task=controlplane-agent type=Terminated msg="Exit Code: 0" failed=false
Sep 20 19:14:51 s503.acme.com nomad-client[2331106]: 2023-09-20T19:14:51.891Z [DEBUG] plugin.stdio: received EOF, stopping recv loop: err="rpc error: code = Unavailable desc = error reading from server: EOF"
Sep 20 19:14:52 s503.acme.com nomad-client[2331106]:     2023-09-20T19:14:52.026Z [WARN]  client.driver_mgr.docker: RemoveImage on non-referenced counted image id: driver=docker image_id=sha256:8e3fdc0ef916e5b262568abfc2bf27a990decdff67070228b372233e7ad68cdd
Sep 20 19:14:52 s503.acme.com nomad-client[2331106]:     2023-09-20T19:14:52.026Z [INFO]  client.alloc_runner.task_runner: restarting task: alloc_id=ed544457-3a35-6b86-5219-1b1b205365b6 task=controlplane-agent reason="Restart within policy" delay=23.405899587s
Sep 20 19:14:52 s503.acme.com nomad-client[2331106]:     2023-09-20T19:14:52.026Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=ed544457-3a35-6b86-5219-1b1b205365b6 task=controlplane-agent type=Restarting msg="Task restarting in 23.405899587s" failed=false
Sep 20 19:14:52 s503.acme.com nomad-client[2331106]:     2023-09-20T19:14:52.234Z [WARN]  client: missed heartbeat: req_latency=63.692148ms heartbeat_ttl=17.368110051s since_last_heartbeat=9.097260254s
Sep 20 19:14:52 s503.acme.com nomad-client[2331106]: 2023-09-20T19:14:52.505Z [DEBUG] plugin: reattached plugin process exited
Sep 20 19:14:52 s503.acme.com nomad-client[2331106]: 2023-09-20T19:14:52.505Z [DEBUG] plugin: plugin exited
Sep 20 19:15:02 s503.acme.com nomad-client[2331106]:     2023-09-20T19:15:02.971Z [INFO]  agent: (runner) stopping
Sep 20 19:15:02 s503.acme.com nomad-client[2331106]:     2023-09-20T19:15:02.971Z [INFO]  agent: (runner) received finish
Sep 20 19:15:02 s503.acme.com nomad-client[2331106]:     2023-09-20T19:15:02.971Z [INFO]  agent: (runner) creating new runner (dry: false, once: false)
Sep 20 19:15:02 s503.acme.com nomad-client[2331106]:     2023-09-20T19:15:02.972Z [INFO]  agent: (runner) creating watcher
Sep 20 19:15:02 s503.acme.com nomad-client[2331106]:     2023-09-20T19:15:02.972Z [INFO]  agent: (runner) starting
Sep 20 19:15:15 s503.acme.com nomad-client[2331106]:     2023-09-20T19:15:15.570Z [ERROR] client.alloc_runner.task_runner.task_hook: failed to start logmon: alloc_id=ed544457-3a35-6b86-5219-1b1b205365b6 task=controlplane-agent error="plugin is shut down"
Sep 20 19:15:15 s503.acme.com nomad-client[2331106]:     2023-09-20T19:15:15.570Z [WARN]  client.alloc_runner.task_runner.task_hook: logmon shutdown while making request: alloc_id=ed544457-3a35-6b86-5219-1b1b205365b6 task=controlplane-agent error="plugin is shut down"
Sep 20 19:15:15 s503.acme.com nomad-client[2331106]:     2023-09-20T19:15:15.570Z [WARN]  client.alloc_runner.task_runner.task_hook: logmon shutdown while making request; retrying: alloc_id=ed544457-3a35-6b86-5219-1b1b205365b6 task=controlplane-agent attempts=1 error="plugin is shut down"
Sep 20 19:15:16 s503.acme.com nomad-client[2331106]:     2023-09-20T19:15:16.608Z [WARN]  client.alloc_runner.task_runner.task_hook.api: error creating task api socket: alloc_id=ed544457-3a35-6b86-5219-1b1b205365b6 task=controlplane-agent path=/data/service/nomad-client/data/alloc/ed544457-3a35-6b86-5219-1b1b205365b6/controlplane-agent/secrets/api.sock error="listen unix /data/service/nomad-client/data/alloc/ed544457-3a35-6b86-5219-1b1b205365b6/controlplane-agent/secrets/api.sock: bind: invalid argument"
Sep 20 19:15:17 s503.acme.com nomad-client[2331106]:     2023-09-20T19:15:17.156Z [INFO]  client.driver_mgr.docker: created container: driver=docker container_id=6c7b1007c9668cbf5554d93e8921d97a23a1274fd99797a433edd8e29c2830dd
Sep 20 19:15:17 s503.acme.com nomad-client[2331106]:     2023-09-20T19:15:17.370Z [INFO]  client.driver_mgr.docker: started container: driver=docker container_id=6c7b1007c9668cbf5554d93e8921d97a23a1274fd99797a433edd8e29c2830dd
Sep 20 19:15:17 s503.acme.com nomad-client[2331106]:     2023-09-20T19:15:17.403Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=ed544457-3a35-6b86-5219-1b1b205365b6 task=controlplane-agent type=Started msg="Task started by client" failed=false
tgross commented 9 months ago

Maybe the problem is that the client thinks it's initialized but the server is still waiting for a successful heartbeat, so the client never retries

Ok, but when you run nomad node status it doesn't go and check in with the client. So you're necessarily always seeing the server's view of the client. Depending on the status field, that view could be either because it got an update from the client or because it hasn't heard from the client in a while and has done something like marked the node lost. In this case I'm pretty sure they line up anyways, but just an important tip for the future.

This log line in particular is interesting, because we see it missed a heartbeat after having a successful one only 9s prior. Do you have node events from nomad node status for this node around the time you stopped the container?

Sep 20 19:14:52 s503.acme.com nomad-client[2331106]: 2023-09-20T19:14:52.234Z [WARN] client: missed heartbeat: req_latency=63.692148ms heartbeat_ttl=17.368110051s since_last_heartbeat=9.097260254s

That's the problem. That's the last line. I haven't recovered the client yet; there's still a window for debug. ... @tgross here's an interesting thing. I manually stopped a container in that client to see what happens and it transitioned to ready automatically.

That worries me a bit. We did some work in recent versions to reduce the number of updates (and therefore server load) by being more aggressive about batching them. A task event like an allocation stopping would "kick" the trigger again. That implies a possible bug in the heartbeat timing.

If this happens again, can you take a goroutine profile of the "stuck" node? You can do this by either:
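
Two typical ways to capture one, sketched here under the assumption that either enable_debug = true is set on the client agent or a sufficiently privileged ACL token is available: send SIGQUIT to the agent process (the Go runtime then prints a goroutine dump to stderr, which journald captures for a systemd unit), or pull a dump from the agent's pprof HTTP endpoint. A minimal Go sketch of the latter; the agent address, output path, and token handling are illustrative only:

package main

import (
	"io"
	"log"
	"net/http"
	"os"
)

func main() {
	// debug=2 asks for the full, human-readable form of the goroutine dump.
	// The local agent address below is illustrative; adjust for your client.
	req, err := http.NewRequest(http.MethodGet,
		"http://127.0.0.1:4646/v1/agent/pprof/goroutine?debug=2", nil)
	if err != nil {
		log.Fatal(err)
	}
	// If ACLs are enabled, pass a token allowed to read agent profiles.
	if token := os.Getenv("NOMAD_TOKEN"); token != "" {
		req.Header.Set("X-Nomad-Token", token)
	}

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	out, err := os.Create("goroutine-dump.txt")
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()

	if _, err := io.Copy(out, resp.Body); err != nil {
		log.Fatal(err)
	}
}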

You can then email that to nomad-oss-debug@hashicorp.com (a write-only mailing list), refer to this issue, and we'll analyze it for you.

bernardoVale commented 9 months ago

Ok, but when you run nomad node status it doesn't go and check in with the client. So you're necessarily always seeing the server's view of the client. Depending on the status field, that view could be either because it got an update from the client or because it hasn't heard from the client in a while and has done something like marked the node lost. In this case I'm pretty sure they line up anyways, but just an important tip for the future.

I'm not sure I followed what you meant here, but here's another important piece of information: we have a bunch of jobs that are pinned to certain hosts, and we spotted this problem because of deployment failures.

If someone tries to deploy a job that is pinned to a stuck host, it leads to "placement failures" because, from the server's point of view, the node is initializing.

This log line in particular is interesting, because we see it missed a heartbeat after having a successful one only 9s prior. Do you have node events from nomad node status for this node around the time you stopped the container?

There was no additional event:

Node Events
Time                  Subsystem  Message                Details
2023-09-20T13:35:08Z  Cluster    Node heartbeat missed  <none>
2023-05-30T12:03:16Z  Cluster    Node re-registered     <none>
2023-05-30T11:48:28Z  Cluster    Node heartbeat missed  <none>
2023-03-03T07:35:38Z  Cluster    Node re-registered     <none>
2023-03-03T00:56:25Z  Cluster    Node heartbeat missed  <none>
2022-11-22T14:25:24Z  Cluster    Node registered        <none>

If this happens again, can you take a goroutine profile of the "stuck" node?

Sure, I will

bernardoVale commented 9 months ago

Hi @tgross, we got another stuck node today. I sent the SIGQUIT dump via e-mail. I couldn't take a goroutine profile because we don't have ACLs enabled or enable_debug set. I'll set it up so we can provide the profile next time.

ingwarsw commented 9 months ago

Hello @tgross, I'm working with @bernardoVale. Today I found an even stranger example of this issue.

The Nomad client was working, then as usual it missed a heartbeat and went into the initializing state forever. But this time restarting nomad-client did not help; after the restart, the client went into the initializing state again.

After the restart, I found one strange line in the logs that helped me push it further.

...
Sep 21 11:20:37 s313.iad1.kentik.com nomad-client[2151136]:     2023-09-21T11:20:37.774Z [WARN]  client.alloc_runner.task_runner.task_hook.api: error creating task api socket: alloc_id=5acf0daf-60ce-48e5-d304-a5ae2d2b733f task=controlplane-agent path=/data/service/nomad-client/data/alloc/5acf0daf-60ce-48e5-d304-a5ae2d2b733f/controlplane-agent/secrets/api.sock error="listen unix /data/service/nomad-client/data/alloc/5acf0daf-60ce-48e5-d304-a5ae2d2b733f/controlplane-agent/secrets/api.sock: bind: invalid argument"
...
Sep 21 11:20:47 s313.iad1.kentik.com nomad-client[2151136]:     2023-09-21T11:20:47.626Z [INFO]  client: node registration complete
# and it was stuck here in initializing

After killing this alloc, things went even crazier; it seems like the whole nomad-client died?

Sep 21 11:24:06 s313.iad1.kentik.com nomad-client[2151136]:     2023-09-21T11:24:06.597Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=5acf0daf-60ce-48e5-d304-a5ae2d2b733f task=controlplane-agent type=Killing msg="Sent interrupt. Waiting 30s before force killing" failed=false
Sep 21 11:24:07 s313.iad1.kentik.com nomad-client[2151136]:     2023-09-21T11:24:07.099Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=5acf0daf-60ce-48e5-d304-a5ae2d2b733f task=controlplane-agent type=Terminated msg="Exit Code: 0" failed=false
Sep 21 11:24:07 s313.iad1.kentik.com nomad-client[2151136]: 2023-09-21T11:24:07.100Z [DEBUG] plugin.stdio: received EOF, stopping recv loop: err="rpc error: code = Unavailable desc = error reading from server: EOF"
Sep 21 11:24:07 s313.iad1.kentik.com nomad-client[2151136]:     2023-09-21T11:24:07.321Z [WARN]  client.driver_mgr.docker: RemoveImage on non-referenced counted image id: driver=docker image_id=sha256:c58ea28eb4970ac295709fd9f65509b125ae168fd8246c822666c5a51b5b748e
Sep 21 11:24:07 s313.iad1.kentik.com nomad-client[2151136]:     2023-09-21T11:24:07.321Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=5acf0daf-60ce-48e5-d304-a5ae2d2b733f task=controlplane-agent type=Killed msg="Task successfully killed" failed=false
Sep 21 11:24:07 s313.iad1.kentik.com nomad-client[2151136]:     2023-09-21T11:24:07.418Z [WARN]  client: missed heartbeat: req_latency=47.740109ms heartbeat_ttl=19.521786646s since_last_heartbeat=838.741968ms
Sep 21 11:24:07 s313.iad1.kentik.com nomad-client[2151136]:     2023-09-21T11:24:07.518Z [INFO]  agent: (runner) stopping
Sep 21 11:24:07 s313.iad1.kentik.com nomad-client[2151136]:     2023-09-21T11:24:07.518Z [INFO]  agent: (runner) received finish
Sep 21 11:24:07 s313.iad1.kentik.com nomad-client[2151136]:     2023-09-21T11:24:07.518Z [INFO]  agent: (runner) creating new runner (dry: false, once: false)
Sep 21 11:24:07 s313.iad1.kentik.com nomad-client[2151136]:     2023-09-21T11:24:07.522Z [INFO]  agent: (runner) creating watcher
Sep 21 11:24:07 s313.iad1.kentik.com nomad-client[2151136]:     2023-09-21T11:24:07.522Z [INFO]  agent: (runner) starting
Sep 21 11:24:07 s313.iad1.kentik.com nomad-client[2151136]: 2023-09-21T11:24:07.651Z [DEBUG] plugin: reattached plugin process exited
Sep 21 11:24:07 s313.iad1.kentik.com nomad-client[2151136]: 2023-09-21T11:24:07.651Z [DEBUG] plugin: plugin exited
Sep 21 11:24:07 s313.iad1.kentik.com nomad-client[2151136]:     2023-09-21T11:24:07.770Z [INFO]  agent: (runner) stopping
Sep 21 11:24:07 s313.iad1.kentik.com nomad-client[2151136]:     2023-09-21T11:24:07.770Z [INFO]  client.gc: marking allocation for GC: alloc_id=5acf0daf-60ce-48e5-d304-a5ae2d2b733f
Sep 21 11:24:07 s313.iad1.kentik.com nomad-client[2151136]:     2023-09-21T11:24:07.770Z [INFO]  agent: (runner) received finish
Sep 21 11:24:07 s313.iad1.kentik.com nomad-client[2151136]:     2023-09-21T11:24:07.997Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=1236b8fc-9c9e-84bc-e561-39bdbc163810 task=controlplane-agent type=Received msg="Task received by client" failed=false
Sep 21 11:24:08 s313.iad1.kentik.com nomad-client[2151136]:     2023-09-21T11:24:08.019Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=1236b8fc-9c9e-84bc-e561-39bdbc163810 task=controlplane-agent type="Task Setup" msg="Building Task Directory" failed=false
Sep 21 11:24:08 s313.iad1.kentik.com nomad-client[2151136]:     2023-09-21T11:24:08.161Z [WARN]  client.alloc_runner.task_runner.task_hook.api: error creating task api socket: alloc_id=1236b8fc-9c9e-84bc-e561-39bdbc163810 task=controlplane-agent path=/data/service/nomad-client/data/alloc/1236b8fc-9c9e-84bc-e561-39bdbc163810/controlplane-agent/secrets/api.sock error="listen unix /data/service/nomad-client/data/alloc/1236b8fc-9c9e-84bc-e561-39bdbc163810/controlplane-agent/secrets/api.sock: bind: invalid argument"
Sep 21 11:24:08 s313.iad1.kentik.com nomad-client[2151136]:     2023-09-21T11:24:08.484Z [INFO]  agent: (runner) creating new runner (dry: false, once: false)
Sep 21 11:24:08 s313.iad1.kentik.com nomad-client[2151136]:     2023-09-21T11:24:08.485Z [INFO]  agent: (runner) creating watcher
Sep 21 11:24:08 s313.iad1.kentik.com nomad-client[2151136]:     2023-09-21T11:24:08.485Z [INFO]  agent: (runner) starting
Sep 21 11:24:08 s313.iad1.kentik.com nomad-client[2151136]:     2023-09-21T11:24:08.491Z [INFO]  agent: (runner) rendered "(dynamic)" => "/data/service/nomad-client/data/alloc/1236b8fc-9c9e-84bc-e561-39bdbc163810/controlplane-agent/local/extra.env"
Sep 21 11:24:08 s313.iad1.kentik.com nomad-client[2151136]:     2023-09-21T11:24:08.496Z [INFO]  agent: (runner) rendered "(dynamic)" => "/data/service/nomad-client/data/alloc/1236b8fc-9c9e-84bc-e561-39bdbc163810/controlplane-agent/conf/lightstep.env"
Sep 21 11:24:08 s313.iad1.kentik.com nomad-client[2151136]:     2023-09-21T11:24:08.505Z [INFO]  agent: (runner) rendered "(dynamic)" => "/data/service/nomad-client/data/alloc/1236b8fc-9c9e-84bc-e561-39bdbc163810/controlplane-agent/secrets/service.env"
Sep 21 11:24:08 s313.iad1.kentik.com nomad-client[2151136]:     2023-09-21T11:24:08.507Z [INFO]  agent: (runner) rendered "(dynamic)" => "/data/service/nomad-client/data/alloc/1236b8fc-9c9e-84bc-e561-39bdbc163810/controlplane-agent/secrets/lightstep.env"
Sep 21 11:24:09 s313.iad1.kentik.com nomad-client[2151136]:     2023-09-21T11:24:09.371Z [INFO]  client.driver_mgr.docker: created container: driver=docker container_id=334657c3b0c88c5d7b54f60d9e7767cf757d016811c9524796b16952fa3603d7
Sep 21 11:24:09 s313.iad1.kentik.com nomad-client[2151136]:     2023-09-21T11:24:09.622Z [INFO]  client.driver_mgr.docker: started container: driver=docker container_id=334657c3b0c88c5d7b54f60d9e7767cf757d016811c9524796b16952fa3603d7
Sep 21 11:24:09 s313.iad1.kentik.com nomad-client[2151136]:     2023-09-21T11:24:09.664Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=1236b8fc-9c9e-84bc-e561-39bdbc163810 task=controlplane-agent type=Started msg="Task started by client" failed=false
Sep 21 11:24:14 s313.iad1.kentik.com nomad-client[2151136]: ==> Caught signal: interrupt
Sep 21 11:24:14 s313.iad1.kentik.com nomad-client[2151136]:     2023-09-21T11:24:14.545Z [INFO]  agent: requesting shutdown
Sep 21 11:24:14 s313.iad1.kentik.com nomad-client[2151136]:     2023-09-21T11:24:14.545Z [INFO]  client: shutting down
Sep 21 11:24:14 s313.iad1.kentik.com systemd[1]: Stopping Nomad Client Service...
Sep 21 11:24:14 s313.iad1.kentik.com nomad-client[2151136]:     2023-09-21T11:24:14.803Z [INFO]  client.plugin: shutting down plugin manager: plugin-type=device
Sep 21 11:24:14 s313.iad1.kentik.com nomad-client[2151136]:     2023-09-21T11:24:14.803Z [INFO]  client.plugin: plugin manager finished: plugin-type=device
Sep 21 11:24:14 s313.iad1.kentik.com nomad-client[2151136]:     2023-09-21T11:24:14.803Z [INFO]  client.plugin: shutting down plugin manager: plugin-type=driver
Sep 21 11:24:14 s313.iad1.kentik.com nomad-client[2151136]:     2023-09-21T11:24:14.948Z [INFO]  client.plugin: plugin manager finished: plugin-type=driver
Sep 21 11:24:14 s313.iad1.kentik.com nomad-client[2151136]:     2023-09-21T11:24:14.948Z [INFO]  client.plugin: shutting down plugin manager: plugin-type=csi
Sep 21 11:24:14 s313.iad1.kentik.com nomad-client[2151136]:     2023-09-21T11:24:14.948Z [INFO]  client.plugin: plugin manager finished: plugin-type=csi
Sep 21 11:24:14 s313.iad1.kentik.com nomad-client[2151136]:     2023-09-21T11:24:14.998Z [INFO]  agent: shutdown complete
Sep 21 11:24:15 s313.iad1.kentik.com systemd[1]: nomad-client.service: Main process exited, code=exited, status=1/FAILURE
Sep 21 11:24:15 s313.iad1.kentik.com systemd[1]: nomad-client.service: Failed with result 'exit-code'.
...
Restarted...

And after that it started up properly.

tgross commented 9 months ago

Hi @bernardoVale! The SIGQUIT causes a goroutine dump, which is what I was looking for, so thanks for that. I'll get a chance to dig into this in detail later today.

But at a quick first glance something jumps out at me: this node wasn't stuck because of a client restart, right? The dump shows a bunch of the core goroutines (like the signal handler itself) in "select" for 4231 minutes, which means that's how long the agent had been running at the time it got the SIGQUIT. Did the server have it in the initializing state at that point?

Also, if you know roughly how long it was "stuck" at the point you took this, that might help narrow down the culprit. Even a resolution like "roughly 5 minutes" or "30 minutes" would be super helpful.

bernardoVale commented 9 months ago

But a quick first glance something jumps out at me: this node wasn't stuck because of a client restart, right?

No, it did not restart. You can see the heartbeat failure log plus the ACL failure logs (this is me trying to take a goroutine dump):

Sep 20 12:29:49 c111.iad1.kentik.com nomad-client[1944259]: ==> Newer Nomad version available: 1.6.2 (currently running: 1.6.1)
Sep 21 02:01:09 c111.iad1.kentik.com nomad-client[1944259]:     2023-09-21T02:01:09.489Z [WARN]  client: missed heartbeat: req_latency=7.228871192s heartbeat_ttl=17.513577006s since_last_heartbeat=24.743407106s
Sep 21 08:03:16 c111.iad1.kentik.com nomad-client[1944259]: ==> Newer Nomad version available: 1.6.2 (currently running: 1.6.1)
Sep 21 10:53:19 c111.iad1.kentik.com nomad-client[1944259]:     2023-09-21T10:53:19.392Z [ERROR] client.rpc: error performing RPC to server: error="rpc error: Permission denied" rpc=Agent.Profile server=10.2.1.97:4647
Sep 21 10:53:19 c111.iad1.kentik.com nomad-client[1944259]:     2023-09-21T10:53:19.392Z [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: Permission denied" rpc=Agent.Profile server=10.2.1.97:4647
Sep 21 10:53:51 c111.iad1.kentik.com nomad-client[1944259]:     2023-09-21T10:53:51.617Z [ERROR] client.rpc: error performing RPC to server: error="rpc error: Permission denied" rpc=Agent.Profile server=10.3.109.25:4647
Sep 21 10:53:51 c111.iad1.kentik.com nomad-client[1944259]:     2023-09-21T10:53:51.618Z [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: Permission denied" rpc=Agent.Profile server=10.3.109.25:4647
Sep 21 10:54:39 c111.iad1.kentik.com nomad-client[1944259]:     2023-09-21T10:54:39.509Z [ERROR] client.rpc: error performing RPC to server: error="rpc error: Permission denied" rpc=Agent.Profile server=10.3.107.19:4647
Sep 21 10:54:39 c111.iad1.kentik.com nomad-client[1944259]:     2023-09-21T10:54:39.509Z [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: Permission denied" rpc=Agent.Profile server=10.3.107.19:4647

Also, if you know roughly how long it was "stuck" at the point you took this, that might help narrow down the culprit. Even a resolution like "roughly 5 minutes" or "30 minutes" would be super helpful.

We have a service collecting node status by issuing a blocking query on /v1/nodes. The first report of the stuck state was at 2023-09-21 02:02 UTC, and the node got unstuck at 11:06 UTC.
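
For reference, a blocking query against that endpoint long-polls by passing the last seen index back via the index query parameter; a minimal sketch, with the agent address and wait time chosen arbitrarily:

package main

import (
	"fmt"
	"log"
	"net/http"
)

func main() {
	index := "0" // last X-Nomad-Index seen; 0 returns immediately the first time
	for {
		// The request blocks for up to the wait duration and returns early
		// if the node list changes past the given index.
		url := fmt.Sprintf("http://127.0.0.1:4646/v1/nodes?index=%s&wait=5m", index)
		resp, err := http.Get(url)
		if err != nil {
			log.Fatal(err)
		}
		resp.Body.Close()

		// Reuse the result's raft index so the next request only wakes up
		// when something actually changed.
		index = resp.Header.Get("X-Nomad-Index")
		fmt.Println("node list changed, index now", index)
	}
}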

lgfa29 commented 9 months ago

Hi @bernardoVale 👋

Do you have jobs with max_client_disconnect set? Or more specifically, do you see these nodes going to the disconnected status?

The reason I ask is that this is how we expect nodes to transition between all possible statuses: https://github.com/hashicorp/nomad/blob/e3c8700ded891702b7f94636033109efd8f71c3a/nomad/node_endpoint.go#L516-L534

So for a node to be stuck in the initializing status after previously being ready, it must have transitioned to disconnected and then not updated its allocations with the server 🤔
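
A paraphrased sketch of that gate (the type, field, and constant names below are illustrative, not the actual Nomad source): a previously disconnected node that reports itself as ready is held in initializing until the server has recorded an allocation update from it that is newer than the missed heartbeat.

package main

import "fmt"

// Illustrative stand-ins for the node fields involved in the transition; see
// the linked node_endpoint.go for the real logic.
type node struct {
	Status                   string
	LastAllocUpdateIndex     uint64
	LastMissedHeartbeatIndex uint64
}

const (
	statusInit         = "initializing"
	statusReady        = "ready"
	statusDisconnected = "disconnected"
)

// effectiveStatus holds a reconnecting node in "initializing" until its alloc
// updates catch up with the disconnect.
func effectiveStatus(n node, reported string) string {
	if reported == statusReady &&
		n.Status == statusDisconnected &&
		n.LastAllocUpdateIndex <= n.LastMissedHeartbeatIndex {
		return statusInit
	}
	return reported
}

func main() {
	stuck := node{
		Status:                   statusDisconnected,
		LastAllocUpdateIndex:     100, // no alloc update received since the disconnect
		LastMissedHeartbeatIndex: 200,
	}
	fmt.Println(effectiveStatus(stuck, statusReady)) // prints "initializing"
}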

bernardoVale commented 9 months ago

Do you have jobs with max_client_disconnect set?

Yes @lgfa29, we are setting max_client_disconnect on pretty much all jobs.

Or more specifically, do you see these nodes going to the disconnected status?

I can't confirm if they are transitioning to disconnected, but we do see heartbeat failures every time a node gets stuck. So your theory makes sense to me.

lgfa29 commented 7 months ago

Sorry for the delay here @bernardoVale, Nomad 1.7 has kept us quite busy these past few weeks.

I've also been looking into this in more detail, but I'm still not sure how the client can get stuck in this situation.

Next time it happens, would you be able to collect DEBUG level logs from the client that is stuck? You can use the nomad monitor -log-level=DEBUG -node-id=... command to do this.

More specifically, I'm looking for lines containing messages like these:

If you could also monitor the cluster leader, it would be helpful to look for messages like this:

But the more logs you can collect and provide, the better. Feel free to email them to us as well.

One more thing to confirm is whether nomad node status returns any allocations for the node.

You did send output earlier that seemed to indicate that the node didn't have any allocs, but I'm not sure if that was actually the case or if the output was just cut off. If you could retrieve that list and the alloc statuses, that would be helpful.

With that said, I will dump my research here to help us track progress on this issue. Feel free to ignore everything below this (or follow along if you're curious 😄).


The process when a client reconnects to a server is quite complex. The happy path order of events would be something like this:

  1. Client sends a Node.UpdateStatus request to the server indicating its status as ready.
  2. Server keeps the node status as initializing because it needs to wait for the client to send the latest allocation status.
  3. Client sends a Node.UpdateAlloc request to the server with the latest batch of changes.
  4. Server updates the last time the client updated its allocations.
  5. Client sends a Node.UpdateStatus request to the server indicating its status as ready again (this is periodic).
  6. Server now transitions client to ready.

So, in order for the client to be stuck in initializing it must have been able to send a Node.UpdateStatus request, but it's not sending a Node.UpdateAlloc.

Another reason may be that the node just doesn't have any allocations, so there are no updates to send. But this case is already handled by the server. https://github.com/hashicorp/nomad/blob/515895c7690cdc72278018dc5dc58aca41204ccc/nomad/node_endpoint.go#L605-L608

Looking at the code, the only reason why a client wouldn't send Node.UpdateAlloc is if there's nothing to send. https://github.com/hashicorp/nomad/blob/515895c7690cdc72278018dc5dc58aca41204ccc/client/client.go#L2192-L2195

But this was something I had considered in the work I did in this area, and it didn't seem possible to not have an update here on reconnect.

The reason is that, at the same time the operations 1-6 above are happening, the client also fetches its allocations from the server, and any allocation with a modify index bigger than what the client currently has is considered to be updated. https://github.com/hashicorp/nomad/blob/515895c7690cdc72278018dc5dc58aca41204ccc/client/client.go#L2356-L2366

And my assumption is that there should always be at least one update because the allocation with max_client_disconnect has to have transitioned from running to unknown while the client was disconnected, so upon reconnect, the alloc modify index must have been updated.
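
A small sketch of that comparison (stub types, not the real structs; the actual code is in the linked client.go): anything whose server-side AllocModifyIndex is newer than the copy the client already holds counts as updated.

package main

import "fmt"

// allocStub stands in for the per-allocation view the client compares against.
type allocStub struct {
	ID               string
	AllocModifyIndex uint64
}

// updatedAllocs returns the IDs whose modify index moved past what the client
// already knows, which is what should happen when the server flips a
// disconnected alloc from running to unknown.
func updatedAllocs(serverView []allocStub, known map[string]uint64) []string {
	var updated []string
	for _, a := range serverView {
		if idx, ok := known[a.ID]; !ok || a.AllocModifyIndex > idx {
			updated = append(updated, a.ID)
		}
	}
	return updated
}

func main() {
	server := []allocStub{{ID: "ed544457", AllocModifyIndex: 42}}
	known := map[string]uint64{"ed544457": 10} // index from before the disconnect
	fmt.Println(updatedAllocs(server, known))  // prints [ed544457]
}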

Now, looking closer at the Client.watchAllocations() OUTER loop, there are several places where a continue may happen and prevent an alloc update from being registered. Some of them are failed RPC requests, which seem a bit unlikely to be happening.

One interesting case is when Node.GetClientAllocs fails because of a no servers error. We unfortunately don't log when that happens, but it seems unlikely in this case since the client seems to have been able to call Node.UpdateStatus. https://github.com/hashicorp/nomad/blob/515895c7690cdc72278018dc5dc58aca41204ccc/client/client.go#L2299-L2326

The other continues would log one of the lines I sent above, except for this one which I'm not sure when it could happen. https://github.com/hashicorp/nomad/blob/515895c7690cdc72278018dc5dc58aca41204ccc/client/client.go#L2401-L2413

Then, we finally reach the updated allocations log line I mentioned above. If this line is logged, then it's guaranteed that we have an update to send to the server.

This update triggers a long chain of function calls that I will just link here without much context. This is a path I may need to investigate further to find ways it could be cut short.

The update calls runAllocs. https://github.com/hashicorp/nomad/blob/515895c7690cdc72278018dc5dc58aca41204ccc/client/client.go#L1847-L1859

runAllocs diffs the allocation in its current state and what was received by the server. https://github.com/hashicorp/nomad/blob/515895c7690cdc72278018dc5dc58aca41204ccc/client/client.go#L2495-L2498

We expect the allocation to have had its modify index updated, so it should be in pulled and pass this condition, being added to updated. https://github.com/hashicorp/nomad/blob/515895c7690cdc72278018dc5dc58aca41204ccc/client/util.go#L44-L49

Updated allocs trigger a call to c.updateAlloc. https://github.com/hashicorp/nomad/blob/515895c7690cdc72278018dc5dc58aca41204ccc/client/client.go#L2507-L2510

Since the allocation is reconnecting, we expect Client.updateAlloc() to call ar.Reconnect(). https://github.com/hashicorp/nomad/blob/515895c7690cdc72278018dc5dc58aca41204ccc/client/client.go#L2622-L2639

And ar.Reconnect() eventually generates a pending update. https://github.com/hashicorp/nomad/blob/515895c7690cdc72278018dc5dc58aca41204ccc/client/allocrunner/alloc_runner.go#L1377-L1378 https://github.com/hashicorp/nomad/blob/515895c7690cdc72278018dc5dc58aca41204ccc/client/client.go#L2167

These pending updates are batched. https://github.com/hashicorp/nomad/blob/515895c7690cdc72278018dc5dc58aca41204ccc/client/client.go#L2187-L2190

And if we look at the code for pendingClientUpdates.nextBatch we see that there are cases where nil is returned. https://github.com/hashicorp/nomad/blob/515895c7690cdc72278018dc5dc58aca41204ccc/client/client.go#L3397-L3419
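
A condensed paraphrase of those cases (illustrative names, not the real code): the batcher returns nothing when there are no pending updates, when the client has not yet contacted a server, or on four out of every five ticks.

package main

import "fmt"

// pendingUpdates is an illustrative stand-in for pendingClientUpdates; only
// the early-return structure matters here.
type pendingUpdates struct {
	updates map[string]struct{} // pending alloc updates keyed by alloc ID
}

// nextBatch mirrors the three nil returns discussed below: no pending
// updates, servers not yet contacted, or a tick that is skipped for batching.
func (p *pendingUpdates) nextBatch(serversContacted <-chan struct{}, updateTicks int) []string {
	// First return: nothing pending.
	if len(p.updates) == 0 {
		return nil
	}
	// Second return: hold everything until the client has talked to a server
	// at least once (the channel is closed elsewhere, in runAllocs).
	select {
	case <-serversContacted:
	default:
		return nil
	}
	// Last return: only flush on every fifth tick so updates get batched.
	if updateTicks%5 != 0 {
		return nil
	}
	batch := make([]string, 0, len(p.updates))
	for id := range p.updates {
		batch = append(batch, id)
	}
	p.updates = make(map[string]struct{})
	return batch
}

func main() {
	contacted := make(chan struct{})
	p := &pendingUpdates{updates: map[string]struct{}{"ed544457": {}}}
	fmt.Println(p.nextBatch(contacted, 5)) // prints [] because no server has been contacted yet
	close(contacted)
	fmt.Println(p.nextBatch(contacted, 5)) // prints [ed544457]
}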

The last one can be ignored because updateTicks is always incremented, so it's just used to skip four out of every five updates, meaning the fifth one will generate updates. The first return happens when there are no updates, but we've just discussed how we do expect to have at least one update.

The second return is an interesting one, as it blocks updates from happening until the server is contacted. That channel is unblocked https://github.com/hashicorp/nomad/blob/515895c7690cdc72278018dc5dc58aca41204ccc/client/client.go#L3406-L3412

Which happens way back in runAllocs(), which, at this point, we should have already passed through. https://github.com/hashicorp/nomad/blob/515895c7690cdc72278018dc5dc58aca41204ccc/client/client.go#L2532-L2536

The fact we've already unblocked this channel is further corroborated by the stack dump you sent us.

goroutine 1614 [select, 2 minutes]:
runtime.gopark(0xc001baff98?, 0x2?, 0xc0?, 0x84?, 0xc001baff74?)
        runtime/proc.go:381 +0xd6 fp=0xc001bafe00 sp=0xc001bafde0 pc=0x440796
runtime.selectgo(0xc001baff98, 0xc001baff70, 0x4?, 0x0, 0xc?, 0x1)
        runtime/select.go:327 +0x7be fp=0xc001baff40 sp=0xc001bafe00 pc=0x45103e
github.com/hashicorp/nomad/client.(*Client).run(0xc000671500)
        github.com/hashicorp/nomad/client/client.go:1847 +0xf3 fp=0xc001baffc8 sp=0xc001baff40 pc=0x1c3e393
github.com/hashicorp/nomad/client.NewClient.func8()
        github.com/hashicorp/nomad/client/client.go:582 +0x26 fp=0xc001baffe0 sp=0xc001baffc8 pc=0x1c35866
runtime.goexit()
        runtime/asm_amd64.s:1598 +0x1 fp=0xc001baffe8 sp=0xc001baffe0 pc=0x4754c1
created by github.com/hashicorp/nomad/client.NewClient
        github.com/hashicorp/nomad/client/client.go:582 +0x24ad

This goroutine has been sitting in this select for 2 minutes, but others have been running for thousands of minutes, so runAllocs did get called several times. https://github.com/hashicorp/nomad/blob/515895c7690cdc72278018dc5dc58aca41204ccc/client/client.go#L1847-L1859

The log lines mentioned earlier will help us narrow down the code path that is being executed (or not).

Looking back, this flow relies on two main assumptions:

  1. When the client disconnects, allocations with max_client_disconnect have their ModifyIndex bumped by the servers. It seems like a fair assumption, and I have observed it in my tests.
  2. The client retrieves this status change and queues an update to send back to the server. Also seems like a fair assumption, otherwise unknown allocs would never be updated, but I need to spend more time investigating this code path to find ways this could not happen.

The second assumption has some interesting possibilities to investigate, such as:

One way around this problem could be a mechanism where the server forces the client to send alloc updates when the server is waiting for them but the client has none pending.

I had thought about this in previous work, but given all of the above I thought it would not be necessary. Here's a draft commit of how that would look: https://github.com/hashicorp/nomad/commit/395f6c64fabd5b52faaa13deff630380143ffad6. TestClient_Reconnect attempts to reproduce the issue, but it passes even without the code changes (and it's somehow a bit flaky, with the node considered down instead of disconnected; not sure why 😅).

One question with this approach is what to sync. I opted for sending all allocs, since sending zero allocs is not supported by the endpoint. https://github.com/hashicorp/nomad/blob/515895c7690cdc72278018dc5dc58aca41204ccc/nomad/node_endpoint.go#L1337-L1340

But we could also send a single random alloc just to trigger a node update. I also thought about a special endpoint for this scenario, but adding more RPCs and complexity in this code path sounded like a bad idea.

bernardoVale commented 2 months ago

Hi @lgfa29. Sorry for the delay here. Back in November we added a pipeline that deploys a test service to all hosts, and I don't know why, but since then we haven't seen clients getting stuck as often. This weekend I had to disable the job for another reason, and we got one node stuck.

Next time it happens, would you be able to collect DEBUG level logs from the client that is stuck? You can use the nomad monitor -log-level=DEBUG -node-id=... command to do this.

stuck-node-debug.log

If you could also monitor the cluster leader, it would be helpful to look for messages like this: marking node as initializing due to outdated allocation information

The only log I found was this (66ad8402-d25f-6e97-f037-ba1b00f53059 is the stuck node):

Apr 13 15:15:03 s301.bct1.kentik.com nomad-server[1298451]:     2024-04-13T15:15:03.926Z [INFO]  nomad.raft: starting snapshot up to: index=5126080
Apr 13 15:15:03 s301.bct1.kentik.com nomad-server[1298451]:     2024-04-13T15:15:03.926Z [INFO]  snapshot: creating new snapshot: path=/data/service/nomad-server/data/server/raft/snapshots/48678-5126080-1713021303926.tmp
Apr 13 15:15:04 s301.bct1.kentik.com nomad-server[1298451]:     2024-04-13T15:15:04.080Z [INFO]  snapshot: reaping snapshot: path=/data/service/nomad-server/data/server/raft/snapshots/48678-5109695-1712919225489
Apr 13 15:15:04 s301.bct1.kentik.com nomad-server[1298451]:     2024-04-13T15:15:04.081Z [INFO]  nomad.raft: compacting logs: from=5107649 to=5115840
Apr 13 15:15:04 s301.bct1.kentik.com nomad-server[1298451]:     2024-04-13T15:15:04.932Z [INFO]  nomad.raft: snapshot complete up to: index=5126080
Apr 14 01:48:18 s301.bct1.kentik.com nomad-server[1298451]:     2024-04-14T01:48:18.050Z [WARN]  nomad.heartbeat: node TTL expired: node_id=c8a0c1bf-360a-3ebf-016d-6a212824700e
Apr 14 01:48:18 s301.bct1.kentik.com nomad-server[1298451]:     2024-04-14T01:48:18.983Z [INFO]  nomad: plan for node rejected, refer to https://www.nomadproject.io/s/port-plan-failure for more information: node_id=c8a0c1bf-360a-3ebf-016d-6a212824700e reason="node is not ready for placements" eval_id=73f8388e-1d85-57>
Apr 14 01:48:18 s301.bct1.kentik.com nomad-server[1298451]:     2024-04-14T01:48:18.983Z [INFO]  nomad: plan for node rejected, refer to https://www.nomadproject.io/s/port-plan-failure for more information: node_id=c8a0c1bf-360a-3ebf-016d-6a212824700e reason="node is not ready for placements" eval_id=54145bca-d7d1-a9>
Apr 14 01:48:18 s301.bct1.kentik.com nomad-server[1298451]:     2024-04-14T01:48:18.983Z [INFO]  nomad: plan for node rejected, refer to https://www.nomadproject.io/s/port-plan-failure for more information: node_id=c8a0c1bf-360a-3ebf-016d-6a212824700e reason="node is not ready for placements" eval_id=53299602-8d3f-18>
Apr 14 01:48:18 s301.bct1.kentik.com nomad-server[1298451]:     2024-04-14T01:48:18.983Z [INFO]  nomad: plan for node rejected, refer to https://www.nomadproject.io/s/port-plan-failure for more information: node_id=c8a0c1bf-360a-3ebf-016d-6a212824700e reason="node is not ready for placements" eval_id=4e306059-4e1f-72>
Apr 14 01:48:19 s301.bct1.kentik.com nomad-server[1298451]:     2024-04-14T01:48:19.152Z [INFO]  nomad: plan for node rejected, refer to https://www.nomadproject.io/s/port-plan-failure for more information: node_id=c8a0c1bf-360a-3ebf-016d-6a212824700e reason="node is not ready for placements" eval_id=30b3b301-9b4f-41>
Apr 14 01:48:19 s301.bct1.kentik.com nomad-server[1298451]:     2024-04-14T01:48:19.152Z [INFO]  nomad: plan for node rejected, refer to https://www.nomadproject.io/s/port-plan-failure for more information: node_id=c8a0c1bf-360a-3ebf-016d-6a212824700e reason="node is not ready for placements" eval_id=30b3b301-9b4f-41>
Apr 14 01:48:19 s301.bct1.kentik.com nomad-server[1298451]:     2024-04-14T01:48:19.153Z [INFO]  nomad: plan for node rejected, refer to https://www.nomadproject.io/s/port-plan-failure for more information: node_id=c8a0c1bf-360a-3ebf-016d-6a212824700e reason="node is not ready for placements" eval_id=30b3b301-9b4f-41>
Apr 14 01:48:19 s301.bct1.kentik.com nomad-server[1298451]:     2024-04-14T01:48:19.153Z [INFO]  nomad: plan for node rejected, refer to https://www.nomadproject.io/s/port-plan-failure for more information: node_id=c8a0c1bf-360a-3ebf-016d-6a212824700e reason="node is not ready for placements" eval_id=30b3b301-9b4f-41>
Apr 14 01:48:19 s301.bct1.kentik.com nomad-server[1298451]:     2024-04-14T01:48:19.154Z [INFO]  nomad: plan for node rejected, refer to https://www.nomadproject.io/s/port-plan-failure for more information: node_id=c8a0c1bf-360a-3ebf-016d-6a212824700e reason="node is not ready for placements" eval_id=30b3b301-9b4f-41>
Apr 14 03:08:26 s301.bct1.kentik.com nomad-server[1298451]:     2024-04-14T03:08:26.179Z [WARN]  nomad.heartbeat: node TTL expired: node_id=c8a0c1bf-360a-3ebf-016d-6a212824700e

One more thing to confirm is that nomad node status returns any allocation for the node.

Yes, it reports all allocs:

ID              = 66ad8402-d25f-6e97-f037-ba1b00f53059
Name            = s402.bct1.kentik.com
Node Pool       = default
Class           = storage
DC              = bct1
Drain           = false
Eligibility     = eligible
Status          = initializing
CSI Controllers = <none>
CSI Drivers     = <none>
Uptime          = 11682h16m42s
Host Volumes    = blob-storage
Host Networks   = <none>
CSI Volumes     = <none>
Driver Status   = docker,exec

Node Events
Time                  Subsystem  Message
2024-04-13T04:18:20Z  Cluster    Node heartbeat missed
2024-04-08T08:58:26Z  Cluster    Node heartbeat missed
2024-02-21T06:55:20Z  Cluster    Node heartbeat missed
2024-01-11T11:52:51Z  Cluster    Node heartbeat missed
2023-05-30T12:02:37Z  Cluster    Node re-registered
2023-05-30T11:47:21Z  Cluster    Node heartbeat missed
2022-11-22T14:25:33Z  Cluster    Node registered

Allocated Resources
CPU             Memory          Disk
780/196000 MHz  11 GiB/122 GiB  2.5 GiB/3.5 TiB

Allocation Resource Utilization
CPU              Memory
2121/196000 MHz  19 GiB/122 GiB

Host Resource Utilization
CPU              Memory           Disk
3833/196000 MHz  161 GiB/252 GiB  8.3 GiB/3.5 TiB

Allocations
ID        Node ID   Task Group          Version  Desired  Status   Created    Modified
d207d5d2  66ad8402  cd-functest         18677    run      running  2d13h ago  2d13h ago
28e95020  66ad8402  cd-functest-system  2370     run      running  2d13h ago  2d13h ago
2f83d2b4  66ad8402  kde-worker-a        18       run      running  6d7h ago   6d7h ago
36726adc  66ad8402  kde-worker-b        11       run      running  6d8h ago   6d8h ago
295e6345  66ad8402  kde-planner         52       run      running  9d11h ago  9d11h ago
c86e7381  66ad8402  kde-cache-subquery  5        run      running  10d8h ago  7d2h ago
8e6090e1  66ad8402  controlplane-agent  16       run      running  1mo3d ago  1mo3d ago
92f8e690  66ad8402  kt-cdb-viewer       2        run      running  2mo2d ago  7d2h ago
959001aa  66ad8402  apigw-envoy-be      21       run      running  5mo7d ago  1mo24d ago

As I highlighted earlier, manually stopping an allocation unstuck the client. Here's the client debug log after I ran nomad alloc stop:

unstuck-node-debug.log