Open mpass99 opened 2 months ago
systemctl start nomad.service
systemctl restart nomad.service
The above-mentioned warnings appear. However, Nomad accepts them as discussed in hashicorp/nomad#17134 to allow in-place upgrades.
Systemd and the Nomad Process does not wait for the graceful shutdown deadline. Instead, Nomad restarts directly. As described in the linked Issue, the process of handling the workload remains running. When starting again, Nomad takes again control of the workload and stops it when the deadline is reached.
Starting Restart `Mon Sep 9 13:59:50 UTC 2024` ``` log 2024-09-09T13:59:50.853969+00:00 nomad-agent-terraform-3 systemd[1]: Stopping nomad.service - Nomad... 2024-09-09T13:59:50.854424+00:00 nomad-agent-terraform-3 nomad[2008]: ==> Caught signal: interrupt 2024-09-09T13:59:50.854528+00:00 nomad-agent-terraform-3 nomad[2008]: ==> Gracefully shutting down agent... 2024-09-09T13:59:50.863290+00:00 nomad-agent-terraform-3 nomad[2008]: 2024-09-09T13:59:50.863Z [INFO] client.drain: monitoring self-drain 2024-09-09T13:59:55.854929+00:00 nomad-agent-terraform-3 nomad[2008]: 2024-09-09T13:59:55.854Z [INFO] agent: requesting shutdown 2024-09-09T13:59:55.855082+00:00 nomad-agent-terraform-3 nomad[2008]: 2024-09-09T13:59:55.854Z [INFO] client: shutting down 2024-09-09T13:59:55.857090+00:00 nomad-agent-terraform-3 nomad[2008]: 2024-09-09T13:59:55.856Z [INFO] client.plugin: shutting down plugin manager: plugin-type=device 2024-09-09T13:59:55.857148+00:00 nomad-agent-terraform-3 nomad[2008]: 2024-09-09T13:59:55.856Z [INFO] client.plugin: plugin manager finished: plugin-type=device 2024-09-09T13:59:55.857211+00:00 nomad-agent-terraform-3 nomad[2008]: 2024-09-09T13:59:55.856Z [INFO] client.plugin: shutting down plugin manager: plugin-type=driver 2024-09-09T13:59:55.864934+00:00 nomad-agent-terraform-3 nomad[2008]: 2024-09-09T13:59:55.864Z [INFO] client.plugin: plugin manager finished: plugin-type=driver 2024-09-09T13:59:55.864986+00:00 nomad-agent-terraform-3 nomad[2008]: 2024-09-09T13:59:55.864Z [INFO] client.plugin: shutting down plugin manager: plugin-type=csi 2024-09-09T13:59:55.865024+00:00 nomad-agent-terraform-3 nomad[2008]: 2024-09-09T13:59:55.864Z [INFO] client.plugin: plugin manager finished: plugin-type=csi 2024-09-09T13:59:55.876838+00:00 nomad-agent-terraform-3 nomad[2008]: 2024-09-09T13:59:55.876Z [INFO] agent: shutdown complete 2024-09-09T13:59:55.883507+00:00 nomad-agent-terraform-3 systemd[1]: nomad.service: Main process exited, code=exited, status=1/FAILURE 2024-09-09T13:59:55.883672+00:00 nomad-agent-terraform-3 systemd[1]: nomad.service: Failed with result 'exit-code'. 2024-09-09T13:59:55.883804+00:00 nomad-agent-terraform-3 systemd[1]: nomad.service: Unit process 4096 (nomad) remains running after unit stopped. 2024-09-09T13:59:55.883882+00:00 nomad-agent-terraform-3 systemd[1]: nomad.service: Unit process 4174 (nomad) remains running after unit stopped. 2024-09-09T13:59:55.884372+00:00 nomad-agent-terraform-3 systemd[1]: Stopped nomad.service - Nomad. 2024-09-09T13:59:55.884488+00:00 nomad-agent-terraform-3 systemd[1]: nomad.service: Consumed 12.428s CPU time, 48.3M memory peak, 0B memory swap peak. 2024-09-09T13:59:55.885765+00:00 nomad-agent-terraform-3 systemd[1]: nomad.service: Found left-over process 4096 (nomad) in control group while starting unit. Ignoring. 2024-09-09T13:59:55.885858+00:00 nomad-agent-terraform-3 systemd[1]: nomad.service: This usually indicates unclean termination of a previous run, or service implementation deficiencies. 2024-09-09T13:59:55.886000+00:00 nomad-agent-terraform-3 systemd[1]: nomad.service: Found left-over process 4174 (nomad) in control group while starting unit. Ignoring. 2024-09-09T13:59:55.886079+00:00 nomad-agent-terraform-3 systemd[1]: nomad.service: This usually indicates unclean termination of a previous run, or service implementation deficiencies. 2024-09-09T13:59:55.891120+00:00 nomad-agent-terraform-3 systemd[1]: Starting nomad.service - Nomad... 2024-09-09T13:59:55.945034+00:00 nomad-agent-terraform-3 nomad[4243]: ==> Loaded configuration from /etc/nomad.d/nomad.hcl 2024-09-09T13:59:55.945199+00:00 nomad-agent-terraform-3 nomad[4243]: ==> Starting Nomad agent... 2024-09-09T13:59:57.052896+00:00 nomad-agent-terraform-3 nomad[4243]: 2024-09-09T13:59:57.052Z [TRACE] plugin.stdio: waiting for stdio data 2024-09-09T13:59:57.058520+00:00 nomad-agent-terraform-3 nomad[4243]: ==> Nomad agent configuration: 2024-09-09T13:59:57.058583+00:00 nomad-agent-terraform-3 nomad[4243]: Advertise Addrs: HTTP: 10.224.6.145:4646 2024-09-09T13:59:57.058603+00:00 nomad-agent-terraform-3 nomad[4243]: Bind Addrs: HTTP: [[::]:4646] 2024-09-09T13:59:57.058619+00:00 nomad-agent-terraform-3 nomad[4243]: Client: true 2024-09-09T13:59:57.058638+00:00 nomad-agent-terraform-3 nomad[4243]: Log Level: INFO 2024-09-09T13:59:57.058666+00:00 nomad-agent-terraform-3 nomad[4243]: Region: global (DC: dc1) 2024-09-09T13:59:57.058691+00:00 nomad-agent-terraform-3 nomad[4243]: Server: false 2024-09-09T13:59:57.058706+00:00 nomad-agent-terraform-3 nomad[4243]: Version: 1.8.3 2024-09-09T13:59:57.058724+00:00 nomad-agent-terraform-3 nomad[4243]: ==> Nomad agent started! 2024-09-09T13:59:57.063703+00:00 nomad-agent-terraform-3 nomad[4243]: 2024-09-09T13:59:57.041Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=fe1edbdd-d954-17cc-0bcd-9aec851fa87e task=default-task type=Received msg="Task received by client" failed=false 2024-09-09T14:00:06.886456+00:00 nomad-agent-terraform-3 nomad[4243]: 2024-09-09T14:00:06.886Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=fe1edbdd-d954-17cc-0bcd-9aec851fa87e task=default-task type=Killing msg="Sent interrupt. Waiting 5s before force killing" failed=false ```
Since we have leave_on_interrupt
, leave_on_terminate
, and drain_on_shutdown
configured, I would expect an actual shutdown of all Nomad-related processes (that's how I understood the issue description). However, that's not case, is it?
I would expect an actual shutdown of all Nomad-related processes (that's how I understood the issue description
I agree. That is also what the docs state.
If this value is set to true on a client agent and the client is configured with drain_on_shutdown, the client will drain its workloads before shutting down.
However, that's not the case. A restart of Nomad on an Agent takes about 6 seconds. Some subprocesses keep running after Nomad stopped and the draining is handled after Nomad started again.
Systemctl
seems to be not the cause for this as it would wait up to 90 seconds by default [1]. TimeoutStopSec
is not overwritten in Nomad's service definition.
Maybe this question will resolve in the context of hashicorp/nomad#23937.
We are waiting for a response in the upstream issue. Until then, we are blocked and this issue is pending.
Related to #612.
When restarting a Nomad Agents, we find many systemd warnings.
nomad.service: Unit process 3620 (nomad) remains running after unit stopped.
nomad.service: Found left-over process 3620 (nomad) in control group while starting unit. Ignoring.
nomad.service: This usually indicates unclean termination of a previous run, or service implementation deficiencies.