openHPI / poseidon

Scalable task execution orchestrator for CodeOcean
MIT License
9 stars 1 forks source link

Nomad Agent Restart Processes #677

Open mpass99 opened 2 months ago

mpass99 commented 2 months ago

Related to #612.

When restarting a Nomad Agents, we find many systemd warnings.

  1. Create minimal reproduction steps
  2. Investigate Nomads Graceful Shutdown behavior and if the systemd restart waits for it
  3. Create an upstream issue if appropriate
mpass99 commented 2 months ago

Minimal Reproduction

  1. Setup Ubuntu Instance
  2. Install Docker https://docs.docker.com/engine/install/ubuntu/
  3. Install Nomad https://developer.hashicorp.com/nomad/install
  4. Start Nomad systemctl start nomad.service
  5. Start some Job
  6. Restart Nomad systemctl restart nomad.service

(No) Issue

The above-mentioned warnings appear. However, Nomad accepts them as discussed in hashicorp/nomad#17134 to allow in-place upgrades.

Nomads Graceful Shutdown behavior

Systemd and the Nomad Process does not wait for the graceful shutdown deadline. Instead, Nomad restarts directly. As described in the linked Issue, the process of handling the workload remains running. When starting again, Nomad takes again control of the workload and stops it when the deadline is reached.

Nomad Logs

Starting Restart `Mon Sep 9 13:59:50 UTC 2024` ``` log 2024-09-09T13:59:50.853969+00:00 nomad-agent-terraform-3 systemd[1]: Stopping nomad.service - Nomad... 2024-09-09T13:59:50.854424+00:00 nomad-agent-terraform-3 nomad[2008]: ==> Caught signal: interrupt 2024-09-09T13:59:50.854528+00:00 nomad-agent-terraform-3 nomad[2008]: ==> Gracefully shutting down agent... 2024-09-09T13:59:50.863290+00:00 nomad-agent-terraform-3 nomad[2008]: 2024-09-09T13:59:50.863Z [INFO] client.drain: monitoring self-drain 2024-09-09T13:59:55.854929+00:00 nomad-agent-terraform-3 nomad[2008]: 2024-09-09T13:59:55.854Z [INFO] agent: requesting shutdown 2024-09-09T13:59:55.855082+00:00 nomad-agent-terraform-3 nomad[2008]: 2024-09-09T13:59:55.854Z [INFO] client: shutting down 2024-09-09T13:59:55.857090+00:00 nomad-agent-terraform-3 nomad[2008]: 2024-09-09T13:59:55.856Z [INFO] client.plugin: shutting down plugin manager: plugin-type=device 2024-09-09T13:59:55.857148+00:00 nomad-agent-terraform-3 nomad[2008]: 2024-09-09T13:59:55.856Z [INFO] client.plugin: plugin manager finished: plugin-type=device 2024-09-09T13:59:55.857211+00:00 nomad-agent-terraform-3 nomad[2008]: 2024-09-09T13:59:55.856Z [INFO] client.plugin: shutting down plugin manager: plugin-type=driver 2024-09-09T13:59:55.864934+00:00 nomad-agent-terraform-3 nomad[2008]: 2024-09-09T13:59:55.864Z [INFO] client.plugin: plugin manager finished: plugin-type=driver 2024-09-09T13:59:55.864986+00:00 nomad-agent-terraform-3 nomad[2008]: 2024-09-09T13:59:55.864Z [INFO] client.plugin: shutting down plugin manager: plugin-type=csi 2024-09-09T13:59:55.865024+00:00 nomad-agent-terraform-3 nomad[2008]: 2024-09-09T13:59:55.864Z [INFO] client.plugin: plugin manager finished: plugin-type=csi 2024-09-09T13:59:55.876838+00:00 nomad-agent-terraform-3 nomad[2008]: 2024-09-09T13:59:55.876Z [INFO] agent: shutdown complete 2024-09-09T13:59:55.883507+00:00 nomad-agent-terraform-3 systemd[1]: nomad.service: Main process exited, code=exited, status=1/FAILURE 2024-09-09T13:59:55.883672+00:00 nomad-agent-terraform-3 systemd[1]: nomad.service: Failed with result 'exit-code'. 2024-09-09T13:59:55.883804+00:00 nomad-agent-terraform-3 systemd[1]: nomad.service: Unit process 4096 (nomad) remains running after unit stopped. 2024-09-09T13:59:55.883882+00:00 nomad-agent-terraform-3 systemd[1]: nomad.service: Unit process 4174 (nomad) remains running after unit stopped. 2024-09-09T13:59:55.884372+00:00 nomad-agent-terraform-3 systemd[1]: Stopped nomad.service - Nomad. 2024-09-09T13:59:55.884488+00:00 nomad-agent-terraform-3 systemd[1]: nomad.service: Consumed 12.428s CPU time, 48.3M memory peak, 0B memory swap peak. 2024-09-09T13:59:55.885765+00:00 nomad-agent-terraform-3 systemd[1]: nomad.service: Found left-over process 4096 (nomad) in control group while starting unit. Ignoring. 2024-09-09T13:59:55.885858+00:00 nomad-agent-terraform-3 systemd[1]: nomad.service: This usually indicates unclean termination of a previous run, or service implementation deficiencies. 2024-09-09T13:59:55.886000+00:00 nomad-agent-terraform-3 systemd[1]: nomad.service: Found left-over process 4174 (nomad) in control group while starting unit. Ignoring. 2024-09-09T13:59:55.886079+00:00 nomad-agent-terraform-3 systemd[1]: nomad.service: This usually indicates unclean termination of a previous run, or service implementation deficiencies. 2024-09-09T13:59:55.891120+00:00 nomad-agent-terraform-3 systemd[1]: Starting nomad.service - Nomad... 2024-09-09T13:59:55.945034+00:00 nomad-agent-terraform-3 nomad[4243]: ==> Loaded configuration from /etc/nomad.d/nomad.hcl 2024-09-09T13:59:55.945199+00:00 nomad-agent-terraform-3 nomad[4243]: ==> Starting Nomad agent... 2024-09-09T13:59:57.052896+00:00 nomad-agent-terraform-3 nomad[4243]: 2024-09-09T13:59:57.052Z [TRACE] plugin.stdio: waiting for stdio data 2024-09-09T13:59:57.058520+00:00 nomad-agent-terraform-3 nomad[4243]: ==> Nomad agent configuration: 2024-09-09T13:59:57.058583+00:00 nomad-agent-terraform-3 nomad[4243]: Advertise Addrs: HTTP: 10.224.6.145:4646 2024-09-09T13:59:57.058603+00:00 nomad-agent-terraform-3 nomad[4243]: Bind Addrs: HTTP: [[::]:4646] 2024-09-09T13:59:57.058619+00:00 nomad-agent-terraform-3 nomad[4243]: Client: true 2024-09-09T13:59:57.058638+00:00 nomad-agent-terraform-3 nomad[4243]: Log Level: INFO 2024-09-09T13:59:57.058666+00:00 nomad-agent-terraform-3 nomad[4243]: Region: global (DC: dc1) 2024-09-09T13:59:57.058691+00:00 nomad-agent-terraform-3 nomad[4243]: Server: false 2024-09-09T13:59:57.058706+00:00 nomad-agent-terraform-3 nomad[4243]: Version: 1.8.3 2024-09-09T13:59:57.058724+00:00 nomad-agent-terraform-3 nomad[4243]: ==> Nomad agent started! 2024-09-09T13:59:57.063703+00:00 nomad-agent-terraform-3 nomad[4243]: 2024-09-09T13:59:57.041Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=fe1edbdd-d954-17cc-0bcd-9aec851fa87e task=default-task type=Received msg="Task received by client" failed=false 2024-09-09T14:00:06.886456+00:00 nomad-agent-terraform-3 nomad[4243]: 2024-09-09T14:00:06.886Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=fe1edbdd-d954-17cc-0bcd-9aec851fa87e task=default-task type=Killing msg="Sent interrupt. Waiting 5s before force killing" failed=false ```

MrSerth commented 1 month ago

Since we have leave_on_interrupt, leave_on_terminate, and drain_on_shutdown configured, I would expect an actual shutdown of all Nomad-related processes (that's how I understood the issue description). However, that's not case, is it?

mpass99 commented 1 month ago

I would expect an actual shutdown of all Nomad-related processes (that's how I understood the issue description

I agree. That is also what the docs state.

If this value is set to true on a client agent and the client is configured with drain_on_shutdown, the client will drain its workloads before shutting down.

However, that's not the case. A restart of Nomad on an Agent takes about 6 seconds. Some subprocesses keep running after Nomad stopped and the draining is handled after Nomad started again.

Systemctl seems to be not the cause for this as it would wait up to 90 seconds by default [1]. TimeoutStopSec is not overwritten in Nomad's service definition.

Maybe this question will resolve in the context of hashicorp/nomad#23937.

MrSerth commented 1 month ago

We are waiting for a response in the upstream issue. Until then, we are blocked and this issue is pending.