logmon disappears when rolling Nomad TLS certificate

vincenthuynh commented 6 months ago

Nomad version

v1.5.6

Operating system and Environment details

Debian

Issue

After rolling the TLS certificates in our Nomad cluster, all allocations have stopped logging. This is observed in the UI and using nomad alloc logs <alloc id>.

The logmon and docker_logger processes disappear on client VMs when the new certificate is reloaded via SIGHUP.

The workaround is to restart the allocations or tasks.

Reproduction steps

Update client TLS certificates

Expected Result

ps afx
...
 2959 ?        Ssl  2210:22 /usr/local/bin/nomad agent -config /etc/nomad
 3241 ?        Sl     6:12  \_ /usr/local/nomad/1.5.6/nomad logmon
 3300 ?        Sl     0:10  \_ /usr/local/nomad/1.5.6/nomad docker_logger
 4948 ?        Sl     5:59  \_ /usr/local/nomad/1.5.6/nomad logmon
 4949 ?        Sl     5:54  \_ /usr/local/nomad/1.5.6/nomad logmon
...

Actual Result

ps afx
...
 2953 ?        Ssl  2711:36 /usr/local/bin/nomad agent -config /etc/nomad
...

Observed on each client. Likely related: RPC calls to the server also stopped working:

client.rpc: error performing RPC to server: error="rpc error: EOF" rpc=Node.GetClientAllocs server=<leader ip>:4647

Nomad Client logs (if appropriate)

The following log patterns were found on the clients:

client.driver_mgr.docker.docker_logger: plugin process exited: driver=docker path=/usr/local/nomad/1.5.6/nomad pid=3791 error="signal: hangup"

client.alloc_runner.task_runner.task_hook.logmon: plugin process exited: alloc_id=<redacted> task=<redacted> path=/usr/local/nomad/1.5.6/nomad pid=11129 error="signal: hangup"

shoenig commented 5 months ago

Hi @vincenthuynh, how are you issuing the SIGHUP signal?

Are you specifying the nomad parent process ID? e.g.

➜ sudo kill -SIGHUP <pid>

Or are you using a tool like pkill to signal every process named nomad? e.g.

➜ sudo pkill -SIGHUP nomad

The latter will also send SIGHUP to the logging processes, because they are the same nomad executable run with different arguments. That would result in the behavior you're seeing.
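To illustrate the difference, here is a rough sketch that replays the "Expected Result" listing from the report as hard-coded sample data (PID and command only) and compares which PIDs a name-based match would hit versus a match on the agent alone. The awk filters are only an approximation of how pkill matches names, and the variable names are made up for this example.

```shell
#!/bin/sh
# Sample data copied from the "Expected Result" ps listing (PID and command only).
listing='2959 /usr/local/bin/nomad agent -config /etc/nomad
3241 /usr/local/nomad/1.5.6/nomad logmon
3300 /usr/local/nomad/1.5.6/nomad docker_logger
4948 /usr/local/nomad/1.5.6/nomad logmon
4949 /usr/local/nomad/1.5.6/nomad logmon'

# Approximates a name-based match such as `pkill nomad`:
# every process whose executable file is called "nomad".
by_name=$(printf '%s\n' "$listing" |
  awk '{ n = split($2, p, "/"); if (p[n] == "nomad") print $1 }')

# Agent only: executable called "nomad" AND first argument "agent".
agent_only=$(printf '%s\n' "$listing" |
  awk '{ n = split($2, p, "/"); if (p[n] == "nomad" && $3 == "agent") print $1 }')

# $by_name is left unquoted on purpose so the PIDs print on one line.
echo "name match: $(echo $by_name)"
echo "agent only: $agent_only"
```

The name-based match includes the logmon and docker_logger PIDs, which is why they receive the hangup too. Only the agent PID should be passed to kill.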

vincenthuynh commented 5 months ago

Hi @shoenig, Thanks for following up!

We are running systemctl kill -s HUP nomad.service, which would have the same behaviour as the latter example. We'll update our Ansible playbook to specify the parent process.
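For anyone landing here with the same setup: by default, systemctl kill signals every process in the unit, which includes the logging helpers. One possible fix, sketched below, is to restrict the signal to the unit's main process with --kill-who=main (newer systemd releases spell it --kill-whom=main). The function name and the SYSTEMCTL override are hypothetical, added only so the command can be dry-run. This has not been verified against our cluster.

```shell
#!/bin/sh
# Sketch: send SIGHUP only to the main process of a systemd unit.
# hup_main_only and the SYSTEMCTL override are hypothetical names for
# this example; set SYSTEMCTL=echo to print the command instead of running it.
hup_main_only() {
  unit="$1"
  # --kill-who=main limits delivery to the unit's main PID; without it,
  # systemctl kill defaults to all processes belonging to the unit.
  ${SYSTEMCTL:-systemctl} kill --kill-who=main -s HUP "$unit"
}
```

Usage would be along the lines of `hup_main_only nomad.service`, run with sufficient privileges.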

I'll close this issue.
