hashicorp / nomad

Template change_mode intermittently ignored after client restart #24140

Open ygersie opened 2 weeks ago

ygersie commented 2 weeks ago

Nomad version

1.8.2+ent

Issue

Every time we restart a large fleet of clients, some allocations end up with re-rendered dynamic certificates but are never signaled/restarted, despite having a proper change_mode set in the jobspec. This leads to production incidents: from the workload's perspective the secrets have expired (it never received a signal or restart), even though the certificates on the filesystem have been updated.

Reproduction steps

I haven't been able to reproduce this in isolation, but it happens every time we restart clients, on just a few allocations across our entire cluster. On the same node there are many other allocations that get signaled/restarted just fine.

Logs

When issuing dynamic certificates from Vault through the template stanza a new certificate is created on each client restart. This should always trigger a change_mode event but occasionally that does not happen for some allocations which leads to outages. Example here from a client log snippet:

    mynodename:~# zgrep '1f514495-e45b-1428-2ab6-b4f2c67e9a0e' /var/log/nomad.log-*.gz
    /var/log/nomad.log-20241001.gz:    2024-09-30T10:51:24.428Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=1f514495-e45b-1428-2ab6-b4f2c67e9a0e task=mytask type=Received msg="Task received by client" failed=false
    /var/log/nomad.log-20241001.gz:    2024-09-30T10:51:24.934Z [WARN]  client.alloc_runner.task_runner.task_hook.api: error creating task api socket: alloc_id=1f514495-e45b-1428-2ab6-b4f2c67e9a0e task=mytask path=/data/podman/root/nomad/data/alloc/1f514495-e45b-1428-2ab6-b4f2c67e9a0e/mytask/secrets/api.sock error="listen unix /data/podman/root/nomad/data/alloc/1f514495-e45b-1428-2ab6-b4f2c67e9a0e/mytask/secrets/api.sock: bind: invalid argument"
    /var/log/nomad.log-20241001.gz:    2024-09-30T10:51:25.119Z [INFO]  agent: (runner) rendered "(dynamic)" => "/data/podman/root/nomad/data/alloc/1f514495-e45b-1428-2ab6-b4f2c67e9a0e/mytask/secrets/client.pem"
    /var/log/nomad.log-20241001.gz:    2024-09-30T10:51:25.135Z [INFO]  agent: (runner) rendered "(dynamic)" => "/data/podman/root/nomad/data/alloc/1f514495-e45b-1428-2ab6-b4f2c67e9a0e/mytask/secrets/server.pem"

As you can see, the certificates were rendered correctly, but a re-render event was never triggered, or at least isn't shown in the log file. This workload has change_mode="restart" set, so we should have seen:

    [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=1f514495-e45b-1428-2ab6-b4f2c67e9a0e task=mytask type="Restart Signaled" msg="Template with change_mode restart re-rendered" failed=false

But it's not there.
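
For context, the shape of the template involved looks roughly like the following sketch against the Nomad Go API package. The Vault role, common name, and destination path here are hypothetical stand-ins, not our actual jobspec:

    package sketch

    import "github.com/hashicorp/nomad/api"

    func ptr(s string) *string { return &s }

    // pkiTemplate sketches a template block that renders a dynamic PKI
    // certificate from Vault with change_mode = "restart": any re-render
    // with changed contents is supposed to restart the task.
    func pkiTemplate() *api.Template {
        tmpl := `{{ with secret "pki/issue/my-role" "common_name=my-service.internal" }}
    {{ .Data.certificate }}
    {{ .Data.private_key }}{{ end }}`
        return &api.Template{
            EmbeddedTmpl: ptr(tmpl),
            DestPath:     ptr("secrets/client.pem"),
            ChangeMode:   ptr("restart"),
        }
    }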

tgross commented 2 weeks ago

Hi @ygersie! One thing that complicates debugging this is that the template runner will report the "rendering" logs you show even if there's no change to the contents. See below for an example where I've updated a Nomad Variable with a no-op change:

    2024-10-07T11:12:48.717-0400 [DEBUG] agent: (runner) receiving dependency nomad.var.block(nomad/jobs/example@default.global)
    2024-10-07T11:12:48.717-0400 [DEBUG] agent: (runner) initiating run
    2024-10-07T11:12:48.717-0400 [DEBUG] agent: (runner) checking template 99264c04aea457a2a707ad264732193f
    2024-10-07T11:12:48.718-0400 [DEBUG] agent: (runner) diffing and updating dependencies
    2024-10-07T11:12:48.718-0400 [DEBUG] agent: (runner) nomad.var.block(nomad/jobs/example@default.global) is still needed
    2024-10-07T11:12:48.718-0400 [DEBUG] agent: (runner) watching 1 dependencies
    2024-10-07T11:12:48.718-0400 [DEBUG] agent: (runner) all templates rendered
    2024-10-07T11:12:53.718-0400 [DEBUG] agent: (runner) received template "99264c04aea457a2a707ad264732193f" from quiescence
    2024-10-07T11:12:53.718-0400 [DEBUG] agent: (runner) initiating run
    2024-10-07T11:12:53.718-0400 [DEBUG] agent: (runner) checking template 99264c04aea457a2a707ad264732193f
    2024-10-07T11:12:53.720-0400 [DEBUG] agent: (runner) rendering "(dynamic)" => "/var/nomad/data/alloc/d222bce3-4bd6-2029-a6a5-742f3e662595/http/local/test.txt"
    2024-10-07T11:12:53.729-0400 [DEBUG] agent: (runner) diffing and updating dependencies
    2024-10-07T11:12:53.729-0400 [DEBUG] agent: (runner) nomad.var.block(nomad/jobs/example@default.global) is still needed
    2024-10-07T11:12:53.729-0400 [DEBUG] agent: (runner) watching 1 dependencies
    2024-10-07T11:12:53.729-0400 [DEBUG] agent: (runner) all templates rendered
    2024-10-07T11:12:53.729-0400 [DEBUG] agent: (runner) enabling global quiescence for "99264c04aea457a2a707ad264732193f"

What's happening here is that the Variable update wakes the runner, which re-evaluates the template (the "rendering" log) but, because the resulting contents are identical to what's on disk, never writes the file, so no "rendered" log and no change_mode event follow.

But assuming that you know the template contents did change, there currently aren't any logs that could capture the difference between "we got the event but didn't trigger the change" and "we never got the event". That this happens around client restart is suspicious, because on restart we check whether the template has previously been rendered.
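
To make that distinction concrete, here's a rough sketch of the render decision (my own illustration, not consul-template's actual code): the runner always evaluates the template, but only a content diff against the file on disk produces a write, a "rendered" log, and a change_mode event.

    package sketch

    import (
        "bytes"
        "os"
    )

    // renderResult mirrors the idea of consul-template's WouldRender/DidRender:
    // "rendering" corresponds to evaluating the template at all, while "rendered"
    // (and any change_mode event) only happens when the bytes on disk change.
    type renderResult struct {
        WouldRender bool // template evaluated: the "rendering" log
        DidRender   bool // file written: the "rendered" log, events may fire
    }

    func render(contents []byte, dest string) (renderResult, error) {
        res := renderResult{WouldRender: true}
        existing, err := os.ReadFile(dest)
        if err != nil && !os.IsNotExist(err) {
            return res, err
        }
        if err == nil && bytes.Equal(existing, contents) {
            return res, nil // no-op change: no write, no event
        }
        if err := os.WriteFile(dest, contents, 0o644); err != nil {
            return res, err
        }
        res.DidRender = true
        return res, nil
    }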

Just in case there's a clue here... is the missing signal/restart happening immediately after the client restart, or some time afterwards?

ygersie commented 2 weeks ago

@tgross just so you're aware, I've been able to reproduce this after a ton of restarts in nonprod, and I've posted the TRACE logs along with my jobspec in the Enterprise support ticket. The important detail is that I'm using a dynamic PKI certificate. Restarting the Nomad agent always issues a new certificate, because Nomad doesn't introspect the already-rendered certificate on the filesystem to determine the remaining Vault lease time. Every restart should therefore trigger a restart or signal for each allocation with a dynamic PKI cert, and that's exactly the problem: it doesn't. Every so often the signal/restart is never triggered even though the certificate on the filesystem has been renewed. We then end up with a workload that has a fresh certificate on disk while the one in its runtime is never refreshed, so it expires.
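
To illustrate why the restart always mints a fresh certificate: the template's `secret "pki/issue/..."` call is a Vault write, and a freshly started runner has no memory of the previous lease, so it unconditionally re-executes the write rather than inspecting the certificate already on disk. A minimal sketch of that call using the Vault Go client (role and common name are placeholders):

    package sketch

    import vault "github.com/hashicorp/vault/api"

    // issueCert approximates what the template runner's vault.write dependency
    // does on every (re)start: an unconditional PKI issue call, which always
    // returns a brand-new certificate and lease.
    func issueCert(c *vault.Client) (*vault.Secret, error) {
        return c.Logical().Write("pki/issue/my-role", map[string]interface{}{
            "common_name": "my-service.internal",
        })
    }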

ygersie commented 2 weeks ago

Also important to note: this seems to happen only after an agent restart. We have not seen it during "normal" operations, where a certificate is renewed every < 10 days.

tgross commented 1 week ago

I haven't yet been able to reproduce this, but I do have some very preliminary findings. In the logs you've provided, the allocation of interest is 80f6cc3d. The node is running several other allocations, which makes it challenging to untangle the template runner logs, since they don't include the allocation ID. Fortunately, the allocation in question has at least one unique Vault path in its template, so it's possible to pick it out. The allocation has two dependencies.

First we see that both the update hook and prestart hook run concurrently:

    2024-10-07T15:46:10.067Z [TRACE] client.alloc_runner.task_runner: running update hooks: alloc_id=80f6cc3d-0456-9a23-c01c-90063c2c6606 task=ygersie-test start="2024-10-07 15:46:10.067840601 +0000 UTC m=+10.315820980"
    2024-10-07T15:46:10.067Z [TRACE] client.alloc_runner.task_runner: running prestart hook: alloc_id=80f6cc3d-0456-9a23-c01c-90063c2c6606 task=ygersie-test name=template start="2024-10-07 15:46:10.06789509 +0000 UTC m=+10.315875468"

All the template hook's top-level methods take a lock (ex. template_hook.go#L116-L117) so they can't interfere with each other. The prestart hook blocks until the first event is handled. In this case, we see that the update hook lost the race, because it exits last. So this appears to be a red herring:

    2024-10-07T15:46:10.143Z [TRACE] client.alloc_runner.task_runner: finished prestart hook: alloc_id=80f6cc3d-0456-9a23-c01c-90063c2c6606 task=ygersie-test name=template end="2024-10-07 15:46:10.14306976 +0000 UTC m=+10.391050138" duration=75.17467ms
    ...
    2024-10-07T15:46:10.143Z [TRACE] client.alloc_runner.task_runner: finished update hooks: alloc_id=80f6cc3d-0456-9a23-c01c-90063c2c6606 task=ygersie-test name=template end="2024-10-07 15:46:10.143131204 +0000 UTC m=+10.391111585" duration=75.092735ms
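
For reference, the locking pattern referenced above looks roughly like this simplified sketch (the real hook in template_hook.go does much more):

    package sketch

    import "sync"

    // templateHook sketches how the real hook serializes its top-level methods:
    // Prestart and Update can race to acquire managerLock, but never interleave.
    type templateHook struct {
        managerLock sync.Mutex
        rendered    bool
    }

    func (h *templateHook) Prestart() {
        h.managerLock.Lock()
        defer h.managerLock.Unlock()
        // Blocks here until the first render event has been handled.
        h.rendered = true
    }

    func (h *templateHook) Update() {
        h.managerLock.Lock()
        defer h.managerLock.Unlock()
        // Whichever hook loses the race simply runs second, hence the
        // "finished update hooks" log appearing after the prestart one.
    }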

Next, I've extracted what looks to be the relevant bits of the logs for this template here:

    2024-10-07T15:46:10.070Z [INFO] agent: (runner) starting
    2024-10-07T15:46:10.070Z [DEBUG] agent: (runner) running initial templates
    2024-10-07T15:46:10.070Z [DEBUG] agent: (runner) initiating run
    2024-10-07T15:46:10.070Z [DEBUG] agent: (runner) missing data for 1 dependencies
    2024-10-07T15:46:10.070Z [DEBUG] agent: (runner) missing dependency: vault.write(pki/issue/nomad-cie-orch-ygersie-test -> 8eed479f)
    2024-10-07T15:46:10.070Z [DEBUG] agent: (runner) add used dependency vault.write(pki/issue/nomad-cie-orch-ygersie-test -> 8eed479f) to missing since isLeader but do not have a watcher
    2024-10-07T15:46:10.070Z [DEBUG] agent: (runner) was not watching 1 dependencies
    2024-10-07T15:46:10.070Z [DEBUG] agent: (watcher) adding vault.write(pki/issue/nomad-cie-orch-ygersie-test -> 8eed479f)
    2024-10-07T15:46:10.070Z [TRACE] agent: (watcher) vault.write(pki/issue/nomad-cie-orch-ygersie-test -> 8eed479f) starting
    2024-10-07T15:46:10.070Z [DEBUG] agent: (runner) checking template 7a1c81c5aeaad195d4d0bd825982c12f
    2024-10-07T15:46:10.070Z [TRACE] agent: (view) vault.write(pki/issue/nomad-cie-orch-ygersie-test -> 8eed479f) starting fetch
    2024-10-07T15:46:10.071Z [TRACE] agent: vault.write(pki/issue/nomad-cie-orch-ygersie-test -> 8eed479f): PUT /v1/pki/issue/nomad-cie-orch-ygersie-test?stale=true&wait=5m0s
    2024-10-07T15:46:10.071Z [DEBUG] agent: (runner) missing data for 1 dependencies
    2024-10-07T15:46:10.071Z [DEBUG] agent: (runner) missing dependency: vault.write(pki/issue/nomad-cie-orch-ygersie-test -> 8eed479f)
    2024-10-07T15:46:10.071Z [DEBUG] agent: (runner) missing data for 1 dependencies
    ...
    2024-10-07T15:46:10.091Z [TRACE] agent: vault.write(pki/issue/nomad-cie-orch-ygersie-test -> 8eed479f): non-renewable secret, set sleep for 52m18.608832275s
    2024-10-07T15:46:10.091Z [TRACE] agent: (view) vault.write(pki/issue/nomad-cie-orch-ygersie-test -> 8eed479f) marking successful data response
    2024-10-07T15:46:10.091Z [TRACE] agent: (view) vault.write(pki/issue/nomad-cie-orch-ygersie-test -> 8eed479f) successful contact, resetting retries
    2024-10-07T15:46:10.091Z [TRACE] agent: (view) vault.write(pki/issue/nomad-cie-orch-ygersie-test -> 8eed479f) received data
    2024-10-07T15:46:10.091Z [TRACE] agent: (view) vault.write(pki/issue/nomad-cie-orch-ygersie-test -> 8eed479f) starting fetch
    2024-10-07T15:46:10.091Z [DEBUG] agent: (runner) receiving dependency vault.write(pki/issue/nomad-cie-orch-ygersie-test -> 8eed479f)
    2024-10-07T15:46:10.091Z [DEBUG] agent: (runner) initiating run
    ...
    2024-10-07T15:46:10.101Z [INFO] agent: (runner) rendered "(dynamic)" => "/data/podman/root/nomad/data/alloc/80f6cc3d-0456-9a23-c01c-90063c2c6606/ygersie-test/secrets/client.crt"
    ...
    2024-10-07T15:46:10.102Z [DEBUG] agent: (runner) rendering "(dynamic)" => "/data/podman/root/nomad/data/alloc/80f6cc3d-0456-9a23-c01c-90063c2c6606/ygersie-test/secrets/client.key"
    ...
    2024-10-07T15:46:10.112Z [INFO] agent: (runner) rendered "(dynamic)" => "/data/podman/root/nomad/data/alloc/80f6cc3d-0456-9a23-c01c-90063c2c6606/ygersie-test/secrets/client.key"
    2024-10-07T15:46:10.112Z [DEBUG] agent: (runner) vault.write(pki/issue/nomad-cie-orch-ygersie-test -> 8eed479f) is still needed
    ...
    2024-10-07T15:46:10.112Z [DEBUG] agent: (runner) rendering "(dynamic)" => "/data/podman/root/nomad/data/alloc/80f6cc3d-0456-9a23-c01c-90063c2c6606/ygersie-test/secrets/client.crt"
    ...
    2024-10-07T15:46:10.122Z [DEBUG] agent: (runner) rendering "(dynamic)" => "/data/podman/root/nomad/data/alloc/80f6cc3d-0456-9a23-c01c-90063c2c6606/ygersie-test/secrets/client.key"
    ...
    2024-10-07T15:46:10.142Z [DEBUG] agent: (runner) vault.write(pki/issue/nomad-cie-orch-ygersie-test -> 8eed479f) is still needed
    ...
    2024-10-07T15:46:10.143Z [TRACE] client.alloc_runner.task_runner: finished prestart hook: alloc_id=80f6cc3d-0456-9a23-c01c-90063c2c6606 task=ygersie-test name=template end="2024-10-07 15:46:10.14306976 +0000 UTC m=+10.391050138" duration=75.17467ms

What we can see from these is that we fetch the secret once from Vault, but then get multiple rendering/rendered logs for the two destination files. The last "rendering" doesn't have a paired "rendered".

So my hypothesis is that the extra render passes after restart are being treated as part of the first-time render. The reason we're seeing this during client restart is that we never trigger the on-render events for the first time we've rendered a template, which is also why multiple events are normally harmless. I haven't yet determined why there are multiple events and whether it's possible for them to appear outside of client restarts.
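
Sketched out, the suspected failure mode would look something like this (illustrative pseudologic under that assumption, not the actual implementation):

    package sketch

    type renderState struct{ firstRenderDone bool }

    // onRenderEvent sketches the hypothesized bug: change_mode is intentionally
    // suppressed for a template's first render (the task hasn't started yet). If
    // the restart-time re-render that carries the new certificate gets folded
    // into that first-render pass, the signal/restart is silently dropped even
    // though the file on disk was updated.
    func (s *renderState) onRenderEvent(contentsChanged bool, triggerChangeMode func()) {
        if !s.firstRenderDone {
            s.firstRenderDone = true // first render: never signal/restart here
            return
        }
        if contentsChanged {
            triggerChangeMode() // signal or restart per the jobspec
        }
    }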

All of this also aligns with an issue reported back in October 2022 https://github.com/hashicorp/nomad/issues/15057 that was never verified.

I'm going to huddle up with the rest of the engineering team to figure out the best path forward to a fix.

ygersie commented 1 week ago

@tgross since I can reproduce (although sometimes it takes a while) do you have a recommendation to improve visibility on what's going on?

tgross commented 1 week ago

> @tgross since I can reproduce (although sometimes it takes a while) do you have a recommendation to improve visibility on what's going on?

I think we're good on more information from your environment at this point, but thanks for offering. Right now @schmichael is spending some time working on reproducing more reliably and will report back once we know more. Thanks!