ygersie opened 2 weeks ago
Hi @ygersie! One thing that complicates debugging this is that the template runner will report the "rendering" logs you show even if there's no change to the contents. See below for an example of where I've updated a Nomad Variable with a no-op change:
2024-10-07T11:12:48.717-0400 [DEBUG] agent: (runner) receiving dependency nomad.var.block(nomad/jobs/example@default.global)
2024-10-07T11:12:48.717-0400 [DEBUG] agent: (runner) initiating run
2024-10-07T11:12:48.717-0400 [DEBUG] agent: (runner) checking template 99264c04aea457a2a707ad264732193f
2024-10-07T11:12:48.718-0400 [DEBUG] agent: (runner) diffing and updating dependencies
2024-10-07T11:12:48.718-0400 [DEBUG] agent: (runner) nomad.var.block(nomad/jobs/example@default.global) is still needed
2024-10-07T11:12:48.718-0400 [DEBUG] agent: (runner) watching 1 dependencies
2024-10-07T11:12:48.718-0400 [DEBUG] agent: (runner) all templates rendered
2024-10-07T11:12:53.718-0400 [DEBUG] agent: (runner) received template "99264c04aea457a2a707ad264732193f" from quiescence
2024-10-07T11:12:53.718-0400 [DEBUG] agent: (runner) initiating run
2024-10-07T11:12:53.718-0400 [DEBUG] agent: (runner) checking template 99264c04aea457a2a707ad264732193f
2024-10-07T11:12:53.720-0400 [DEBUG] agent: (runner) rendering "(dynamic)" => "/var/nomad/data/alloc/d222bce3-4bd6-2029-a6a5-742f3e662595/http/local/test.txt"
2024-10-07T11:12:53.729-0400 [DEBUG] agent: (runner) diffing and updating dependencies
2024-10-07T11:12:53.729-0400 [DEBUG] agent: (runner) nomad.var.block(nomad/jobs/example@default.global) is still needed
2024-10-07T11:12:53.729-0400 [DEBUG] agent: (runner) watching 1 dependencies
2024-10-07T11:12:53.729-0400 [DEBUG] agent: (runner) all templates rendered
2024-10-07T11:12:53.729-0400 [DEBUG] agent: (runner) enabling global quiescence for "99264c04aea457a2a707ad264732193f"
What's happening here is the following workflow: the template runner emits a render event (template.go#L302) and that triggers the change_mode.

But assuming that you know the template contents did change, there aren't currently any logs that could capture the difference between "we got the event but didn't trigger the change" and "we didn't get an event". That this is happening around client restart is suspicious, because we check whether the template has previously rendered.
Just in case there's a clue here... is the missing rendering happening immediately following client restart or some time afterwards?
@tgross just so you're aware, I've been able to reproduce after a ton of restarts in nonprod and posted the TRACE logs along with my jobspec in the Enterprise support ticket. What's important to know here is that I'm using a dynamic PKI certificate. Each restart of the Nomad agent triggers a new certificate, because Nomad doesn't introspect the already-rendered certificate on the filesystem to determine a new Vault lease time for renewal. A restart should therefore always trigger a restart or signal for each allocation with a dynamic PKI cert. The problem is that it doesn't: every so often a signal/restart is never triggered even though the actual certificate on the filesystem has been renewed. We then end up with a workload that has a new certificate on the filesystem, but the one in runtime is never refreshed, so it expires.
Also important to note, this seems to only happen after an agent restart. We have not seen this occurring during "normal" operations where a certificate is renewed every < 10 days.
I haven't yet been able to reproduce, but I do have some very preliminary findings. In the logs you've provided, the allocation of interest is 80f6cc3d. The node is running several other allocations, which makes it challenging to disentangle the template runner logs, since they don't include the allocation ID. Fortunately, the allocation in question has at least one unique Vault path in its template, which makes it possible to pick it out. The allocation has two dependencies:
vault.read(credentials/data/ca/chain/default)
vault.write(pki/issue/nomad-cie-orch-ygersie-test -> 8eed479f)
First we see that both the update hook and prestart hook run concurrently:
2024-10-07T15:46:10.067Z [TRACE] client.alloc_runner.task_runner: running update hooks: alloc_id=80f6cc3d-0456-9a23-c01c-90063c2c6606 task=ygersie-test start="2024-10-07 15:46:10.067840601 +0000 UTC m=+10.315820980"
2024-10-07T15:46:10.067Z [TRACE] client.alloc_runner.task_runner: running prestart hook: alloc_id=80f6cc3d-0456-9a23-c01c-90063c2c6606 task=ygersie-test name=template start="2024-10-07 15:46:10.06789509 +0000 UTC m=+10.315875468"
All the template hook's top-level methods take a lock (e.g. template_hook.go#L116-L117), so they can't interfere with each other. The prestart hook blocks until the first event is handled. In this case we see that the update hook lost the race, because it exits last. So this appears to be a red herring:
2024-10-07T15:46:10.143Z [TRACE] client.alloc_runner.task_runner: finished prestart hook: alloc_id=80f6cc3d-0456-9a23-c01c-90063c2c6606 task=ygersie-test name=template end="2024-10-07 15:46:10.14306976 +0000 UTC m=+10.391050138" duration=75.17467ms
...
2024-10-07T15:46:10.143Z [TRACE] client.alloc_runner.task_runner: finished update hooks: alloc_id=80f6cc3d-0456-9a23-c01c-90063c2c6606 task=ygersie-test name=template end="2024-10-07 15:46:10.143131204 +0000 UTC m=+10.391111585" duration=75.092735ms
Next, I've extracted what looks to be the relevant bits of the logs for this template here:
What we can see from these is that we fetch the secret once from Vault, but then we have multiple rendering/rendered logs for the two destination files. The last "rendering" doesn't have a paired "rendered" (runner.go#L887). The runner sets result.DidRender at runner.go#L926, and the template hook receives on TemplateRenderedCh at template.go#L314-L316. Just below that line you can see where we fire the on-rendered event if the rendered event is "dirty" (runner.go#L638), which means that the last event received determines whether the template state is dirty, regardless of how many events were actually produced.

So my hypothesis is that:
The reason we're seeing this during client restart is that we never trigger the on-render events for the first render of the template, so ordinarily we don't care whether there are multiple events. I haven't yet determined why there are multiple events, or whether it's possible for them to appear outside of client restarts.
All of this also aligns with an issue reported back in October 2022 https://github.com/hashicorp/nomad/issues/15057 that was never verified.
I'm going to huddle up with the rest of the engineering team to figure out the best path forward to a fix.
@tgross since I can reproduce (although sometimes it takes a while) do you have a recommendation to improve visibility on what's going on?
I think we're good on more information from your environment at this point, but thanks for offering. Right now @schmichael is spending some time working on reproducing more reliably and will report back once we know more. Thanks!
Nomad version
1.8.2+ent
Issue
Every time we restart a large fleet of clients, some allocations end up with rendered dynamic certificates but they haven't been signaled/restarted, despite having a proper change_mode set in the jobspec. This leads to production incidents where secrets have expired from the workload's perspective (it didn't get a signal or restart) even though the certificates on the filesystem have been updated.

Reproduction steps
I haven't been able to reproduce this in an isolated manner. It does happen every time on just a few allocations across our entire cluster after client restarts. On the same node there are many other allocations that get signaled/restarted just fine.
Logs
When issuing dynamic certificates from Vault through the template stanza, a new certificate is created on each client restart. This should always trigger a change_mode event, but occasionally that does not happen for some allocations, which leads to outages. Example from a client log snippet:

As you can see, the certificate has been rendered correctly but a re-rendered event was never triggered, or isn't shown in the log file. For this workload change_mode="restart" is set, so we should have seen the corresponding restart event in the logs. But it's not there.