OfficerKoo opened 1 month ago
@OfficerKoo your logs are mangled such that the timestamps don't line up (and run in reverse). But it looks like they're showing the template runner restarted the task at `2024-05-21T11:12:56.853Z`, but then it was still trying to download the Docker image to start the task when the next re-render happened at `2024-05-21T11:13:03.623Z`. As in, it didn't restart the task for the second template re-render simply because it hadn't yet started it from a few seconds before.
Then I see a gap of 2 hours before there's a user-initiated restart. Most of what I can tell here from the logs looks right... Nomad can't restart a task that hasn't yet started. If I'm missing something, it might help if you could show this with debug-level logs that have consistent timestamps.
@tgross sorry about that, looks like the logs got all twisted up when I was exporting them. You are right, it looks like the second re-render happens a second after the Docker image starts downloading; what can we do to circumvent this? I already tried adding a 1 minute splay and a minimum 1 minute wait.
We will try to enable debug logs on one client, but this will probably happen next week.
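For what it's worth, this is roughly what we plan to set on that client. A minimal sketch of an agent configuration fragment, assuming a standard client setup, not our actual config (alternatively, `nomad monitor -log-level=DEBUG` streams debug logs without touching the config):

```hcl
# Client agent configuration fragment (assumed standard setup, not our
# actual config). Requires an agent restart to take effect.
log_level = "DEBUG"

# Optional: also keep the logs on disk so exporting can't mangle timestamps.
log_file = "/var/log/nomad/nomad.log"
```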
> You are right, it looks like the second re-render happens a second after the Docker image starts downloading; what can we do to circumvent this? I already tried adding a 1 minute splay and a minimum 1 minute wait.
Ok, just wanted to make sure I was reading that correctly. The `splay` should help with this, so long as the `splay` length is longer than the window between the two re-renders plus the task start.
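For illustration, here's roughly where those knobs live; a hypothetical `template` stanza with placeholder paths and values, not the job in question:

```hcl
template {
  source      = "local/app.env.tpl"
  destination = "secrets/app.env"
  env         = true
  change_mode = "restart"

  # splay adds a random delay of up to this value before the change_mode
  # runs; to paper over the race above it has to cover the re-render
  # window plus the task start time, including the image pull.
  splay = "5m"

  # wait batches re-renders: hold for at least 1m of quiet after a
  # dependency changes, but never more than 5m in total.
  wait {
    min = "1m"
    max = "5m"
  }
}
```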
Taking a quick pass over the code, it looks like we send the `Restart` method and then immediately return to the loop that's watching for re-renders. That `Restart` method returns an error if the task isn't running... but the template re-render loop ignores the error, so it's never reported! So at the very least there's a bug where we should be logging the error so that you as the operator know what the heck is going on. I suspect this may have been changed at some point, because now that I'm seeing this all together it looks like you could be running into what was described in https://github.com/hashicorp/nomad/issues/5459
But fixing the log wouldn't fix the underlying problem... we could drop re-render events such that the task start could race with any environment set by `template.env = true` (as you've done here). I suspect we want to queue up re-render events that we get from the template runner, ideally coalescing any that were received in the same restart window.
> We will try to enable debug logs on one client, but this will probably happen next week.
Thanks! If you look at the logs and see anything you feel is too sensitive to share publicly, you can email them to nomad-oss-debug@hashicorp.com, which is only visible to members of the Nomad engineering team.
Hate to hijack the issue, but I've occasionally seen an issue with workloads not receiving a signal after updates as well. In my case, I have a workload with dynamic certs. When the Nomad agent is restarted, templates with dynamic certs are always refreshed, but in some cases the signal event isn't triggered.
I see in the log:
2024-06-03T14:47:15.810Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=f8c245e8-c418-dc35-b3b3-43f558a41d74 task=task-1 type=Received msg="Task received by client" failed=false
2024-06-03T14:47:15.810Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=f8c245e8-c418-dc35-b3b3-43f558a41d74 task=task2 type=Received msg="Task received by client" failed=false
2024-06-03T14:47:16.965Z [INFO] agent: (runner) rendered "(dynamic)" => "/data/podman/root/nomad/data/alloc/f8c245e8-c418-dc35-b3b3-43f558a41d74/task-1/secrets/server.key"
2024-06-03T14:47:16.976Z [INFO] agent: (runner) rendered "(dynamic)" => "/data/podman/root/nomad/data/alloc/f8c245e8-c418-dc35-b3b3-43f558a41d74/task-2/secrets/server.crt"
As you can see, I'm missing the expected re-rendered msg:
[INFO] client.alloc_runner.task_runner: Task event: alloc_id=f8c245e8-c418-dc35-b3b3-43f558a41d74 task=precursor type=Signaling msg="Template re-rendered" failed=false
It doesn't happen often but often enough to occasionally break production workloads due to expired certificates. I also haven't been able to reproduce by restarting the same agent that had this issue before.
The job template `splay` is set to `0`, so under normal circumstances a re-render would immediately trigger a signal.
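For reference, the templates in play look roughly like this; a hypothetical sketch where the paths and signal are assumptions, not the actual job:

```hcl
template {
  # Dynamic cert pulled from Vault; re-rendered whenever the cert rotates
  # or the Nomad agent restarts.
  source        = "local/server.crt.tpl"
  destination   = "secrets/server.crt"
  change_mode   = "signal"
  change_signal = "SIGHUP"

  # No randomized delay: a re-render should signal the task immediately.
  splay = "0s"
}
```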
@tgross Hi, so we enabled debug logs but didn't catch anything suspicious. It does look like the error happens when the container restarts, but the only suspicious part in the logs is:
2024-06-12 18:09:49.063 agent: (runner) checking template 8d86a9f697ea6be20429ce0f1cffc9ae
2024-06-12 18:09:49.063 agent: (runner) missing data for 4 dependencies
2024-06-12 18:09:49.064 agent: (runner) missing dependency: vault.read(kv/data/ci/prod/env)
2024-06-12 18:09:49.064 agent: (runner) missing dependency: vault.read(kv/data/ci/production/m2m)
2024-06-12 18:09:49.064 agent: (runner) missing dependency: vault.read(prod-database/creds/selectors)
2024-06-12 18:09:49.064 agent: (runner) missing dependency: vault.read(prod-database/creds/uploads)
2024-06-12 18:09:49.067 agent: (runner) add used dependency vault.read(kv/data/ci/prod/env) to missing since isLeader but do not have a watcher
2024-06-12 18:09:49.070 agent: (runner) add used dependency vault.read(kv/data/ci/production/m2m) to missing since isLeader but do not have a watcher
2024-06-12 18:09:49.070 agent: (runner) add used dependency vault.read(prod-database/creds/selectors) to missing since isLeader but do not have a watcher
2024-06-12 18:09:49.071 agent: (runner) add used dependency vault.read(prod-database/creds/uploads) to missing since isLeader but do not have a watcher
2024-06-12 18:09:49.071 agent: (runner) was not watching 4 dependencies
This happens right after the final config is rendered.
I also noticed that in the final config both `wait` and `splay` are set to 0 or disabled, which is 100% not the case in the job definition.
I attached the full logs; the allocation id is `63378fde-1456-96b8-2fc5-09cbbb288624` and the template id is `8d86a9f697ea6be20429ce0f1cffc9ae`.
Explore-logs-2024-06-14 16_12_32.txt
Nomad version
nomad_1.6.1
Operating system and Environment details
AWS, Amazon Linux 2, Production
Issue
We use the Vault Nomad integration for database secrets in our application. Sometimes Nomad re-renders the secrets in the file but fails to trigger a restart, so the env is not provisioned.
Usually one allocation of the job is affected; all jobs that use the database secret engine are affected. A restart fixes the problem.
The issue persists only on production; we failed to reproduce it on the staging cluster.
We checked some other issues, such as https://github.com/hashicorp/nomad/issues/6112, but it does not fit our profile, as we don't have jobs depending on other jobs; we just use the Vault database engine directly.
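A hypothetical sketch of the kind of template stanza involved; only the Vault path is taken from the logs in this thread, everything else is a placeholder:

```hcl
template {
  # Dynamic database credentials from the Vault database secrets engine.
  data = <<-EOT
    {{ with secret "prod-database/creds/selectors" }}
    DB_USERNAME={{ .Data.username }}
    DB_PASSWORD={{ .Data.password }}
    {{ end }}
  EOT

  destination = "secrets/db.env"
  env         = true          # rendered keys become task environment variables
  change_mode = "restart"     # the restart that sometimes never fires
}
```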
Reproduction steps
Expected Result
Actual Result
Job file (if appropriate)
Sample job, but all the jobs that use database secrets are affected
Nomad Server logs (if appropriate)
Nomad Client logs (if appropriate)