hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Sometimes template stanza does not reload allocations #22224

Open OfficerKoo opened 1 month ago

OfficerKoo commented 1 month ago

Nomad version

nomad_1.6.1

Operating system and Environment details

AWS, Amazon Linux 2, Production

Issue

We use the Nomad Vault integration for database secrets in our application. Sometimes Nomad re-renders the secrets file but fails to trigger a restart, so the new environment is never provisioned.

Usually one allocation of the task is affected, and all jobs that use the database secrets engine are affected. A manual restart fixes the problem.

The issue occurs only in production; we have been unable to reproduce it on the staging cluster.

We checked some other issues, such as https://github.com/hashicorp/nomad/issues/6112, but it does not fit our profile, as we don't have jobs depending on other jobs; we are just using the Vault database engine directly.

Reproduction steps

Expected Result

Actual Result

Job file (if appropriate)

A sample job; all jobs that use database secrets are affected.

job "analytics-svc" {
  namespace   = "production"
  type        = "service"
  datacenters = ["redacted"]

  update {
    max_parallel      = 1
    min_healthy_time  = "10s"
    healthy_deadline  = "3m"
    progress_deadline = "10m"
    auto_revert       = true
    canary            = 0
  }

  migrate {
    max_parallel     = 1
    health_check     = "checks"
    min_healthy_time = "10s"
    healthy_deadline = "5m"
  }

  group "analytics" {
    count = 2

    network {
      port "http" {
        to = 80
      }
    }

    scaling {
      enabled = true
      min     = local.scaled_min
      max     = 2

      policy {
        cooldown            = "2m"
        evaluation_interval = "1m"

        check "avg_allocs_cpu_usage_percentage" {
          source = "prometheus"
          query  = "avg(avg_over_time(nomad_client_allocs_cpu_total_ticks{exported_job=~\"analytics-svc\", namespace=~\"production\"}[2m])) / avg(avg_over_time(nomad_client_allocs_cpu_allocated{exported_job=~\"analytics-svc\", namespace=~\"production\"}[2m])) * 100"

          strategy "target-value" {
            target = 70
          }
        }

        check "avg_allocs_memory_usage_percentage" {
          source = "prometheus"
          query  = "avg(avg_over_time(nomad_client_allocs_memory_usage{exported_job=~\"analytics-svc\", namespace=~\"production\"}[2m])) / avg(avg_over_time(nomad_client_allocs_memory_allocated{exported_job=~\"analytics-svc\", namespace=~\"production\"}[2m])) * 100"

          strategy "target-value" {
            target = 70
          }
        }

      }
    }

    restart {
      attempts = 2
      interval = "30m"
      delay    = "15s"
      mode     = "fail"
    }

    task "service" {
      vault {
        policies = ["service"]
      }

      driver = "docker"

      config {
        image      = "redacted"
        force_pull = true

        auth {
          username       = "redacted"
          password       = "redacted"
          server_address = "registry.gitlab.com"
        }

        ports = ["http"]

        labels {
          environment = "production"
          service     = "${NOMAD_GROUP_NAME}"
        }

        logging {
          type = "fluentd"

          config {
            tag                   = "${NOMAD_TASK_NAME}.${NOMAD_JOB_NAME}"
            labels                = "environment,service"
            fluentd-address       = "tcp://127.0.0.1:9997"
            fluentd-async-connect = true
          }
        }
      }

      env {
        APP_ALLOWED_ORIGINS = "*"
        ROUTING_URL         = "redacted"
        APOLLO_GRAPH_REF    = "redacted"
        DB_HOST             = "redacted"
        DB_DATABASE         = "analytics"
        DB_STAGING_DATABASE = "analytics_staging"
      }

      template {
        change_mode = "restart"
        destination = "secrets/.env"
        env         = true
        splay       = "1m"
        wait {
          min = "1m"
        }

        data = <<EOF
{{ with secret "kv/data/ci/prod/env" }}
AUTH0_AUDIENCE={{ .Data.data.AUTH0_AUDIENCE }}
AUTH0_DOMAIN={{ .Data.data.AUTH0_DOMAIN }}
APOLLO_KEY={{ .Data.data.APOLLO_KEY }}{{end}}

{{ with secret "kv/data/ci/production/m2m" }}
MACHINE_AUTH_TOKEN={{ .Data.data.MACHINE_AUTH_TOKEN }}{{ end }}

{{ with secret "prod-database/creds/analytics" }}
DB_USER={{ .Data.username }}
DB_PASSWORD={{ .Data.password }}{{ end }}

EOF
      }

      resources {
        cpu    = 400
        memory = 250
      }

      service {
        name = "${NOMAD_JOB_NAME}"
        port = "http"

        tags = [
          "${NOMAD_NAMESPACE}",
          "${NOMAD_GROUP_NAME}",
          "${NOMAD_TASK_NAME}",
          "urlprefix-/production/${NOMAD_GROUP_NAME} strip=/production/${NOMAD_GROUP_NAME}",
          "scrape_service",
        ]

        check {
          name     = "alive"
          type     = "tcp"
          interval = "10s"
          timeout  = "2s"
        }
      }
    }
  }
}

Nomad Server logs (if appropriate)

Nomad Client logs (if appropriate)

2024-05-21T12:36:07.620Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=11145863-fb01-daea-f3dd-607a2a5ea73a task=service type=Started msg="Task started by client" failed=false
2024-05-21T12:36:05.622Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=11145863-fb01-daea-f3dd-607a2a5ea73a task=service type=Driver msg="Downloading image" failed=false
2024-05-21T12:36:05.615Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=11145863-fb01-daea-f3dd-607a2a5ea73a task=service type=Restarting msg="Task restarting in 0s" failed=false
2024-05-21T12:36:05.615Z [INFO]  client.alloc_runner.task_runner: restarting task: alloc_id=11145863-fb01-daea-f3dd-607a2a5ea73a task=service reason="" delay=0s
2024-05-21T12:36:05.600Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=11145863-fb01-daea-f3dd-607a2a5ea73a task=service type=Terminated msg="Exit Code: 137, Exit Message: \"Docker container exited with non-zero exit code: 137\"" failed=false
2024-05-21T12:36:00.211Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=11145863-fb01-daea-f3dd-607a2a5ea73a task=service type="Restart Signaled" msg="User requested running tasks to restart" failed=false
2024-05-21T11:13:04.862Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=11145863-fb01-daea-f3dd-607a2a5ea73a task=service type=Started msg="Task started by client" failed=false
2024-05-21T11:13:03.623Z [INFO]  agent: (runner) rendered "(dynamic)" => "/opt/nomad/data/alloc/11145863-fb01-daea-f3dd-607a2a5ea73a/service/secrets/.env"
2024-05-21T11:13:02.395Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=11145863-fb01-daea-f3dd-607a2a5ea73a task=service type=Driver msg="Downloading image" failed=false
2024-05-21T11:13:02.387Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=11145863-fb01-daea-f3dd-607a2a5ea73a task=service type=Restarting msg="Task restarting in 0s" failed=false
2024-05-21T11:13:02.386Z [INFO]  client.alloc_runner.task_runner: restarting task: alloc_id=11145863-fb01-daea-f3dd-607a2a5ea73a task=service reason="" delay=0s
2024-05-21T11:13:02.357Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=11145863-fb01-daea-f3dd-607a2a5ea73a task=service type=Terminated msg="Exit Code: 137, Exit Message: \"Docker container exited with non-zero exit code: 137\"" failed=false
2024-05-21T11:12:56.853Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=11145863-fb01-daea-f3dd-607a2a5ea73a task=service type="Restart Signaled" msg="Template with change_mode restart re-rendered" failed=false
2024-05-21T11:12:45.308Z [INFO]  agent: (runner) rendered "(dynamic)" => "/opt/nomad/data/alloc/11145863-fb01-daea-f3dd-607a2a5ea73a/service/secrets/.env"
tgross commented 1 month ago

@OfficerKoo your logs are mangled such that the timestamps don't line up (and are in reverse order). But it looks like they show the template runner restarting the task at 2024-05-21T11:12:56.853Z, and the client was still downloading the Docker image for that restart when the next re-render happened at 2024-05-21T11:13:03.623Z. In other words, it didn't restart the task for the second template re-render simply because the task hadn't finished starting from a few seconds before.

Then I see a gap of 2 hours before there's a user-initiated restart. Most of what I can tell from the logs here looks right: Nomad can't restart a task that hasn't yet started. If I'm missing something, it would help if you could capture this with debug-level logs that have consistent timestamps.

OfficerKoo commented 1 month ago

@tgross sorry about that, it looks like the logs got twisted up when I was exporting them. You are right, it looks like the second re-render happens a second after the Docker image download starts. What can we do to circumvent this? I already tried adding a 1-minute splay and a minimum 1-minute wait.

We will try to enable debug logs on one client, but that will probably happen next week.

tgross commented 1 month ago

You are right, it looks like the second re-render happens a second after the Docker image download starts. What can we do to circumvent this? I already tried adding a 1-minute splay and a minimum 1-minute wait.

Ok, just wanted to make sure I was reading that correctly. The splay should help with this, as long as the splay length is longer than the window between the two re-renders plus the task start.
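For illustration only, a widened `splay` plus a bounded `wait` on the template stanza from the job above might look like the fragment below. The durations are made-up examples, not recommendations; the splay would need to exceed the observed gap between re-renders plus the image pull time.

```hcl
template {
  change_mode = "restart"
  destination = "secrets/.env"
  env         = true
  # Example value only: comfortably longer than the ~7s re-render gap
  # plus the Docker pull seen in the logs above.
  splay       = "5m"
  wait {
    min = "1m"
    # An upper bound lets rapid successive renders batch together
    # instead of each firing its own restart.
    max = "5m"
  }
  # data = ... (unchanged from the original job)
}
```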

Taking a quick pass over the code, it looks like we call the Restart method and then immediately return to the loop that watches for re-renders. That Restart method returns an error if the task isn't running... but the template re-render loop ignores the error, so it's never reported! So at the very least there's a bug here: we should be logging the error so that you as the operator know what's going on. I suspect this behavior may have changed at some point, because seeing this all together, it looks like you could be running into what was described in https://github.com/hashicorp/nomad/issues/5459
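A minimal Go sketch of the pattern described above (this is not Nomad's actual code; `taskRunner`, `Restart`, and `ErrTaskNotRunning` are hypothetical stand-ins):

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-in for the error a task-runner Restart call
// could return when the task has not started yet.
var ErrTaskNotRunning = errors.New("task not running")

type taskRunner struct {
	running bool
}

// Restart fails when the task has not started yet, mirroring the
// behavior described in the comment above.
func (t *taskRunner) Restart() error {
	if !t.running {
		return ErrTaskNotRunning
	}
	return nil
}

// onRerenderBuggy mirrors the bug: the error from Restart is silently
// discarded, so a dropped restart is invisible to the operator.
func onRerenderBuggy(t *taskRunner) {
	_ = t.Restart() // error ignored; nothing is logged
}

// onRerenderFixed surfaces the error so the logs explain why freshly
// rendered secrets were never picked up.
func onRerenderFixed(t *taskRunner, logf func(string, ...any)) error {
	if err := t.Restart(); err != nil {
		logf("template re-render restart failed: %v", err)
		return err
	}
	return nil
}

func main() {
	tr := &taskRunner{running: false} // task still pulling its image
	onRerenderBuggy(tr)               // silently dropped
	if err := onRerenderFixed(tr, func(format string, args ...any) {
		fmt.Printf(format+"\n", args...)
	}); err != nil {
		fmt.Println("restart dropped:", err)
	}
}
```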

But fixing the logging wouldn't fix the underlying problem: we can drop re-render events, so the task start can race with any environment set by `template.env = true` (as you've done here). I suspect we want to queue up the re-render events we get from the template runner, ideally coalescing any that were received in the same restart window.
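The queue-and-coalesce idea could be sketched in Go like this (a hypothetical illustration, not Nomad's implementation): drain whatever re-render events accumulated while a restart was in flight and collapse them into a single pending restart.

```go
package main

import "fmt"

// coalesce drains any re-render events that accumulated while a
// restart was in flight and collapses them into a single pending
// restart. A sketch of the coalescing idea, not Nomad's code.
func coalesce(events <-chan struct{}) bool {
	pending := false
	for {
		select {
		case <-events:
			pending = true
		default:
			return pending
		}
	}
}

func main() {
	events := make(chan struct{}, 8)
	// Three re-renders land while the task is still starting...
	events <- struct{}{}
	events <- struct{}{}
	events <- struct{}{}
	// ...but they owe only one follow-up restart.
	fmt.Println(coalesce(events)) // prints: true
	fmt.Println(coalesce(events)) // prints: false (nothing pending)
}
```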

We will try to enable debug logs on one client, but that will probably happen next week.

Thanks! If you look at the logs and see anything you feel is too sensitive to share publicly, you can email them to nomad-oss-debug@hashicorp.com, which is only visible to members of the Nomad engineering team.

ygersie commented 1 month ago

Hate to hijack the issue, but I've occasionally seen workloads not receiving a signal after updates as well. In my case I have a workload with dynamic certs. When the Nomad agent is restarted, templates with dynamic certs are always re-rendered, but in some cases the signal event isn't triggered.

I see in the log:

2024-06-03T14:47:15.810Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=f8c245e8-c418-dc35-b3b3-43f558a41d74 task=task-1 type=Received msg="Task received by client" failed=false
2024-06-03T14:47:15.810Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=f8c245e8-c418-dc35-b3b3-43f558a41d74 task=task2 type=Received msg="Task received by client" failed=false
2024-06-03T14:47:16.965Z [INFO]  agent: (runner) rendered "(dynamic)" => "/data/podman/root/nomad/data/alloc/f8c245e8-c418-dc35-b3b3-43f558a41d74/task-1/secrets/server.key"
2024-06-03T14:47:16.976Z [INFO]  agent: (runner) rendered "(dynamic)" => "/data/podman/root/nomad/data/alloc/f8c245e8-c418-dc35-b3b3-43f558a41d74/task-2/secrets/server.crt"

As you can see, I'm missing the expected re-rendered message:

[INFO]  client.alloc_runner.task_runner: Task event: alloc_id=f8c245e8-c418-dc35-b3b3-43f558a41d74 task=precursor type=Signaling msg="Template re-rendered" failed=false

It doesn't happen often, but often enough to occasionally break production workloads due to expired certificates. I also haven't been able to reproduce it by restarting the same agent that previously had the issue.

The job template splay is set to 0, so under normal circumstances a re-render would immediately trigger a signal.

OfficerKoo commented 1 month ago

@tgross Hi, so we enabled debug logs but didn't catch anything suspicious. It does look like the error happens when the container restarts; the only suspicious part in the logs is:

2024-06-12 18:09:49.063 agent: (runner) checking template 8d86a9f697ea6be20429ce0f1cffc9ae
2024-06-12 18:09:49.063 agent: (runner) missing data for 4 dependencies
2024-06-12 18:09:49.064 agent: (runner) missing dependency: vault.read(kv/data/ci/prod/env)
2024-06-12 18:09:49.064 agent: (runner) missing dependency: vault.read(kv/data/ci/production/m2m)
2024-06-12 18:09:49.064 agent: (runner) missing dependency: vault.read(prod-database/creds/selectors)
2024-06-12 18:09:49.064 agent: (runner) missing dependency: vault.read(prod-database/creds/uploads)
2024-06-12 18:09:49.067 agent: (runner) add used dependency vault.read(kv/data/ci/prod/env) to missing since isLeader but do not have a watcher
2024-06-12 18:09:49.070 agent: (runner) add used dependency vault.read(kv/data/ci/production/m2m) to missing since isLeader but do not have a watcher
2024-06-12 18:09:49.070 agent: (runner) add used dependency vault.read(prod-database/creds/selectors) to missing since isLeader but do not have a watcher
2024-06-12 18:09:49.071 agent: (runner) add used dependency vault.read(prod-database/creds/uploads) to missing since isLeader but do not have a watcher
2024-06-12 18:09:49.071 agent: (runner) was not watching 4 dependencies

This happens right after the final config is rendered. I also noticed that in the final config both wait and splay are set to 0 or disabled, which is definitely not the case in the job definition. I attached full logs; the allocation id is 63378fde-1456-96b8-2fc5-09cbbb288624 and the template id is 8d86a9f697ea6be20429ce0f1cffc9ae. Explore-logs-2024-06-14 16_12_32.txt