hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Poststop lifecycle task - Can't request Vault token for terminal allocation #16886

Closed beninghton closed 10 months ago

beninghton commented 1 year ago

Nomad version

Nomad v1.5.2 BuildDate 2023-03-21T22:54:38Z Revision 9a2fdb5f53dce81edf2802f0b64962e07596fd03

Operating system and Environment details

NAME="Fedora Linux" VERSION="35.20220313.3.1 (CoreOS)"

Issue

A task with a "poststop" lifecycle hook can't get a key/value from Vault, but a task in the same group with a "prestart" lifecycle hook can.

Reproduction steps

Run any task with a poststop lifecycle hook. You can place the vault block with a policy inside the task block, outside it under the job/group blocks, or in all of those places simultaneously; it doesn't matter, it will not work.

  task "poststoptask" {
    lifecycle {
      hook = "poststop"
    }
    vault {
      policies = ["any_vault_policy"]
    }
  }
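
For reference, a minimal self-contained job along these lines might look like the following (the main task, image, KV path, and field name are illustrative; only the poststop lifecycle, vault, and template blocks matter):

```hcl
job "poststop-vault-repro" {
  group "example" {
    # Any main task works; it just needs to run and stop so the poststop task fires.
    task "main" {
      driver = "docker"
      config {
        image   = "busybox:1"
        command = "sleep"
        args    = ["300"]
      }
    }

    task "poststoptask" {
      lifecycle {
        hook = "poststop"
      }

      vault {
        policies = ["any_vault_policy"]
      }

      driver = "docker"
      config {
        image   = "busybox:1"
        command = "cat"
        args    = ["${NOMAD_SECRETS_DIR}/kv.env"]
      }

      # Rendering this template is what triggers the Vault token derivation that
      # fails; the KV path and field below are placeholders.
      template {
        destination = "${NOMAD_SECRETS_DIR}/kv.env"
        data        = <<EOT
{{ with secret "kv/some/path" }}VALUE={{ .Data.value }}{{ end }}
EOT
      }
    }
  }
}
```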

Expected Result

The key/value is read from Vault.

Actual Result

The key/value is not read from Vault and the poststop task fails. In the UI dashboard I see:

Apr 14, '23 15:10:03 +0700 | Template | Missing: vault.read(kv/nomad-drain-operator/token)
Apr 14, '23 15:10:00 +0700 | Killing | Vault: server failed to derive vault token: Can't request Vault token for terminal allocation
Apr 14, '23 15:10:00 +0700 | Task Setup | Building Task Directory
Apr 14, '23 15:09:08 +0700 | Received | Task received by client

Job file (if appropriate)

Nomad Server logs (if appropriate)

Nomad Client logs (if appropriate)

Apr 14 08:38:13 enc-08 nomad[1206]:     2023-04-14T08:38:13.031Z [INFO]  agent: (runner) stopping
Apr 14 08:38:13 enc-08 nomad[1206]:     2023-04-14T08:38:13.031Z [INFO]  agent: (runner) received finish
Apr 14 08:38:13 enc-08 nomad[1206]:     2023-04-14T08:38:13.032Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=b1e02610-3291-0a39-5c0b-1f2d42cf3077 task=gpu_initializer_post type="Task Setup" msg="Building Task Directory" failed=false
Apr 14 08:38:13 enc-08 nomad[1206]:     2023-04-14T08:38:13.058Z [ERROR] client.vault: error deriving vault tokens: error="Can't request Vault token for terminal allocation"
Apr 14 08:38:13 enc-08 nomad[1206]:     2023-04-14T08:38:13.058Z [ERROR] client.vault: error deriving token: error="Can't request Vault token for terminal allocation" alloc_id=b1e02610-3291-0a39-5c0b-1f2d42cf3077 task_names=["gpu_initializer_post"]
Apr 14 08:38:13 enc-08 nomad[1206]:     2023-04-14T08:38:13.058Z [ERROR] client.alloc_runner.task_runner.task_hook.vault: failed to derive Vault token: alloc_id=b1e02610-3291-0a39-5c0b-1f2d42cf3077 task=gpu_initializer_post error="Can't request Vault token for terminal allocation" server_side=true
Apr 14 08:38:13 enc-08 nomad[1206]:     2023-04-14T08:38:13.058Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=b1e02610-3291-0a39-5c0b-1f2d42cf3077 task=gpu_initializer_post type=Killing msg="Vault: server failed to derive vault token: Can't request Vault token for terminal allocation" failed=true
Apr 14 08:38:13 enc-08 nomad[1206]:     2023-04-14T08:38:13.059Z [INFO]  agent: (runner) creating new runner (dry: false, once: false)
Apr 14 08:38:13 enc-08 nomad[1206]:     2023-04-14T08:38:13.059Z [INFO]  agent: (runner) creating watcher
Apr 14 08:38:13 enc-08 nomad[1206]:     2023-04-14T08:38:13.059Z [INFO]  agent: (runner) starting
Apr 14 08:38:13 enc-08 nomad[1206]:     2023-04-14T08:38:13.060Z [INFO]  client.gc: marking allocation for GC: alloc_id=b1e02610-3291-0a39-5c0b-1f2d42cf3077
Apr 14 08:38:13 enc-08 nomad[1206]:     2023-04-14T08:38:13.067Z [WARN]  agent: (view) vault.read(kv/nomad-drain-operator/token): vault.read(kv/nomad-drain-operator/token): Error making API request.
Apr 14 08:38:13 enc-08 nomad[1206]: URL: GET http://vault.wsoft.live:8200/v1/kv/nomad-drain-operator/token
Apr 14 08:38:13 enc-08 nomad[1206]: Code: 403. Errors:
Apr 14 08:38:13 enc-08 nomad[1206]: * permission denied (retry attempt 1 after "250ms")
Apr 14 08:38:13 enc-08 nomad[1206]:     2023-04-14T08:38:13.324Z [WARN]  agent: (view) vault.read(kv/nomad-drain-operator/token): vault.read(kv/nomad-drain-operator/token): Error making API request.
Apr 14 08:38:13 enc-08 nomad[1206]: URL: GET http://vault.wsoft.live:8200/v1/kv/nomad-drain-operator/token
Apr 14 08:38:13 enc-08 nomad[1206]: Code: 403. Errors:
Apr 14 08:38:13 enc-08 nomad[1206]: * permission denied (retry attempt 2 after "500ms")
Apr 14 08:38:13 enc-08 nomad[1206]:     2023-04-14T08:38:13.827Z [WARN]  agent: (view) vault.read(kv/nomad-drain-operator/token): vault.read(kv/nomad-drain-operator/token): Error making API request.
Apr 14 08:38:13 enc-08 nomad[1206]: URL: GET http://vault.wsoft.live:8200/v1/kv/nomad-drain-operator/token
Apr 14 08:38:13 enc-08 nomad[1206]: Code: 403. Errors:
Apr 14 08:38:13 enc-08 nomad[1206]: * permission denied (retry attempt 3 after "1s")
Apr 14 08:38:14 enc-08 nomad[1206]:     2023-04-14T08:38:14.830Z [WARN]  agent: (view) vault.read(kv/nomad-drain-operator/token): vault.read(kv/nomad-drain-operator/token): Error making API request.
Apr 14 08:38:14 enc-08 nomad[1206]: URL: GET http://vault.wsoft.live:8200/v1/kv/nomad-drain-operator/token
Apr 14 08:38:14 enc-08 nomad[1206]: Code: 403. Errors:
Apr 14 08:38:14 enc-08 nomad[1206]: * permission denied (retry attempt 4 after "2s")
Apr 14 08:38:15 enc-08 nomad[1206]:     2023-04-14T08:38:15.569Z [INFO]  client: node registration complete
Apr 14 08:38:16 enc-08 nomad[1206]:     2023-04-14T08:38:16.060Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=b1e02610-3291-0a39-5c0b-1f2d42cf3077 task=gpu_initializer_post type=Template msg="Missing: vault.read(kv/nomad-drain-operator/token)" failed=false
Apr 14 08:38:16 enc-08 nomad[1206]:     2023-04-14T08:38:16.841Z [WARN]  agent: (view) vault.read(kv/nomad-drain-operator/token): vault.read(kv/nomad-drain-operator/token): Error making API request.
Apr 14 08:38:16 enc-08 nomad[1206]: URL: GET http://vault.wsoft.live:8200/v1/kv/nomad-drain-operator/token
Apr 14 08:38:16 enc-08 nomad[1206]: Code: 403. Errors:
Apr 14 08:38:16 enc-08 nomad[1206]: * permission denied (retry attempt 5 after "4s")
Apr 14 08:38:17 enc-08 nomad[1206]:     2023-04-14T08:38:17.062Z [WARN]  client.alloc_runner.task_runner.task_hook.logmon.nomad: timed out waiting for read-side of process output pipe to close: alloc_id=b1e02610-3291-0a39-5c0b-1f2d42cf3077 task=gpu_initializer_post @module=logmon timestamp=2023-04-14T08:38:17.062Z
Apr 14 08:38:17 enc-08 nomad[1206]:     2023-04-14T08:38:17.062Z [WARN]  client.alloc_runner.task_runner.task_hook.logmon.nomad: timed out waiting for read-side of process output pipe to close: alloc_id=b1e02610-3291-0a39-5c0b-1f2d42cf3077 task=gpu_initializer_post @module=logmon timestamp=2023-04-14T08:38:17.062Z
Apr 14 08:38:17 enc-08 nomad[1206]:     2023-04-14T08:38:17.066Z [INFO]  client.alloc_runner.task_runner.task_hook.logmon: plugin process exited: alloc_id=b1e02610-3291-0a39-5c0b-1f2d42cf3077 task=gpu_initializer_post path=/usr/local/bin/nomad pid=1110697
Apr 14 08:38:17 enc-08 nomad[1206]:     2023-04-14T08:38:17.066Z [INFO]  agent: (runner) stopping
Apr 14 08:38:17 enc-08 nomad[1206]:     2023-04-14T08:38:17.066Z [INFO]  agent: (runner) received finish
lgfa29 commented 1 year ago

Thanks for the report @beninghton.

It seems like the error is coming from here, which is a very old part of the code: https://github.com/hashicorp/nomad/blob/3067191c5197cacda71b0643bea4327f13bd8ce3/nomad/node_endpoint.go#L1793-L1796

The problem is that when task lifecycle was implemented, this part of the code was not updated to consider this use case: the allocation is ServerTerminal, but it hasn't actually finished yet because it still has to run its poststop task.

I have placed this issue on our board for further triage and roadmapping.
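
To make the two notions of "terminal" concrete, here is a self-contained sketch of the distinction (the types and status values are simplified stand-ins, not the real nomad/structs definitions): the check linked above rejects the token request when the allocation is terminal from either the server's or the client's perspective, but for a poststop task only the client-side status should matter.

```go
package main

import "fmt"

// Simplified stand-ins for the relevant allocation fields; illustrative only.
type Allocation struct {
	DesiredStatus string // what the server wants: "run", "stop", or "evict"
	ClientStatus  string // what the client reports: "pending", "running", "complete", "failed", or "lost"
}

// Server-terminal: the server no longer wants this allocation to run.
func (a *Allocation) ServerTerminalStatus() bool {
	return a.DesiredStatus == "stop" || a.DesiredStatus == "evict"
}

// Client-terminal: the client has actually finished running it.
func (a *Allocation) ClientTerminalStatus() bool {
	switch a.ClientStatus {
	case "complete", "failed", "lost":
		return true
	default:
		return false
	}
}

// Terminal from either point of view.
func (a *Allocation) TerminalStatus() bool {
	return a.ServerTerminalStatus() || a.ClientTerminalStatus()
}

func main() {
	// A stopped job whose poststop task still needs to run on the client:
	// the server has marked it for stopping, but the client isn't done yet.
	alloc := &Allocation{DesiredStatus: "stop", ClientStatus: "running"}

	fmt.Println("TerminalStatus:      ", alloc.TerminalStatus())       // true  -> token request rejected
	fmt.Println("ClientTerminalStatus:", alloc.ClientTerminalStatus()) // false -> token request would be allowed
}
```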

dani commented 1 year ago

Does anyone have a workaround for this? I'm hitting the same issue with a poststop task that must access a KV secret from Vault.

dani commented 1 year ago

In my case, as I had both a prestart and a poststop task needing the same KV secret from vault, what I did

There's another limitation of poststop tasks (though this one is more expected): you can't rely on the service mesh, because the Envoy sidecar is terminated before the poststop tasks are executed.

tgross commented 11 months ago

Leaving a note here that, starting in Nomad 1.7.0-beta.1, we've deprecated the workflow where the client makes a request to the server to mint Vault/Consul tokens; instead, the Nomad client makes that request itself using the workload identity.

This should eliminate the specific bug that @lgfa29 described, inasmuch as we won't be talking to the Nomad server and checking that state anymore. But that doesn't necessarily close out this bug (I haven't tested this case yet), so I'll hold off closing until we've confirmed it.

heatzync commented 11 months ago

We have a very similar issue, but with a Nomad variable as opposed to a Vault key, so I'm adding the detail here hoping it will help.

Nomad version

Nomad v1.6.3 BuildDate 2023-10-30T12:58:10Z Revision e0497bff14378d68cad76a801cc0eba93ce05039

Operating system and Environment details

Ubuntu 22.04.3 LTS (Jammy)

Issue

The 2nd poststop task can't get a Nomad variable, but the other tasks and the 1st poststop task work fine.

Reproduction steps

We have a job with 5 tasks:

  1. Prestart task: setup-task (working)
  2. Main task (working)
  3. Post start task (working)
  4. 1st post stop task (working)
  5. 2nd post stop task: teardown-task (problematic)

All these tasks make use of template blocks that refer to the job’s corresponding Nomad variable, e.g., nomad/jobs/the-job-name:

Example job:

job "the-job-name" {
  group "the-job-group" {
    task "setup-task" {
      lifecycle {
        hook    = "prestart"
        sidecar = false
      }

      driver = "raw_exec"

      template {
        change_mode          = "noop" # this is a prestart task
        error_on_missing_key = true
        destination          = "$${NOMAD_SECRETS_DIR}/.setup"
        env                  = true
        data                 = <<EOT
{{ with nomadVar (printf "nomad/jobs/%v" (env "NOMAD_JOB_NAME")) }}
KEY1 = {{ .key1 }}
KEY2 = {{ .key2 }}
{{- end }}
EOT
      }

      config {
        command = "/path/to/setup/script.sh"
        args = [
          "$${KEY1}",
          "$${KEY2}"
        ]
      }
    }

    // not showing all the tasks
    // they all have template blocks similar to the ones above and below

    task "teardown-task" {
      lifecycle {
        hook    = "poststop"
        sidecar = false
      }

      driver = "raw_exec"

      template {
        change_mode          = "noop" # this is a poststop task
        error_on_missing_key = true
        destination          = "$${NOMAD_SECRETS_DIR}/.teardown"
        env                  = true
        data                 = <<EOT
{{ with nomadVar (printf "nomad/jobs/%v" (env "NOMAD_JOB_NAME")) }}
KEY1 = {{ .key1 }}
KEY2 = {{ .key2 }}
{{- end }}
EOT
      }

      config {
        command = "/path/to/teardown/script.sh"
        args = [
          "$${KEY1}",
          "$${KEY2}"
        ]
      }
    }
  }
}

The Nomad job runs as expected except for the last poststop teardown-task. When it is supposed to execute, it eventually times out after a couple of minutes (±12 attempts):

Nomad logs:

2023-10-26T08:12:18.266Z [ERROR] client.rpc: error performing RPC to server: error="rpc error: allocation is terminal" rpc=ACL.WhoAmI server=10.0.2.5:4647
2023-10-26T08:12:18.267Z [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: allocation is terminal" rpc=ACL.WhoAmI server=10.0.2.5:4647
2023-10-26T08:12:18.267Z [ERROR] http: error authenticating built API request: error="rpc error: allocation is terminal" url="/v1/var/nomad/jobs/the-job-name?namespace=default&stale=&wait=60000ms" method=GET
2023-10-26T08:12:18.268Z [WARN]  agent: (view) nomad.var.block(nomad/jobs/the-job-name@default.global): Unexpected response code: 500 (Server error authenticating request) (retry attempt 11 after "1m0s")

2023-10-26T08:13:18.276Z [ERROR] client.rpc: error performing RPC to server: error="rpc error: allocation is terminal" rpc=ACL.WhoAmI server=10.0.2.3:4647
2023-10-26T08:13:18.276Z [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: allocation is terminal" rpc=ACL.WhoAmI server=10.0.2.3:4647
2023-10-26T08:13:18.277Z [ERROR] http: error authenticating built API request: error="rpc error: allocation is terminal" url="/v1/var/nomad/jobs/the-job-name?namespace=default&stale=&wait=60000ms" method=GET
2023-10-26T08:13:18.278Z [WARN]  agent: (view) nomad.var.block(nomad/jobs/the-job-name@default.global): Unexpected response code: 500 (Server error authenticating request) (retry attempt 12 after "1m0s")

2023-10-26T08:14:18.284Z [ERROR] client.rpc: error performing RPC to server: error="rpc error: allocation is terminal" rpc=ACL.WhoAmI server=10.0.2.5:4647
2023-10-26T08:14:18.285Z [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: allocation is terminal" rpc=ACL.WhoAmI server=10.0.2.5:4647
2023-10-26T08:14:18.285Z [ERROR] http: error authenticating built API request: error="rpc error: allocation is terminal" url="/v1/var/nomad/jobs/the-job-name?namespace=default&stale=&wait=60000ms" method=GET
2023-10-26T08:14:18.286Z [ERROR] agent: (view) nomad.var.block(nomad/jobs/the-job-name@default.global): Unexpected response code: 500 (Server error authenticating request) (exceeded maximum retries)

2023-10-26T08:14:18.287Z [ERROR] agent: (runner) watcher reported error: nomad.var.block(nomad/jobs/the-job-name@default.global): Unexpected response code: 500 (Server error authenticating request)
2023-10-26T08:14:18.287Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=30764d6b-18b0-270f-fe97-33a301f9198c task=teardown-task type=Killing msg="Template failed: nomad.var.block(nomad/jobs/the-job-name@default.global): Unexpected response code: 500 (Server error authenticating request)" failed=true
2023-10-26T08:14:18.290Z [INFO]  client.gc: marking allocation for GC: alloc_id=30764d6b-18b0-270f-fe97-33a301f9198c
2023-10-26T08:14:22.292Z [WARN]  client.alloc_runner.task_runner.task_hook.logmon.nomad: timed out waiting for read-side of process output pipe to close: alloc_id=30764d6b-18b0-270f-fe97-33a301f9198c task=teardown-task @module=logmon timestamp=2023-10-26T08:14:22.291Z
2023-10-26T08:14:22.293Z [WARN]  client.alloc_runner.task_runner.task_hook.logmon.nomad: timed out waiting for read-side of process output pipe to close: alloc_id=30764d6b-18b0-270f-fe97-33a301f9198c task=teardown-task @module=logmon timestamp=2023-10-26T08:14:22.292Z
2023-10-26T08:14:22.296Z [INFO]  client.alloc_runner.task_runner.task_hook.logmon: plugin process exited: alloc_id=30764d6b-18b0-270f-fe97-33a301f9198c task=teardown-task path=/usr/bin/nomad pid=2570689
2023-10-26T08:14:22.297Z [INFO]  agent: (runner) stopping

We also see this in the web UI (screenshot omitted).

The main issues seem to be:

  1. Error performing RPC to server: allocation is terminal
  2. (view) nomad.var.block(nomad/jobs/the-job-name@default.global): Unexpected response code: 500 (Server error authenticating request)

I can assure you that the Nomad variable that it’s complaining about in the screenshot (whose name corresponds with the job name as per the example above) DOES exist.
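
For reference, the variable in question can be created and checked from the CLI like this (key names match the templates above; values are placeholders):

```shell
# Create/update the variable the templates read (illustrative values):
nomad var put nomad/jobs/the-job-name key1=value1 key2=value2

# Confirm it exists and is readable:
nomad var get nomad/jobs/the-job-name
```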

heatzync commented 11 months ago

As a workaround to the above scenario with 2 post stop tasks, I decided to combine the 2 post stop tasks into a single post stop task (seeing as the 1st post stop task was successful). This also failed with the Missing: nomad.var.block(nomad/jobs/the-job-name@default.global) event.

heatzync commented 11 months ago

I also tried these workarounds:

Workaround 1: Combined 2 poststop tasks into 1

(Mentioned in comment above)

  1. Prestart task: setup-task (working)
  2. Main task (working)
  3. Post start task: close (working)
  4. Combined post stop task: teardown-all (problematic - uses template with Consul key and another template with Nomad Var)

Finding: The single poststop task failed.

Workaround 2: Added a dummy task

I went back to the original, 2 poststop tasks and added a dummy.

  1. Prestart task: setup-task (working)
  2. Main task (working)
  3. Post start task (working)
  4. 1st post stop task: close (working - uses template with Consul key)
  5. 2nd post stop task: teardown-task (problematic - uses template with Nomad Var)
  6. 3rd post stop task: dummy (working - doesn't use template)

Finding: The poststop tasks are ordered alphabetically, so the dummy task was executed 2nd, instead of last.

Workaround 3: Added the dummy task at the end

I changed the name of dummy to zzz-dummy in an attempt to ensure the dummy task is executed/ordered last.

  1. Prestart task: setup-task (working)
  2. Main task (working)
  3. Post start task (working)
  4. 1st post stop task: close (working - uses template with Consul key)
  5. 2nd post stop task: teardown-task (problematic - uses template with Nomad Var)
  6. 3rd post stop task: zzz-dummy (working - doesn't use template)

Finding: The poststop tasks were ordered like I wanted, but they are executed concurrently, so it didn't make a difference.

Conclusion

A poststop task works as long as it DOES NOT make use of a template with a Nomad Var. In other words, if a poststop task makes use of a template with a Nomad Var, it fails.

tgross commented 10 months ago

Hi folks! I've been able to confirm that the new Workload Identity workflow fixes the Vault issue here. And I've confirmed that the fix proposed by @lgfa29 will correct the legacy workflow. I'll have a PR up for that shortly.

@heatzync, the Nomad Variables issue you've described is similar but not quite the same. It was already fixed in the Nomad 1.7.0 branch as part of a major auth refactoring, but I missed backporting the bug fix, so I'll follow up on that with a PR shortly.


Reproduction of the Vault issue.

diff --git a/nomad/node_endpoint.go b/nomad/node_endpoint.go
index 35c598641b..d3d4d0db3f 100644
--- a/nomad/node_endpoint.go
+++ b/nomad/node_endpoint.go
@@ -1860,7 +1860,7 @@ func (n *Node) DeriveVaultToken(args *structs.DeriveVaultTokenRequest, reply *st
                setError(fmt.Errorf("Allocation %q not running on Node %q", args.AllocID, args.NodeID), false)
                return nil
        }
-       if alloc.TerminalStatus() {
+       if alloc.ClientTerminalStatus() {
                setError(fmt.Errorf("Can't request Vault token for terminal allocation"), false)
                return nil
        }

For the legacy workflow, I've created a secret and written an appropriate policy. I'm using the following vault block configuration:

vault {
  address = "http://127.0.0.1:8200"
  enabled = true
  token   = "$redacted"
}
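
The secret and policy referenced above might look roughly like this (the policy name and secret path match the jobspec and events below; the key and value are illustrative):

```shell
# Write an illustrative secret (KV v2 mounted at secret/):
vault kv put secret/myapp access_key=supersecret

# A policy granting read access to it, named to match the jobspec's vault.policies:
vault policy write nomad-default - <<'EOF'
path "secret/data/myapp" {
  capabilities = ["read"]
}
EOF
```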

I'm using the following jobspec to test.

jobspec

```hcl
job "example" {
  group "web" {
    network {
      mode = "bridge"
      port "www" {
        to = 8001
      }
    }

    task "main" {
      driver = "docker"

      config {
        image   = "busybox:1"
        command = "httpd"
        args    = ["-v", "-f", "-p", "8001", "-h", "/local"]
        ports   = ["www"]
      }

      resources {
        cpu    = 100
        memory = 100
      }
    }

    task "poststop" {
      lifecycle {
        hook = "poststop"
      }

      driver = "raw_exec"

      vault {
        policies = ["nomad-default"]
      }

      config {
        command = "cat"
        args    = ["${NOMAD_SECRETS_DIR}/access.key"]
      }

      template {
        data = <
```
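
The template block and closing braces are cut off above; a plausible completion, assuming the template just renders a field from secret/data/myapp into the access.key file that the task cats (the field name is illustrative):

```hcl
      template {
        destination = "${NOMAD_SECRETS_DIR}/access.key"
        data        = <<EOT
{{ with secret "secret/data/myapp" }}{{ .Data.data.access_key }}{{ end }}
EOT
      }
    }
  }
}
```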

Before the patch, after stopping the job, I get the error described above:

$ nomad alloc status e91a
...
Task "poststop" (poststop) is "dead"
Task Resources:
CPU      Memory   Disk     Addresses
128 MHz  128 MiB  300 MiB

Task Events:
Started At     = N/A
Finished At    = 2023-12-01T16:14:18Z
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                       Type        Description
2023-12-01T11:14:21-05:00  Template    Missing: vault.read(secret/data/myapp)
2023-12-01T11:14:18-05:00  Killing     Vault: server failed to derive vault token: Can't request Vault token for terminal allocation
2023-12-01T11:14:18-05:00  Task Setup  Building Task Directory
2023-12-01T11:13:18-05:00  Received    Task received by client

After the patch, after stopping the job, everything works as expected:

Task "poststop" (poststop) is "dead"
Task Resources:
CPU        Memory       Disk     Addresses
0/128 MHz  0 B/128 MiB  300 MiB

Task Events:
Started At     = 2023-12-01T16:27:35Z
Finished At    = 2023-12-01T16:27:35Z
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                       Type        Description
2023-12-01T11:27:35-05:00  Terminated  Exit Code: 0
2023-12-01T11:27:35-05:00  Started     Task started by client
2023-12-01T11:27:35-05:00  Task Setup  Building Task Directory
2023-12-01T11:27:06-05:00  Received    Task received by client

I also wanted to make sure poststop tasks work with the new workflow, which I suspected but hadn't verified.

My vault block configuration for the new workflow is as follows:

vault {
  address = "http://127.0.0.1:8200"
  enabled = true

  default_identity {
    aud = ["vault.io"]
    ttl = "1h"
  }
}

I've configured auth method, binding rules, etc. as needed, adjusted the path of the secret, and added a vault.role field to the job spec.
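
The job-spec side of that change is small; roughly the following, where the role name is illustrative and must match a role in Vault's JWT auth method (with workload identities, the policies come from the Vault role rather than the jobspec):

```hcl
    task "poststop" {
      # ... lifecycle, driver, config, template as before ...

      vault {
        role = "nomad-workloads" # illustrative role name
      }
    }
```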

After running and stopping that job, it works as expected:

Task "poststop" (poststop) is "dead"
Task Resources:
CPU        Memory       Disk     Addresses
0/128 MHz  0 B/128 MiB  300 MiB

Task Events:
Started At     = 2023-12-01T16:51:18Z
Finished At    = 2023-12-01T16:51:18Z
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                       Type        Description
2023-12-01T11:51:18-05:00  Terminated  Exit Code: 0
2023-12-01T11:51:18-05:00  Started     Task started by client
2023-12-01T11:50:17-05:00  Task Setup  Building Task Directory
2023-12-01T11:49:53-05:00  Received    Task received by client
tgross commented 10 months ago

The fix has landed and will be backported to Nomad 1.6.x and 1.5.x.