Thanks for the report @beninghton.
It seems like the error is coming from here, which is a very old part of the code: https://github.com/hashicorp/nomad/blob/3067191c5197cacda71b0643bea4327f13bd8ce3/nomad/node_endpoint.go#L1793-L1796
The problem is that when task lifecycle was implemented, this part of the code was not updated to consider this use case: the allocation is set to be ServerTerminal, but it hasn't actually finished yet because it still has to run its poststop task.
I have placed this issue on our board for further triage and roadmapping.
Does anyone have a workaround for this? I'm hitting the same issue, in a poststop task which must access KV from Vault.
In my case, as I had both a prestart and a poststop task needing the same KV secret from Vault, what I did was mount the secret file rendered by the prestart task into the poststop task:
config {
  image = "myimage:latest"
  # Get the API password from prestart-task as we cannot get a vault token
  # during poststop, see https://github.com/hashicorp/nomad/issues/16886
  volumes = [
    "../prestart-task/secrets/my-api-key.env:/secrets/my-api-key.env:ro"
  ]
}
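For completeness, a sketch of what the prestart side of this handoff could look like. The task name prestart-task comes from the volume path above, while the policy name, secret path, and field name are placeholders, not from the original comment:

task "prestart-task" {
  lifecycle {
    hook = "prestart"
  }
  driver = "docker"

  vault {
    policies = ["my-api-policy"] # assumed policy name
  }

  # This renders while the allocation is still live, so deriving a Vault
  # token succeeds here even though it fails in the poststop task.
  template {
    destination = "secrets/my-api-key.env"
    data        = <<EOT
{{ with secret "secret/data/my-api" }}
MY_API_KEY={{ .Data.data.key }}
{{ end }}
EOT
  }

  config {
    image = "myimage:latest"
  }
}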
There's another limitation of poststop tasks (though a more expected one): you can't rely on the service mesh, because the Envoy sidecar is terminated before the poststop tasks are executed.
Leaving a note here that starting in Nomad 1.7.0-beta.1 we've deprecated the workflow where the client makes a request to the server to mint Vault/Consul tokens; instead, the Nomad client makes that request itself using the workload identity.
This should eliminate the specific bug that @lgfa29 is discussing, inasmuch as we won't be talking to the Nomad server anymore and checking that state. But that doesn't necessarily close out this bug (I haven't tested this case yet). I'll hold off closing until we've confirmed that.
We have a very similar issue, but with a Nomad variable as opposed to a Vault key, so I'm adding the details here hoping it will help.
Nomad v1.6.3 BuildDate 2023-10-30T12:58:10Z Revision e0497bff14378d68cad76a801cc0eba93ce05039
Ubuntu 22.04.3 LTS (Jammy)
The 2nd poststop task can't get a Nomad variable, but the other tasks and the 1st poststop task work fine.
We have a job with 5 tasks, including:
- setup-task (working)
- teardown-task (problematic)

All these tasks make use of template blocks that refer to the job's corresponding Nomad variable, e.g., nomad/jobs/the-job-name.
Example job:
job "the-job-name" {
group "the-job-group" {
task "setup-task" {
lifecycle {
hook = "prestart"
sidecar = false
}
driver = "raw_exec"
template {
change_mode = "noop" # this is a prestart task
error_on_missing_key = true
destination = "$${NOMAD_SECRETS_DIR}/.setup"
env = true
data = <<EOT
{{ with nomadVar (printf "nomad/jobs/%v" (env "NOMAD_JOB_NAME")) }}
KEY1 = {{ .key1 }}
KEY2 = {{ .key2 }}
{{- end }}
EOT
}
config {
command = "/path/to/setup/script.sh"
args = [
KEY1,
KEY2
]
}
}
// not showing all the tasks
// they all have template blocks similar to the ones above and below
task "teardown-task" {
lifecycle {
hook = "poststop"
sidecar = false
}
driver = "raw_exec"
template {
change_mode = "noop" # this is a poststop task
error_on_missing_key = true
destination = "$${NOMAD_SECRETS_DIR}/.teardown"
env = true
data = <<EOT
{{ with nomadVar (printf "nomad/jobs/%v" (env "NOMAD_JOB_NAME")) }}
KEY1 = {{ .key1 }}
KEY2 = {{ .key2 }}
{{- end }}
EOT
}
config {
command = "/path/to/teardown/script.sh"
args = [
KEY1,
KEY2
]
}
}
}
}
The Nomad job runs as expected except for the last poststop teardown-task. When it is supposed to execute, it eventually times out after a couple of minutes (±12 attempts):
Nomad logs:
2023-10-26T08:12:18.266Z [ERROR] client.rpc: error performing RPC to server: error="rpc error: allocation is terminal" rpc=ACL.WhoAmI server=10.0.2.5:4647
2023-10-26T08:12:18.267Z [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: allocation is terminal" rpc=ACL.WhoAmI server=10.0.2.5:4647
2023-10-26T08:12:18.267Z [ERROR] http: error authenticating built API request: error="rpc error: allocation is terminal" url="/v1/var/nomad/jobs/the-job-name?namespace=default&stale=&wait=60000ms" method=GET
2023-10-26T08:12:18.268Z [WARN] agent: (view) nomad.var.block(nomad/jobs/the-job-name@default.global): Unexpected response code: 500 (Server error authenticating request) (retry attempt 11 after "1m0s")
2023-10-26T08:13:18.276Z [ERROR] client.rpc: error performing RPC to server: error="rpc error: allocation is terminal" rpc=ACL.WhoAmI server=10.0.2.3:4647
2023-10-26T08:13:18.276Z [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: allocation is terminal" rpc=ACL.WhoAmI server=10.0.2.3:4647
2023-10-26T08:13:18.277Z [ERROR] http: error authenticating built API request: error="rpc error: allocation is terminal" url="/v1/var/nomad/jobs/the-job-name?namespace=default&stale=&wait=60000ms" method=GET
2023-10-26T08:13:18.278Z [WARN] agent: (view) nomad.var.block(nomad/jobs/the-job-name@default.global): Unexpected response code: 500 (Server error authenticating request) (retry attempt 12 after "1m0s")
2023-10-26T08:14:18.284Z [ERROR] client.rpc: error performing RPC to server: error="rpc error: allocation is terminal" rpc=ACL.WhoAmI server=10.0.2.5:4647
2023-10-26T08:14:18.285Z [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: allocation is terminal" rpc=ACL.WhoAmI server=10.0.2.5:4647
2023-10-26T08:14:18.285Z [ERROR] http: error authenticating built API request: error="rpc error: allocation is terminal" url="/v1/var/nomad/jobs/the-job-name?namespace=default&stale=&wait=60000ms" method=GET
2023-10-26T08:14:18.286Z [ERROR] agent: (view) nomad.var.block(nomad/jobs/the-job-name@default.global): Unexpected response code: 500 (Server error authenticating request) (exceeded maximum retries)
2023-10-26T08:14:18.287Z [ERROR] agent: (runner) watcher reported error: nomad.var.block(nomad/jobs/the-job-name@default.global): Unexpected response code: 500 (Server error authenticating request)
2023-10-26T08:14:18.287Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=30764d6b-18b0-270f-fe97-33a301f9198c task=teardown-task type=Killing msg="Template failed: nomad.var.block(nomad/jobs/the-job-name@default.global): Unexpected response code: 500 (Server error authenticating request)" failed=true
2023-10-26T08:14:18.290Z [INFO] client.gc: marking allocation for GC: alloc_id=30764d6b-18b0-270f-fe97-33a301f9198c
2023-10-26T08:14:22.292Z [WARN] client.alloc_runner.task_runner.task_hook.logmon.nomad: timed out waiting for read-side of process output pipe to close: alloc_id=30764d6b-18b0-270f-fe97-33a301f9198c task=teardown-task @module=logmon timestamp=2023-10-26T08:14:22.291Z
2023-10-26T08:14:22.293Z [WARN] client.alloc_runner.task_runner.task_hook.logmon.nomad: timed out waiting for read-side of process output pipe to close: alloc_id=30764d6b-18b0-270f-fe97-33a301f9198c task=teardown-task @module=logmon timestamp=2023-10-26T08:14:22.292Z
2023-10-26T08:14:22.296Z [INFO] client.alloc_runner.task_runner.task_hook.logmon: plugin process exited: alloc_id=30764d6b-18b0-270f-fe97-33a301f9198c task=teardown-task path=/usr/bin/nomad pid=2570689
2023-10-26T08:14:22.297Z [INFO] agent: (runner) stopping
We also see this in the web UI (screenshot omitted; it shows the Missing: nomad.var.block(nomad/jobs/the-job-name@default.global) template event).
The main issues seem to be the repeated "rpc error: allocation is terminal" authentication failures in the logs above and the resulting template failure.
I can assure you that the Nomad variable it's complaining about in the screenshot (whose name corresponds to the job name, as per the example above) DOES exist.
As a workaround to the above scenario with 2 poststop tasks, I decided to combine them into a single poststop task (seeing as the 1st poststop task was successful). This also failed with the Missing: nomad.var.block(nomad/jobs/the-job-name@default.global) event.
I also tried these workarounds:

1. (Mentioned in the comment above.) Combining into a single poststop task:
- close (working)
- teardown-all (problematic - uses a template with a Consul key and another template with a Nomad Var)
Finding: The single poststop task failed.

2. I went back to the original 2 poststop tasks and added a dummy:
- close (working - uses a template with a Consul key)
- teardown-task (problematic - uses a template with a Nomad Var)
- dummy (working - doesn't use a template)
Finding: The poststop tasks are ordered alphabetically, so the dummy task was executed 2nd, instead of last.

3. I changed the name of dummy to zzz-dummy in an attempt to ensure the dummy task is executed/ordered last:
- close (working - uses a template with a Consul key)
- teardown-task (problematic - uses a template with a Nomad Var)
- zzz-dummy (working - doesn't use a template)
Finding: The poststop tasks were ordered as I wanted, but they are executed concurrently, so it didn't make a difference.
A poststop task works as long as it DOES NOT make use of a template with a Nomad Var. In other words, if a poststop task makes use of a template with a Nomad Var, then it fails.
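For what it's worth, until there's a fix, the prestart-to-poststop file handoff described earlier in this thread for Vault could in principle be adapted to Nomad Variables, since prestart templates render while the allocation is still live. A sketch, with illustrative task names and paths:

task "fetch-vars" {
  lifecycle {
    hook = "prestart"
  }
  driver = "raw_exec"

  # Render the Nomad Variable into the shared alloc dir while the
  # allocation is still non-terminal on the server.
  template {
    destination = "${NOMAD_ALLOC_DIR}/data/teardown.env"
    data        = <<EOT
{{ with nomadVar (printf "nomad/jobs/%v" (env "NOMAD_JOB_NAME")) }}
KEY1={{ .key1 }}
KEY2={{ .key2 }}
{{- end }}
EOT
  }

  config {
    command = "/bin/true"
  }
}

# The poststop task sources that file instead of using its own template:
task "teardown-task" {
  lifecycle {
    hook = "poststop"
  }
  driver = "raw_exec"

  config {
    command = "/bin/sh"
    args    = ["-c", ". ${NOMAD_ALLOC_DIR}/data/teardown.env && /path/to/teardown/script.sh"]
  }
}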
Hi folks! I've been able to confirm that the new Workload Identity workflow fixes the Vault issue here. And I've confirmed that the fix proposed by @lgfa29 will correct the legacy workflow. I'll have a PR up for that shortly.
@dani, the Nomad Variables issue you've described is similar but not quite the same. It was already fixed in the Nomad 1.7.0 branch as part of a major auth refactoring. But I missed backporting the bug fix, so I'll follow up on that with a PR shortly.
Reproduction of the Vault issue.
diff --git a/nomad/node_endpoint.go b/nomad/node_endpoint.go
index 35c598641b..d3d4d0db3f 100644
--- a/nomad/node_endpoint.go
+++ b/nomad/node_endpoint.go
@@ -1860,7 +1860,7 @@ func (n *Node) DeriveVaultToken(args *structs.DeriveVaultTokenRequest, reply *st
setError(fmt.Errorf("Allocation %q not running on Node %q", args.AllocID, args.NodeID), false)
return nil
}
- if alloc.TerminalStatus() {
+ if alloc.ClientTerminalStatus() {
setError(fmt.Errorf("Can't request Vault token for terminal allocation"), false)
return nil
}
For the legacy workflow, I've created a secret and written an appropriate policy. I'm using the following vault block configuration:
vault {
  address = "http://127.0.0.1:8200"
  enabled = true
  token   = "$redacted"
}
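The policy itself isn't shown above; a minimal sketch of one that would cover the template read in this test (the secret/data/myapp path appears in the task events below, the rest is assumed):

path "secret/data/myapp" {
  capabilities = ["read"]
}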
I'm using the following jobspec to test.
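The jobspec itself wasn't captured in this thread, so here's a sketch of the shape of job that exercises this path, sized to match the task resources in the output below (task and policy names are assumptions):

job "vault-poststop-repro" {
  group "app" {

    task "main" {
      driver = "raw_exec"
      config {
        command = "/bin/sleep"
        args    = ["300"]
      }
    }

    # Renders a Vault secret after the allocation is stopped; before the
    # patch the server refuses to derive a token because the allocation
    # is already ServerTerminal.
    task "poststop" {
      lifecycle {
        hook = "poststop"
      }
      driver = "raw_exec"

      vault {
        policies = ["myapp-read"] # assumed policy name
      }

      template {
        destination = "secrets/app.env"
        env         = true
        data        = <<EOT
{{ with secret "secret/data/myapp" }}
PASSWORD={{ .Data.data.password }}
{{ end }}
EOT
      }

      resources {
        cpu    = 128
        memory = 128
      }

      config {
        command = "/bin/true"
      }
    }
  }
}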
Before the patch, after stopping the job, I get the error described above:
$ nomad alloc status e91a
...
Task "poststop" (poststop) is "dead"
Task Resources:
CPU Memory Disk Addresses
128 MHz 128 MiB 300 MiB
Task Events:
Started At = N/A
Finished At = 2023-12-01T16:14:18Z
Total Restarts = 0
Last Restart = N/A
Recent Events:
Time Type Description
2023-12-01T11:14:21-05:00 Template Missing: vault.read(secret/data/myapp)
2023-12-01T11:14:18-05:00 Killing Vault: server failed to derive vault token: Can't request Vault token for terminal allocation
2023-12-01T11:14:18-05:00 Task Setup Building Task Directory
2023-12-01T11:13:18-05:00 Received Task received by client
After the patch, stopping the job works as expected:
Task "poststop" (poststop) is "dead"
Task Resources:
CPU Memory Disk Addresses
0/128 MHz 0 B/128 MiB 300 MiB
Task Events:
Started At = 2023-12-01T16:27:35Z
Finished At = 2023-12-01T16:27:35Z
Total Restarts = 0
Last Restart = N/A
Recent Events:
Time Type Description
2023-12-01T11:27:35-05:00 Terminated Exit Code: 0
2023-12-01T11:27:35-05:00 Started Task started by client
2023-12-01T11:27:35-05:00 Task Setup Building Task Directory
2023-12-01T11:27:06-05:00 Received Task received by client
I also wanted to make sure poststop tasks work with the new workflow, which I suspected but hadn't verified. My vault block configuration for the new workflow is as follows:
vault {
  address = "http://127.0.0.1:8200"
  enabled = true

  default_identity {
    aud = ["vault.io"]
    ttl = "1h"
  }
}
I've configured the auth method, binding rules, etc. as needed, adjusted the path of the secret, and added a vault.role field to the jobspec.
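On the jobspec side that's a one-line addition inside the task's vault block; the role name here is an assumption:

vault {
  role = "nomad-workloads" # assumed name of the Vault JWT auth role
}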
After running and stopping that job, it works as expected:
Task "poststop" (poststop) is "dead"
Task Resources:
CPU Memory Disk Addresses
0/128 MHz 0 B/128 MiB 300 MiB
Task Events:
Started At = 2023-12-01T16:51:18Z
Finished At = 2023-12-01T16:51:18Z
Total Restarts = 0
Last Restart = N/A
Recent Events:
Time Type Description
2023-12-01T11:51:18-05:00 Terminated Exit Code: 0
2023-12-01T11:51:18-05:00 Started Task started by client
2023-12-01T11:50:17-05:00 Task Setup Building Task Directory
2023-12-01T11:49:53-05:00 Received Task received by client
The fix has landed and will get backported to Nomad 1.6.x and 1.5.x.
Nomad version
Nomad v1.5.2 BuildDate 2023-03-21T22:54:38Z Revision 9a2fdb5f53dce81edf2802f0b64962e07596fd03
Operating system and Environment details
NAME="Fedora Linux" VERSION="35.20220313.3.1 (CoreOS)"
Issue
A task with a "poststop" lifecycle hook can't get a key/value from Vault, but a task in the same group with a "prestart" lifecycle can.
Reproduction steps
Run any task with a poststop lifecycle hook. You can place the vault block with a policy inside the task block, or outside under the job/group blocks, or in all of these places simultaneously; it doesn't matter, it will not work.
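To illustrate, a minimal sketch of such a job: two tasks with identical Vault templates where only the lifecycle hook differs (all names and paths here are placeholders). The prestart one renders fine; the poststop one fails to derive a token.

job "example" {
  group "app" {

    task "works-prestart" {
      lifecycle {
        hook = "prestart"
      }
      driver = "raw_exec"
      vault {
        policies = ["kv-read"] # assumed policy name
      }
      template {
        destination = "secrets/kv.env"
        data        = "{{ with secret \"secret/data/myapp\" }}K={{ .Data.data.k }}{{ end }}"
      }
      config {
        command = "/bin/true"
      }
    }

    # Same template and policy; only the hook differs, and this one fails.
    task "fails-poststop" {
      lifecycle {
        hook = "poststop"
      }
      driver = "raw_exec"
      vault {
        policies = ["kv-read"]
      }
      template {
        destination = "secrets/kv.env"
        data        = "{{ with secret \"secret/data/myapp\" }}K={{ .Data.data.k }}{{ end }}"
      }
      config {
        command = "/bin/true"
      }
    }

    task "main" {
      driver = "raw_exec"
      config {
        command = "/bin/sleep"
        args    = ["60"]
      }
    }
  }
}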
Expected Result
The key/value is read from Vault.
Actual Result
The key/value is not read from Vault and the poststop task fails. In the UI dashboard I see the "failed to derive vault token: Can't request Vault token for terminal allocation" error (screenshot omitted).