Open ygersie opened 2 months ago
Hi @ygersie, can you please give us a little more context on your problem? When the vault client runs into a fatal error, something that will keep on returning an error in time, like a lease expiration, it will stop renewing it. With more information, we might gain better insight into your particular problem and find a better way to help.
Hey @Juanadelacuesta, thanks for your response.
> When the vault client runs into a fatal error, something that will keep on returning an error in time, like a lease expiration, it will stop renewing it.
Exactly. In this case Nomad incorrectly determines that this is a non-fatal error and keeps retrying. There may be other errors bubbling up now with the switch to Workload Identity (WI) that the current error checking doesn't catch. I wonder whether it would be better to also check the Vault response status code: if it's in the 4XX range, it should by definition be fatal, right?
I've also seen something else that could be problematic. I'm running an extensive local lab setup of hashistack that includes multiple regions, mTLS and Vault + Consul integration using Workload Identity. If I leave my laptop idle overnight, in the morning a bunch of my jobs are dead and won't start anymore. Here's an example of such an alloc:
ID = 97415db2-cbb0-fb20-b907-54bb412376e3
Eval ID = e9bf10f1
Name = teleport-proxy.teleport-proxy[0]
Node ID = e7cb90f9
Node Name = agent-1
Job ID = teleport-proxy
Job Version = 0
Client Status = failed
Client Description = Failed tasks
Desired Status = run
Desired Description = <none>
Created = 1h9m ago
Modified = 28m2s ago
Reschedule Eligibility = 31m54s from now
Allocation Addresses (mode = "host"):
Label Dynamic Address
*proxy yes 192.168.1.132:3080
*diag yes 192.168.1.132:23153
Task "teleport-proxy" is "dead"
Task Resources:
CPU Memory Disk Addresses
18/100 MHz 64 MiB/512 MiB 300 MiB
Task Events:
Started At = 2024-09-09T07:43:24Z
Finished At = 2024-09-09T08:24:39Z
Total Restarts = 1
Last Restart = 2024-09-09T10:24:39+02:00
Recent Events:
Time Type Description
2024-09-09T10:24:39+02:00 Not Restarting Error was unrecoverable
2024-09-09T10:24:39+02:00 Task hook failed consul_task: 1 error occurred:
* failed to write Consul SI token: open /Users/ygersie/git/local-hashistack/data/nomad/us-east-1/1/alloc/97415db2-cbb0-fb20-b907-54bb412376e3/teleport-proxy/secrets/consul_token: permission denied
2024-09-09T10:24:39+02:00 Restarting Task restarting in 0s
2024-09-09T10:24:39+02:00 Terminated Exit Code: 0
2024-09-09T10:24:39+02:00 Restart Signaled Vault: new Vault token acquired
2024-09-09T09:43:24+02:00 Started Task started by client
2024-09-09T09:43:23+02:00 Task Setup Building Task Directory
2024-09-09T09:43:21+02:00 Received Task received by client
It looks like a new Vault token was acquired and a restart was executed, triggering the Prestart hooks. However, the consul_token file was created with permissions that do not allow writes:
-r--r-----@ 1 ygersie staff 36 Sep 9 09:43 /Users/ygersie/git/local-hashistack/data/nomad/us-east-1/1/alloc/97415db2-cbb0-fb20-b907-54bb412376e3/teleport-proxy/secrets/consul_token
Which is set here: https://github.com/hashicorp/nomad/blob/main/client/allocrunner/taskrunner/consul_hook.go#L28
With workload identity I found myself in a scenario where Vault token renewal fails indefinitely. It looks like determining whether this is a fatal error doesn't work correctly, as the error differs from those in this list: https://github.com/hashicorp/nomad/blob/36522ec6320b9663eca967ba1d6ebe7dfa856327/client/vaultclient/vaultclient.go#L439
In my setup this will never resolve, I assume because of: