hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

[WI] Vault renew self fails #23859

Open ygersie opened 2 months ago

ygersie commented 2 months ago

With workload identity, I found myself in a scenario where Vault token renewal fails indefinitely. It looks like the check for whether an error is fatal doesn't work correctly here, because the error message doesn't match any of the entries in this list: https://github.com/hashicorp/nomad/blob/36522ec6320b9663eca967ba1d6ebe7dfa856327/client/vaultclient/vaultclient.go#L439

    2024-08-22T18:45:35.294+0200 [ERROR] client.vault: error during renewal of lease or token failed due to a non-fatal error; retrying: name=default
    error=
    | failed to renew the vault token: Error making API request.
    |
    | URL: PUT http://localhost:8200/v1/auth/token/renew-self
    | Code: 400. Errors:
    |
    | * lease expired

In my setup this will never resolve, I assume because of:

  template {
    vault_retry {
      attempts    = 0
      backoff     = "250ms"
      max_backoff = "5m"
    }
  }

Juanadelacuesta commented 1 month ago

Hi @ygersie, can you please give us a little more context on what your problem is? When the vault client runs into a fatal error, one that will keep returning an error over time, like a lease expiration, it will stop renewing it. If you provide us with more information, we might get better insight into your particular problem and find a better way to help.

ygersie commented 1 month ago

Hey @Juanadelacuesta, thanks for your response.

> When the vault client runs into a fatal error, one that will keep returning an error over time, like a lease expiration, it will stop renewing it.

Exactly. In this case Nomad incorrectly determines that this is a non-fatal error and keeps retrying. There may be other errors surfacing now with the switch to WI that aren't caught by the current error checking. I wonder if it wouldn't be better to also check the Vault response status code: if it's in the 4XX range, it should by definition be fatal, right?

I've also seen something else that could be problematic. I'm running an extensive local lab setup of the hashistack that includes multiple regions, mTLS, and Vault + Consul integration using Workload Identity. If I leave my laptop idle overnight, in the morning a bunch of my jobs are dead and won't start anymore. Here's an example of such an alloc:

ID                     = 97415db2-cbb0-fb20-b907-54bb412376e3
Eval ID                = e9bf10f1
Name                   = teleport-proxy.teleport-proxy[0]
Node ID                = e7cb90f9
Node Name              = agent-1
Job ID                 = teleport-proxy
Job Version            = 0
Client Status          = failed
Client Description     = Failed tasks
Desired Status         = run
Desired Description    = <none>
Created                = 1h9m ago
Modified               = 28m2s ago
Reschedule Eligibility = 31m54s from now

Allocation Addresses (mode = "host"):
Label   Dynamic  Address
*proxy  yes      192.168.1.132:3080
*diag   yes      192.168.1.132:23153

Task "teleport-proxy" is "dead"
Task Resources:
CPU         Memory          Disk     Addresses
18/100 MHz  64 MiB/512 MiB  300 MiB

Task Events:
Started At     = 2024-09-09T07:43:24Z
Finished At    = 2024-09-09T08:24:39Z
Total Restarts = 1
Last Restart   = 2024-09-09T10:24:39+02:00

Recent Events:
Time                       Type              Description
2024-09-09T10:24:39+02:00  Not Restarting    Error was unrecoverable
2024-09-09T10:24:39+02:00  Task hook failed  consul_task: 1 error occurred:
    * failed to write Consul SI token: open /Users/ygersie/git/local-hashistack/data/nomad/us-east-1/1/alloc/97415db2-cbb0-fb20-b907-54bb412376e3/teleport-proxy/secrets/consul_token: permission denied
2024-09-09T10:24:39+02:00  Restarting        Task restarting in 0s
2024-09-09T10:24:39+02:00  Terminated        Exit Code: 0
2024-09-09T10:24:39+02:00  Restart Signaled  Vault: new Vault token acquired
2024-09-09T09:43:24+02:00  Started           Task started by client
2024-09-09T09:43:23+02:00  Task Setup        Building Task Directory
2024-09-09T09:43:21+02:00  Received          Task received by client

It looks like a new Vault token was acquired and a restart was executed, triggering the Prestart hooks. However, the consul_token file was created with permissions that do not allow writes:

-r--r-----@ 1 ygersie  staff  36 Sep  9 09:43 /Users/ygersie/git/local-hashistack/data/nomad/us-east-1/1/alloc/97415db2-cbb0-fb20-b907-54bb412376e3/teleport-proxy/secrets/consul_token

Which is set here: https://github.com/hashicorp/nomad/blob/main/client/allocrunner/taskrunner/consul_hook.go#L28