hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Nomad client's Vault client gets stuck on the wrong Vault namespace #22230

Closed the-nando closed 2 days ago

the-nando commented 1 month ago

Nomad version

Nomad v1.7.7+ent
BuildDate 2024-04-16T19:58:29Z
Revision d0c2458a9c62314d727c3516e63d92026639cc89

Issue

I've run into an odd issue that I'm trying to make sense of; hopefully someone can help.
If I submit a job that sets a namespace in the Vault stanza and the job fails to deploy (for instance because the specified role doesn't exist), the namespace value seems to persist across job versions, i.e. I'm unable to set namespace = "" afterwards. Interestingly, the value persists even after the job is purged.

Reproduction steps

Run the following job:

job "test" {
  region = "region-1"
  datacenters = ["region-1a"]
  type        = "service"
  namespace = "default"

  group "nginx" {
    count = 1

    vault {
      namespace = ""
      cluster = "default"
      role = "nomad-workloads-error"
    }

    task "nginx" {
      driver = "docker"

      config {
        image = "alpine:3.18.5"
        command = "/bin/sh"
        args = ["-c", "while true ; do echo sleeping; sleep 1 ; done"]
      }

      resources {
        cpu    = 301
        memory = 128
      }
    }
  }
}

Error in the client logs as expected, since the role doesn't exist:

  | Vault: failed to derive vault token: failed to derive Vault token for identity vault_default: failed to login with JWT: Error making API request.
  |
  | URL: PUT http://localhost:8200/v1/auth/nomad-region-1/login
  | Code: 400. Errors:
  |
  | * role "nomad-workloads-error" could not be found

Update the job spec and resubmit the job:

vault {
  namespace = "foobar"
  cluster = "default"
  role = "nomad-workloads-error"
}

Error as expected:

  | Vault: failed to derive vault token: failed to derive Vault token for identity vault_default: failed to login with JWT: Error making API request.
  |
  | Namespace: foobar
  | URL: PUT http://localhost:8200/v1/auth/nomad-region-1/login
  | Code: 400. Errors:
  |
  | * role "nomad-workloads-error" could not be found

Stop and purge the job:

nomad job stop -purge test

Update the job spec and re-submit:

vault {
  namespace = ""
  cluster = "default"
  role = "nomad-workloads-error" 
}

Error:

  | Vault: failed to derive vault token: failed to derive Vault token for identity vault_default: failed to login with JWT: Error making API request.
  |
  | Namespace: foobar
  | URL: PUT http://localhost:8200/v1/auth/nomad-region-1/login
  | Code: 400. Errors:

Where does Namespace: foobar come from? A Nomad restart on the client "fixes" the problem.

Nomad client config:

vault {
  address = "http://localhost:8200"
  enabled = true
  name = "default"

  default_identity {
    aud = ["vault.io"]
    ttl = "15m"
  }

  jwt_auth_backend_path = "nomad-region-1"
}

tgross commented 1 month ago

@the-nando can you confirm that the job itself still has the right Vault namespace, i.e. that the expected value shows up when you run nomad job inspect?

Otherwise, your note that "A Nomad restart on the client "fixes" the problem" leads me to think this is a problem in the client. A Vault API client is expensive to set up because of TLS, so we reuse it between operations. But we have logic that's supposed to reset the namespace and token between uses (ref vaultclient.go#L252-L261).
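To illustrate how a reused client can carry a namespace over from one request to the next, here is a minimal Go sketch. It is not Nomad's code and does not claim to show the actual root cause: fakeVaultClient, sharedClient, loginSticky, and loginReset are hypothetical names invented for this example, and the "login" simply reports which namespace the request would have used.

```go
package main

import (
	"fmt"
	"sync"
)

// fakeVaultClient is a hypothetical stand-in for a long-lived Vault API
// client. It only remembers the namespace that the next request would use.
type fakeVaultClient struct {
	namespace string
}

// SetNamespace stores the namespace for subsequent requests.
func (c *fakeVaultClient) SetNamespace(ns string) { c.namespace = ns }

// login returns the namespace the login request would be sent with.
func (c *fakeVaultClient) login() string { return c.namespace }

// sharedClient guards a single reused client with a mutex, mimicking the
// "expensive to set up, so reuse it" pattern described above.
type sharedClient struct {
	mu     sync.Mutex
	client *fakeVaultClient
}

// loginSticky only applies the namespace when it is non-empty and never
// resets it, so an earlier value leaks into later requests.
func (s *sharedClient) loginSticky(ns string) string {
	s.mu.Lock()
	defer s.mu.Unlock()
	if ns != "" {
		s.client.SetNamespace(ns)
	}
	return s.client.login()
}

// loginReset always applies the requested namespace, including the empty
// one, so no state survives between requests.
func (s *sharedClient) loginReset(ns string) string {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.client.SetNamespace(ns)
	return s.client.login()
}

func main() {
	sticky := &sharedClient{client: &fakeVaultClient{}}
	sticky.loginSticky("foobar")
	fmt.Printf("sticky: %q\n", sticky.loginSticky("")) // prints sticky: "foobar"

	reset := &sharedClient{client: &fakeVaultClient{}}
	reset.loginReset("foobar")
	fmt.Printf("reset: %q\n", reset.loginReset("")) // prints reset: ""
}
```

In the sticky variant the stale value lives only in process memory, which would be consistent with an agent restart clearing it while a job purge does not.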

ygersie commented 1 month ago

When you do a job inspect there’s no namespace set. It’s indeed an issue on the client; the only way to fix it is by restarting the agent.

tgross commented 3 days ago

Hi @the-nando! I'm picking this up and fortunately/unfortunately was able to reproduce this with a very simple unit test. I'll have a fix up shortly.

tgross commented 3 days ago

PR: https://github.com/hashicorp/nomad/pull/23491