hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Vault 'default' name is not set on server #19901

Open rwenz3l opened 7 months ago

rwenz3l commented 7 months ago

Nomad version

Output from nomad version

client+server:

$ nomad version
Nomad v1.7.3
BuildDate 2024-01-15T16:55:40Z
Revision 60ee328f97d19d2d2d9761251b895b06d82eb1a1

Operating system and Environment details

3 VMs for the servers, many client nodes for the jobs. All running Rocky Linux release 9.3

Issue

I recently started looking into the Vault integration. While this worked in the past, a recent test on the newer version produces an error when scheduling jobs:

Error submitting job: Unexpected response code: 500 (rpc error: 1 error occurred:
        * Vault "default" not enabled but used in the job)

The error comes from here:

https://github.com/hashicorp/nomad/blob/1e04fc461394d96bd4aab0e50cfa80048e1b5fd0/nomad/job_endpoint_hook_vault.go#L38
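For context, the check behind this error can be sketched roughly as follows (a simplified illustration, not Nomad's actual code; the function and parameter names here are made up for clarity):

```go
package main

import "fmt"

// validateVaultClusters is a simplified sketch of the server-side check:
// every vault block in the job must name a cluster (falling back to
// "default" when unset) that is enabled in the server's configuration.
func validateVaultClusters(jobClusters []string, enabled map[string]bool) error {
	for _, cluster := range jobClusters {
		if cluster == "" {
			cluster = "default" // an unset cluster name falls back to "default"
		}
		if !enabled[cluster] {
			return fmt.Errorf("Vault %q not enabled but used in the job", cluster)
		}
	}
	return nil
}

func main() {
	// Server has no enabled Vault cluster named "default": the job is rejected.
	err := validateVaultClusters([]string{""}, map[string]bool{})
	fmt.Println(err) // Vault "default" not enabled but used in the job
}
```

If the server-side map of enabled clusters ends up empty (or without a "default" entry), any job using a plain vault block fails this way.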

The name option is documented here, and the docs say it should be omitted for non-Enterprise setups: https://developer.hashicorp.com/nomad/docs/configuration/vault#parameters-for-nomad-clients-and-servers

The job spec mentions it here: https://developer.hashicorp.com/nomad/docs/job-specification/vault#cluster

My vault config on the servers was:

vault {
  enabled          = true
  token            = "{{ nomad_vault_token }}"
  address          = "https://vault.****.com"
  create_from_role = "nomad-cluster-access-auth"
}

config of the job/task:

      vault {
        # Attach our default policies to the task,
        # so it is able to retrieve secrets from vault.
        policies = ["nomad-cluster-access-kv"]
      }

I noticed there has been some work done on this, e.g. here: https://github.com/hashicorp/nomad/commit/1ef99f05364b7d3739befa6a789f0d55b2314dcf

and I think there might be a bug in the initialization of the "default" value: it's either not set or not read.

Reproduction steps

I think this can be reproduced by setting up a 1.7.3 cluster and simply integrating Vault. If I add name = "default" to both server and client, it works. If I don't, I get the error message above.
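For anyone hitting this, the workaround described above is to set the cluster name explicitly in the vault block on both servers and clients, e.g.:

```hcl
vault {
  enabled          = true
  name             = "default"   # workaround: set the cluster name explicitly
  token            = "{{ nomad_vault_token }}"
  address          = "https://vault.****.com"
  create_from_role = "nomad-cluster-access-auth"
}
```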

Expected Result

The "default" cluster is available by default.

Actual Result

Error submitting job: Unexpected response code: 500 (rpc error: 1 error occurred:
        * Vault "default" not enabled but used in the job)
lgfa29 commented 7 months ago

Hi @rwenz3l 👋

I have not been able to reproduce this problem 🤔

Would you be able to share the Vault configuration as returned by the /v1/agent/self API endpoint? You will need to query each of your servers.

Could you also make sure all three servers are running Nomad v1.7.3?

Thanks!

rwenz3l commented 7 months ago

Sure:

ctrl1

```json
"Vaults": [
  {
    "Addr": "https://vault.*******.com",
    "AllowUnauthenticated": true,
    "ConnectionRetryIntv": 30000000000,
    "DefaultIdentity": null,
    "Enabled": true,
    "JWTAuthBackendPath": "jwt-nomad",
    "Name": "default",
    "Namespace": "",
    "Role": "nomad-cluster-access-auth",
    "TLSCaFile": "",
    "TLSCaPath": "",
    "TLSCertFile": "",
    "TLSKeyFile": "",
    "TLSServerName": "",
    "TLSSkipVerify": null,
    "TaskTokenTTL": "",
    "Token": ""
  }
],
```

ctrl2

```json
"Vaults": [
  {
    "Addr": "https://vault.********.com",
    "AllowUnauthenticated": true,
    "ConnectionRetryIntv": 30000000000,
    "DefaultIdentity": null,
    "Enabled": true,
    "JWTAuthBackendPath": "jwt-nomad",
    "Name": "default",
    "Namespace": "",
    "Role": "nomad-cluster-access-auth",
    "TLSCaFile": "",
    "TLSCaPath": "",
    "TLSCertFile": "",
    "TLSKeyFile": "",
    "TLSServerName": "",
    "TLSSkipVerify": null,
    "TaskTokenTTL": "",
    "Token": ""
  }
],
```

ctrl3

```json
"Vaults": [
  {
    "Addr": "https://vault.**********.com",
    "AllowUnauthenticated": true,
    "ConnectionRetryIntv": 30000000000,
    "DefaultIdentity": null,
    "Enabled": true,
    "JWTAuthBackendPath": "jwt-nomad",
    "Name": "default",
    "Namespace": "",
    "Role": "nomad-cluster-access-auth",
    "TLSCaFile": "",
    "TLSCaPath": "",
    "TLSCertFile": "",
    "TLSKeyFile": "",
    "TLSServerName": "",
    "TLSSkipVerify": null,
    "TaskTokenTTL": "",
    "Token": ""
  }
],
```

I will continue my work on the vault integration and gather some more info with this.

lgfa29 commented 7 months ago

Thanks for the extra information @rwenz3l.

All three configurations look right: "Name": "default" and "Enabled": true. Is the cluster a fresh install, or did you upgrade the servers from a previous version of Nomad?

As an aside, since you mentioned you're just starting to look into the Vault integration, I suggest following the new workflow released in Nomad 1.7, as this will become the only supported option in the future. Here's a tutorial that covers it: https://developer.hashicorp.com/nomad/tutorials/integrate-vault/vault-acl
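For reference, a minimal sketch of the new workload-identity style server configuration, based on the field names that appear elsewhere in this thread (the mount path, audience, and TTL are illustrative; the tutorial above is the authoritative source):

```hcl
vault {
  enabled               = true
  address               = "https://vault.example.com"   # placeholder address
  jwt_auth_backend_path = "jwt-nomad"                   # JWT auth mount Nomad logs in against

  default_identity {
    aud = ["vault.io"]   # audience expected by the Vault JWT auth role
    ttl = "1h"
  }
}
```

With this workflow, tasks authenticate to Vault via their workload identity instead of tokens derived from a server-held Vault token.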

rwenz3l commented 7 months ago

We've been running this Nomad cluster since 1.3 or so, iirc; we usually update to the latest major/minor shortly after release.

We definitely plan to use the new workload identities with this. I initially configured the Vault integration before workload identity existed, and it worked fine back then, so I guess something before 1.7.x changed this key/value. From my limited view, it feels like the default value is not read properly when the key is missing from the nomad.hcl configuration. No need to invest too much time; I would advise setting name = "default" in the Nomad config in case someone else sees this error. If I find more info, I will update here.

Tirieru commented 6 months ago

I had the same error after configuring the Vault integration following the new 1.7 workflow.

After a few tries, I realized this was caused by a syntax error: I was missing a comma inside the vault block of the Nomad config file (which, in my case, is written in JSON).

I would expect Nomad not to start at all with a syntax error in the JSON config file, but apparently it only made the Vault integration not work? It might be something similar in your case.

lgfa29 commented 5 months ago

Thanks for the extra info @Tirieru. Improving agent configuration validation is something that's been on our plate for a bit now (https://github.com/hashicorp/nomad/pull/11819).

Would you be able to share the exact invalid configuration that caused this error? I have not been able to reproduce it yet.

Thanks!

Tirieru commented 5 months ago

This is how the Nomad server configuration looked while the error was happening:

{
  "name": "nomad-1",
  "data_dir": "/opt/nomad/data",
  "bind_addr": "<HOST_ADDRESS>",
  "datacenter": "dc1",
  "ports": {
    "http": 4646,
    "rpc": 4647,
    "serf": 4648
  },
  "addresses": {
    "http": "0.0.0.0",
    "rpc": "0.0.0.0",
    "serf": "0.0.0.0"
  },
  "advertise": {
    "http": "<HOST_ADDRESS>",
    "rpc": "<HOST_ADDRESS>",
    "serf": "<HOST_ADDRESS>"
  },
  "acl": {
    "enabled": true
  },
  "server": {
    "enabled": true,
    "rejoin_after_leave": true,
    "raft_protocol": 3,
    "encrypt": "<ENCRYPT_KEY>",
    "bootstrap_expect": 1,
    "job_gc_interval": "1h",
    "job_gc_threshold": "24h",
    "deployment_gc_threshold": "120h",
    "heartbeat_grace": "60s"
  },
  "limits": {
    "http_max_conns_per_client": 300,
    "rpc_max_conns_per_client": 300
  },
  "vault": {
    "token": "<VAULT_TOKEN>",
    "create_from_role": "nomad-cluster",
    "default_identity": {
      "aud": ["<VAULT_AUD>"],
      "ttl": ""
    }
    "address": "<VAULT_ADDRESS>",
    "enabled": true  
  },
  "log_level": "INFO"
}

Adding the missing comma after the closing brace of the default_identity object (line 45 of the file) fixed the issue.
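For comparison, the vault block becomes valid JSON once that comma is added:

```json
"vault": {
  "token": "<VAULT_TOKEN>",
  "create_from_role": "nomad-cluster",
  "default_identity": {
    "aud": ["<VAULT_AUD>"],
    "ttl": ""
  },
  "address": "<VAULT_ADDRESS>",
  "enabled": true
}
```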

lgfa29 commented 5 months ago

Thank you @Tirieru!

Yes, I can verify that the invalid JSON causes the same error message, but, unlike in @rwenz3l's case, the /v1/agent/self API returns the default Vault configuration as disabled:

    "Vaults": [
      {
        "Addr": "https://vault.service.consul:8200",
        "AllowUnauthenticated": true,
        "ConnectionRetryIntv": 30000000000,
        "DefaultIdentity": {
          "Audience": [
            "vault.io"
          ],
          "Env": null,
          "File": null,
          "TTL": null
        },
        "Enabled": null,
        "JWTAuthBackendPath": "jwt-nomad",
        "Name": "default",
        "Namespace": "",
        "Role": "nomad-cluster",
        "TLSCaFile": "",
        "TLSCaPath": "",
        "TLSCertFile": "",
        "TLSKeyFile": "",
        "TLSServerName": "",
        "TLSSkipVerify": null,
        "TaskTokenTTL": "",
        "Token": "<redacted>"
      }
    ],

I'm not sure why this configuration is accepted, though. I think the root cause is that the Nomad agent configuration is still parsed with the old HCLv1 syntax, which has a less strict JSON parser.
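As a quick illustration of the difference, Go's strict encoding/json parser rejects a fragment with the missing comma, while HCLv1's JSON parser evidently tolerates it (a sketch; the two inputs below differ only in that comma):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// isStrictJSON reports whether data parses under Go's strict JSON rules.
func isStrictJSON(data string) bool {
	return json.Valid([]byte(data))
}

func main() {
	// Missing comma after the default_identity object, as in the config above.
	invalid := `{"default_identity": {"ttl": ""} "address": "x"}`
	valid := `{"default_identity": {"ttl": ""}, "address": "x"}`
	fmt.Println(isStrictJSON(invalid), isStrictJSON(valid)) // false true
}
```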