hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

vault block config merging regression #19482

Closed: tgross closed this issue 10 months ago

tgross commented 11 months ago

@joliver reported in https://github.com/hashicorp/nomad/pull/19439#issuecomment-1856371614 that the fix for #19380 may have introduced a regression in merging configuration files for the vault block:

It appears there's one small regression in this. Our configuration is spread across multiple files, which are merged together using the Nomad configuration merge behavior.

00-vault.hcl:

vault {
  address = "https://vault.company.com:8200"
}

25-server.hcl:

vault {
  enabled          = true
  create_from_role = "role-name-here"
}

In Nomad 1.6.x and up through Nomad 1.7.1, the above behavior was working properly. With the introduction of Nomad 1.7.2, I believe the merge behavior has changed: when the later file doesn't specify a Vault address, a default address is used, which then overwrites/hides the value specified in the earlier file.
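
For illustration, here's a minimal, hypothetical Go sketch of how this kind of regression can happen when per-file defaults are applied before merging. This is not Nomad's actual merge code; the struct, field names, and default address are assumptions for the example:

// Hypothetical sketch of the suspected bug, not Nomad's real merge code.
package main

import "fmt"

type VaultConfig struct {
	Addr    string
	Enabled bool
	Role    string
}

// applyDefaults fills unset fields. Running this per file, before merging,
// turns "unset" into a concrete default that can no longer be told apart
// from a value the operator actually wrote.
func applyDefaults(c *VaultConfig) {
	if c.Addr == "" {
		c.Addr = "https://vault.service.consul:8200" // assumed default
	}
}

// merge prefers the later file's value whenever it is non-empty.
func merge(a, b VaultConfig) VaultConfig {
	out := a
	if b.Addr != "" {
		out.Addr = b.Addr // never empty once defaults have run, so it always wins
	}
	if b.Enabled {
		out.Enabled = true
	}
	if b.Role != "" {
		out.Role = b.Role
	}
	return out
}

func main() {
	fileA := VaultConfig{Addr: "https://vault.company.com:8200"}
	fileB := VaultConfig{Enabled: true, Role: "role-name-here"}
	applyDefaults(&fileA)
	applyDefaults(&fileB) // fileB.Addr silently becomes the default here
	fmt.Println(merge(fileA, fileB).Addr) // prints the default, not vault.company.com
}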

tgross commented 11 months ago

Weirdly, if I had to guess I'd assume this regression was introduced in https://github.com/hashicorp/nomad/pull/19349, but that shipped in Nomad 1.7.1.

joliver commented 11 months ago

You're likely correct. I'm working with some others on this and it appears to be from 1.7.1. My initial report was incorrect. That said, we are seeing this behavior in 1.7.2.

lgfa29 commented 11 months ago

Hi @joliver 👋

I am not able to reproduce this problem. Loading the Nomad configuration from multiple files like you describe does result in a merged configuration.

Here's my test setup:

# 00-vault.hcl
vault {
  address = "http://127.0.0.1:8300" # Run Vault in non-default port for testing
}
# 25-server.hcl
vault {
  enabled          = true
  create_from_role = "nomad-cluster"
}
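
Both files sit in the same directory in this setup, so a single -config flag pointed at that directory loads them both (the ./nomad.d path here is just an assumption for this test):

$ nomad agent -config ./nomad.d/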

After starting the Nomad agent I can query it for its config and see that all vault blocks have been merged:

$ nomad operator api /v1/agent/self | jq '.config.Vaults'
[
  {
    "Addr": "http://127.0.0.1:8300",
    "AllowUnauthenticated": true,
    "ConnectionRetryIntv": 30000000000,
    "DefaultIdentity": null,
    "Enabled": true,
    "JWTAuthBackendPath": "jwt-nomad",
    "Name": "default",
    "Namespace": "",
    "Role": "nomad-cluster",
    "TLSCaFile": "",
    "TLSCaPath": "",
    "TLSCertFile": "",
    "TLSKeyFile": "",
    "TLSServerName": "",
    "TLSSkipVerify": null,
    "TaskTokenTTL": "",
    "Token": "<redacted>"
  }
]
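
To check just the merged address, the same endpoint can be filtered more narrowly with a standard jq path:

$ nomad operator api /v1/agent/self | jq '.config.Vaults[0].Addr'
"http://127.0.0.1:8300"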

Trying to run a job also yields the expected error, showing that Nomad is trying to connect to the non-default address set in the configuration file:

    2023-12-14T16:26:01.512-0500 [WARN]  nomad.vault: failed to contact Vault API: retry=30s error="Get \"http://127.0.0.1:8300/v1/sys/health?drsecondarycode=299&performancestandbycode=299&sealedcode=299&standbycode=299&uninitcode=299\": dial tcp 127.0.0.1:8300: connect: connection refused"

Could you provide additional information?

  1. What exact behaviour are you experiencing? Reproduction steps, log messages, screenshots, etc. would be great!
  2. Which versions of Nomad are running on clients and servers?

Thanks!

joliver commented 11 months ago

We are running Nomad v1.7.2 on Ubuntu 22.04 (amd64).

Using the steps you provided above, we were able to reproduce exactly what you found. In the scenario you've outlined, where everything works, the 00-vault.hcl and 25-server.hcl files reside in the same directory. As a result, the configuration files are merged as expected.

After looking further, we were able to find the method which consistently demonstrates the regression. Here's a copy of the relevant line of our systemd unit file:

ExecStart=/usr/bin/nomad agent -config /etc/nomad.d/ -config /var/tmpfs/nomad.d/

In the above scenario, we identified that the various configuration files reside in separate directories, e.g. /etc/nomad.d/00-vault.hcl and /var/tmpfs/nomad.d/25-server.hcl. In this case, we get the following:

$ cat /etc/nomad.d/00-vault.hcl
vault {
  address = "https://vault.company.com:8200"
}

$ nomad operator api /v1/agent/self | jq '.config.Vaults'
[
  {
    "Addr": "https://vault.service.consul:8200",
    "AllowUnauthenticated": true,
    "ConnectionRetryIntv": 30000000000,
    "DefaultIdentity": null,
    "Enabled": true,
    "JWTAuthBackendPath": "jwt-nomad",
    "Name": "default",
    "Namespace": "",
    "Role": "nomad-delegated-policy-access",
    "TLSCaFile": "",
    "TLSCaPath": "",
    "TLSCertFile": "",
    "TLSKeyFile": "",
    "TLSServerName": "",
    "TLSSkipVerify": null,
    "TaskTokenTTL": "192h",
    "Token": "<redacted>"
  }
]

The real question now is whether cross-directory merging should even be possible/supported at all. We have made the appropriate adjustments to our configuration scripts to solve the above problem. It was an unexpected consequence of an upgrade from 1.6.3 to 1.7.2.

lgfa29 commented 11 months ago

The real question now is if cross-directory merging should even be possible/supported at all.

Yes, it's supported. From our docs:

-config=<path>: Specifies the path to a configuration file or a directory of configuration files to load. Can be specified multiple times.
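
So multiple -config flags pointing at different directories are expected to merge the same way as files within a single directory, e.g. mirroring the ExecStart line above:

$ nomad agent -config /etc/nomad.d/ -config /var/tmpfs/nomad.d/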

I have been trying to reproduce this problem, but I haven't been able to. Do you see both files being loaded when the agent starts?

Dec 19 17:53:15 vm systemd[1]: Started Nomad.
Dec 19 17:53:15 vm nomad[101393]: ==> WARNING: mTLS is not configured - Nomad is not secure without mTLS!
Dec 19 17:53:15 vm nomad[101393]: ==> WARNING: Bootstrap mode enabled! Potentially unsafe operation.
Dec 19 17:53:15 vm nomad[101393]: ==> Loaded configuration from /etc/nomad.d/00-vault.hcl, /etc/nomad.d/nomad.hcl, /var/tmpfs/nomad.d/25-server.hcl
Dec 19 17:53:15 vm nomad[101393]: ==> Starting Nomad agent...
Dec 19 17:53:29 vm nomad[101393]: ==> Nomad agent configuration:
Dec 19 17:53:29 vm nomad[101393]:        Advertise Addrs: HTTP: 192.168.67.2:4646; RPC: 192.168.67.2:4647; Serf: 192.168.67.2:4648
Dec 19 17:53:29 vm nomad[101393]:             Bind Addrs: HTTP: [0.0.0.0:4646]; RPC: 0.0.0.0:4647; Serf: 0.0.0.0:4648
Dec 19 17:53:29 vm nomad[101393]:                 Client: true
Dec 19 17:53:29 vm nomad[101393]:              Log Level: INFO
Dec 19 17:53:29 vm nomad[101393]:                Node Id: ce2019d1-9fe1-7e5b-6d3d-0d082a48ca0d
Dec 19 17:53:29 vm nomad[101393]:                 Region: global (DC: dc1)
Dec 19 17:53:29 vm nomad[101393]:                 Server: true
Dec 19 17:53:29 vm nomad[101393]:                Version: 1.7.2
Dec 19 17:53:29 vm nomad[101393]: ==> Nomad agent started! Log data will stream in below:

joliver commented 10 months ago

Over the past few days, we've been working through this and haven't been able to isolate the exact cause of the issue. We do observe that moving the /etc/nomad.d/00-vault.hcl file into /var/tmpfs/nomad.d/ resolves the issue. We have investigated whether there are other files overwriting the Vault configuration, but there aren't any. We have further tried to reorder the numeric sequence values of the files themselves. The only reproducible behavior is that when the file crosses the directory boundary, things appear to start working.

For completeness, here's the syslog entry:

Dec 29 19:08:32 nomad-controller nomad[978233]: ==> Loaded configuration from /etc/nomad.d/00-agent.hcl, /etc/nomad.d/00-vault.hcl, /var/tmpfs/nomad.d/10-server.hcl
Dec 29 19:08:32 nomad-controller nomad[978233]: ==> Starting Nomad agent...
Dec 29 19:08:32 nomad-controller nomad[978233]: ==> Nomad agent configuration:
Dec 29 19:08:32 nomad-controller nomad[978233]:        Advertise Addrs: HTTP: REDACTED:4646; RPC: REDACTED:4647; Serf: REDACTED:4648
Dec 29 19:08:32 nomad-controller nomad[978233]:             Bind Addrs: HTTP: [0.0.0.0:4646]; RPC: 0.0.0.0:4647; Serf: 0.0.0.0:4648
Dec 29 19:08:32 nomad-controller nomad[978233]:                 Client: false
Dec 29 19:08:32 nomad-controller nomad[978233]:              Log Level: INFO
Dec 29 19:08:32 nomad-controller nomad[978233]:                Node Id: 50fb93c7-6e04-3b54-2381-e14a01d6c17e
Dec 29 19:08:32 nomad-controller nomad[978233]:                 Region: hetzner-ash-001 (DC: default)
Dec 29 19:08:32 nomad-controller nomad[978233]:                 Server: true
Dec 29 19:08:32 nomad-controller nomad[978233]:                Version: 1.7.2
Dec 29 19:08:32 nomad-controller nomad[978233]: ==> Nomad agent started! Log data will stream in below:
Dec 29 19:08:32 nomad-controller nomad[978233]:     2023-12-29T19:08:32.449Z [INFO]  nomad: setting up raft bolt store: no_freelist_sync=false
Dec 29 19:08:32 nomad-controller nomad[978233]:     2023-12-29T19:08:32.452Z [INFO]  nomad.raft: starting restore from snapshot: id=3235-185838-1703717787721 last-index=185838 last-term=3235 size-in-bytes=3036454

At this point, we have workarounds in place and we are content to say that it's no longer an issue; perhaps we've hit an extreme edge case. We recommend that this issue be closed so that the team can focus on more important matters.

lgfa29 commented 10 months ago

Thanks for the update @joliver. I am going to close this one since we can't seem to reproduce it and you have a workaround in place.

Feel free to reach out if you hit any other problem 🙂