Weirdly, if I'd had to guess I'd assume this regression was introduced in https://github.com/hashicorp/nomad/pull/19349, but that shipped in Nomad 1.7.1.
You're likely correct. I'm working with some others on this and it appears to be from 1.7.1. My initial report was incorrect. That said, we are seeing this behavior in 1.7.2.
Hi @joliver 👋
I am not able to reproduce this problem. Loading the Nomad configuration from multiple files like you describe does result in a merged configuration.
Here's my test setup:
# 00-vault.hcl
vault {
  address = "http://127.0.0.1:8300" # Run Vault on a non-default port for testing
}

# 25-server.hcl
vault {
  enabled          = true
  create_from_role = "nomad-cluster"
}
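A minimal sketch of how the agent might be started for this test (assuming both files live in a local ./nomad.d/ directory; -config also accepts a directory, and -dev just brings up a throwaway single-node agent):
# Assumed layout: ./nomad.d/00-vault.hcl and ./nomad.d/25-server.hcl
$ nomad agent -dev -config ./nomad.d/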
After starting the Nomad agent I can query it for its config and see that all vault blocks have been merged:
$ nomad operator api /v1/agent/self | jq '.config.Vaults'
[
  {
    "Addr": "http://127.0.0.1:8300",
    "AllowUnauthenticated": true,
    "ConnectionRetryIntv": 30000000000,
    "DefaultIdentity": null,
    "Enabled": true,
    "JWTAuthBackendPath": "jwt-nomad",
    "Name": "default",
    "Namespace": "",
    "Role": "nomad-cluster",
    "TLSCaFile": "",
    "TLSCaPath": "",
    "TLSCertFile": "",
    "TLSKeyFile": "",
    "TLSServerName": "",
    "TLSSkipVerify": null,
    "TaskTokenTTL": "",
    "Token": "<redacted>"
  }
]
Trying to run a job also yields the expected error, showing that Nomad is trying to connect to the non-default address set in the configuration file:
2023-12-14T16:26:01.512-0500 [WARN] nomad.vault: failed to contact Vault API: retry=30s error="Get \"http://127.0.0.1:8300/v1/sys/health?drsecondarycode=299&performancestandbycode=299&sealedcode=299&standbycode=299&uninitcode=299\": dial tcp 127.0.0.1:8300: connect: connection refused"
Could you provide additional information?
Thanks!
We are running Nomad v1.7.2 on Ubuntu 22.04 (amd64).
Using the steps you provided above, we were able to reproduce exactly what you found. In the scenario you've outlined where everything works, the 00-vault.hcl and 25-server.hcl files reside in the same directory, so the configuration files are merged as expected.
After looking further, we found a setup that consistently demonstrates the regression. Here's the relevant line of our systemd unit file:
ExecStart=/usr/bin/nomad agent -config /etc/nomad.d/ -config /var/tmpfs/nomad.d/
In the above scenario, the configuration files reside in separate directories, e.g. /etc/nomad.d/00-vault.hcl and /var/tmpfs/nomad.d/25-server.hcl. In this case, we get the following:
$ cat /etc/nomad.d/00-vault.hcl
vault {
  address = "https://vault.company.com:8200"
}

$ nomad operator api /v1/agent/self | jq '.config.Vaults'
[
  {
    "Addr": "https://vault.service.consul:8200",
    "AllowUnauthenticated": true,
    "ConnectionRetryIntv": 30000000000,
    "DefaultIdentity": null,
    "Enabled": true,
    "JWTAuthBackendPath": "jwt-nomad",
    "Name": "default",
    "Namespace": "",
    "Role": "nomad-delegated-policy-access",
    "TLSCaFile": "",
    "TLSCaPath": "",
    "TLSCertFile": "",
    "TLSKeyFile": "",
    "TLSServerName": "",
    "TLSSkipVerify": null,
    "TaskTokenTTL": "192h",
    "Token": "<redacted>"
  }
]
The real question now is if cross-directory merging should even be possible/supported at all. We have made the appropriate adjustments to our configuration scripts to solve the problem above. It was an unexpected consequence of upgrading from 1.6.3 to 1.7.2.
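For reference, a condensed sketch of the split-directory layout that reproduces this for us (the contents of 25-server.hcl are inferred from the merged agent output above and may not match the real file exactly):
# /etc/nomad.d/00-vault.hcl (as shown above)
vault {
  address = "https://vault.company.com:8200"
}
# /var/tmpfs/nomad.d/25-server.hcl (inferred from the merged output; not the verbatim file)
vault {
  enabled          = true
  create_from_role = "nomad-delegated-policy-access"
}
# Agent started against both directories, as in the systemd unit file
$ nomad agent -config /etc/nomad.d/ -config /var/tmpfs/nomad.d/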
The real question now is if cross-directory merging should even be possible/supported at all.
Yes, it's supported. From our docs:
-config=<path>: Specifies the path to a configuration file or a directory of configuration files to load. Can be specified multiple times.
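For example, an invocation like the following (hypothetical paths, mixing a directory with a single extra file) is valid; sources listed later on the command line should be merged over earlier ones:
$ nomad agent -config /etc/nomad.d/ -config /var/tmpfs/nomad.d/25-server.hcl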
I have been trying to reproduce this problem, but I haven't been able to. Do you see both files being loaded when the agent starts?
Dec 19 17:53:15 vm systemd[1]: Started Nomad.
Dec 19 17:53:15 vm nomad[101393]: ==> WARNING: mTLS is not configured - Nomad is not secure without mTLS!
Dec 19 17:53:15 vm nomad[101393]: ==> WARNING: Bootstrap mode enabled! Potentially unsafe operation.
Dec 19 17:53:15 vm nomad[101393]: ==> Loaded configuration from /etc/nomad.d/00-vault.hcl, /etc/nomad.d/nomad.hcl, /var/tmpfs/nomad.d/25-server.hcl
Dec 19 17:53:15 vm nomad[101393]: ==> Starting Nomad agent...
Dec 19 17:53:29 vm nomad[101393]: ==> Nomad agent configuration:
Dec 19 17:53:29 vm nomad[101393]: Advertise Addrs: HTTP: 192.168.67.2:4646; RPC: 192.168.67.2:4647; Serf: 192.168.67.2:4648
Dec 19 17:53:29 vm nomad[101393]: Bind Addrs: HTTP: [0.0.0.0:4646]; RPC: 0.0.0.0:4647; Serf: 0.0.0.0:4648
Dec 19 17:53:29 vm nomad[101393]: Client: true
Dec 19 17:53:29 vm nomad[101393]: Log Level: INFO
Dec 19 17:53:29 vm nomad[101393]: Node Id: ce2019d1-9fe1-7e5b-6d3d-0d082a48ca0d
Dec 19 17:53:29 vm nomad[101393]: Region: global (DC: dc1)
Dec 19 17:53:29 vm nomad[101393]: Server: true
Dec 19 17:53:29 vm nomad[101393]: Version: 1.7.2
Dec 19 17:53:29 vm nomad[101393]: ==> Nomad agent started! Log data will stream in below:
Over the past few days, we've been working through this and haven't been able to isolate the exact cause of the issue. We do observe that moving the /etc/nomad.d/00-vault.hcl file into /var/tmpfs/nomad.d/ resolves the issue. We have investigated whether other files are overwriting the Vault configuration, but there aren't any. We have also tried reordering the numeric prefixes of the files themselves. The only reproducible behavior is that moving the file across the directory boundary makes things start working.
For completeness, here's the syslog entry:
Dec 29 19:08:32 nomad-controller nomad[978233]: ==> Loaded configuration from /etc/nomad.d/00-agent.hcl, /etc/nomad.d/00-vault.hcl, /var/tmpfs/nomad.d/10-server.hcl
Dec 29 19:08:32 nomad-controller nomad[978233]: ==> Starting Nomad agent...
Dec 29 19:08:32 nomad-controller nomad[978233]: ==> Nomad agent configuration:
Dec 29 19:08:32 nomad-controller nomad[978233]: Advertise Addrs: HTTP: REDACTED:4646; RPC: REDACTED:4647; Serf: REDACTED:4648
Dec 29 19:08:32 nomad-controller nomad[978233]: Bind Addrs: HTTP: [0.0.0.0:4646]; RPC: 0.0.0.0:4647; Serf: 0.0.0.0:4648
Dec 29 19:08:32 nomad-controller nomad[978233]: Client: false
Dec 29 19:08:32 nomad-controller nomad[978233]: Log Level: INFO
Dec 29 19:08:32 nomad-controller nomad[978233]: Node Id: 50fb93c7-6e04-3b54-2381-e14a01d6c17e
Dec 29 19:08:32 nomad-controller nomad[978233]: Region: hetzner-ash-001 (DC: default)
Dec 29 19:08:32 nomad-controller nomad[978233]: Server: true
Dec 29 19:08:32 nomad-controller nomad[978233]: Version: 1.7.2
Dec 29 19:08:32 nomad-controller nomad[978233]: ==> Nomad agent started! Log data will stream in below:
Dec 29 19:08:32 nomad-controller nomad[978233]: 2023-12-29T19:08:32.449Z [INFO] nomad: setting up raft bolt store: no_freelist_sync=false
Dec 29 19:08:32 nomad-controller nomad[978233]: 2023-12-29T19:08:32.452Z [INFO] nomad.raft: starting restore from snapshot: id=3235-185838-1703717787721 last-index=185838 last-term=3235 size-in-bytes=3036454
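A quick way to compare the working and non-working states (assuming the systemd unit is named nomad, as in the journal output above) would be:
# Which files did the agent actually load?
$ journalctl -u nomad | grep "Loaded configuration from"
# Which Vault address did the merged configuration end up with?
$ nomad operator api /v1/agent/self | jq '.config.Vaults[0].Addr'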
At this point, we have workarounds in place and are content to treat this as an extreme edge case rather than an open issue. We recommend closing it so the team can focus on more important matters.
Thanks for the update @joliver. I am going to close this one since we can't seem to reproduce it and you have a workaround in place.
Feel free to reach out if you hit any other problems 🙂
@joliver reported in https://github.com/hashicorp/nomad/pull/19439#issuecomment-1856371614 that the fix for #19380 may have introduced a regression in merging configuration files for the vault block: