jdoss opened this issue 2 years ago
Hi @jdoss. When running templates that use Nomad-based functions, the HTTP request doesn't go through the external API; it instead uses a private internal HTTP interface and a custom dialer.
Our e2e test suite that runs nightly includes Nomad service template lookups with TLS enabled, which leads me to believe this issue isn't related to TLS. I would also expect the closed error message on the template failure to mention https/http if it were TLS related.
Do you have logs from the client running the template that fails to render, ideally at debug level? These would include some useful template logging as well as other client-related details. Is it possible there is some form of proxy between the clients and servers, or a localhost firewall, which might be impacting connectivity?
Hey @jrasell thanks for getting back with me.
There isn't any kind of proxy between my clients and servers, or any localhost firewalls. Do your e2e tests use client auth with TLS? Also, I expect it says closed in the error message because port 4646 isn't speaking plain HTTP while Nomad is trying to access the API over http?
[step@nomad-ephemeral-production-7 ~]$ curl "http://127.0.0.1:4646/v1/service/fluentd?namespace=default&stale=&wait=60000ms"
Client sent an HTTP request to an HTTPS server.
[step@nomad-ephemeral-production-7 ~]$ curl -I "http://127.0.0.1:4646/v1/service/fluentd?namespace=default&stale=&wait=60000ms"
HTTP/1.0 400 Bad Request
I will set up some test jobs that register a service with Nomad again and see if I can get better info out of the logs.
Here is output from nomad monitor -log-level=DEBUG on the cluster. I don't think it says much. I will try to figure out how to kick one of the clients into debug logging and pin the job to that client.
2022-06-09T14:50:48.119Z [DEBUG] worker: dequeued evaluation: worker_id=634555b9-89e3-5b06-e987-c608fdda30c0 eval_id=887b7552-86a4-ddf3-5e77-0f62a061ad98 type=service namespace=default job_id=leosc node_id="" triggered_by=job-register
2022-06-09T14:50:48.120Z [DEBUG] worker.service_sched: reconciled current state with desired state: eval_id=887b7552-86a4-ddf3-5e77-0f62a061ad98 job_id=leosc namespace=default worker_id=634555b9-89e3-5b06-e987-c608fdda30c0
results=
| Total changes: (place 1) (destructive 0) (inplace 0) (stop 0) (disconnect 0) (reconnect 0)
| Created Deployment: "91302608-fcb9-fbdf-68ed-8a538d12603c"
| Desired Changes for "ssh": (place 1) (inplace 0) (destructive 0) (stop 0) (migrate 0) (ignore 0) (canary 0)
2022-06-09T14:50:48.133Z [DEBUG] worker: submitted plan for evaluation: worker_id=634555b9-89e3-5b06-e987-c608fdda30c0 eval_id=887b7552-86a4-ddf3-5e77-0f62a061ad98
2022-06-09T14:50:48.133Z [DEBUG] worker.service_sched: setting eval status: eval_id=887b7552-86a4-ddf3-5e77-0f62a061ad98 job_id=leosc namespace=default worker_id=634555b9-89e3-5b06-e987-c608fdda30c0 status=complete
2022-06-09T14:50:48.149Z [DEBUG] worker: updated evaluation: worker_id=634555b9-89e3-5b06-e987-c608fdda30c0 eval="<Eval \"887b7552-86a4-ddf3-5e77-0f62a061ad98\" JobID: \"leosc\" Namespace: \"default\">"
2022-06-09T14:50:48.149Z [DEBUG] worker: ack evaluation: worker_id=634555b9-89e3-5b06-e987-c608fdda30c0 eval_id=887b7552-86a4-ddf3-5e77-0f62a061ad98 type=service namespace=default job_id=leosc node_id="" triggered_by=job-register
2022-06-09T14:50:48.155Z [DEBUG] http: request complete: method=GET path=/v1/evaluation/887b7552-86a4-ddf3-5e77-0f62a061ad98 duration="248.858µs"
2022-06-09T14:50:48.194Z [DEBUG] http: request complete: method=GET path=/v1/evaluation/887b7552-86a4-ddf3-5e77-0f62a061ad98/allocations duration="272.923µs"
2022-06-09T14:50:48.232Z [DEBUG] http: request complete: method=GET path="/v1/deployment/91302608-fcb9-fbdf-68ed-8a538d12603c?stale=&wait=2000ms" duration="438.06µs"
2022-06-09T14:50:48.501Z [DEBUG] nomad.client: adding evaluations for rescheduling failed allocations: num_evals=1
Well this is wild. I picked a random client node, added log_level = "DEBUG" to the config, and restarted Nomad. Now things work.
I then rebooted all of the other client nodes and now they are allocating the job just fine and rendering the template using {{ range nomadService "fluentd" }}. 😕
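For reference, a template stanza using this lookup might look roughly like the following. This is a hypothetical minimal example; the destination path and rendered fields are illustrative, not the actual job from this cluster:

```hcl
template {
  destination = "local/fluent-bit.conf"
  data        = <<EOF
{{ range nomadService "fluentd" }}
  Host {{ .Address }}
  Port {{ .Port }}
{{ end }}
EOF
}
```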
I am going to drain all of my clients, relaunch them with log_level = "DEBUG" from the start, and see if I can get anything useful.
Ok, I got something useful. I thought I was taking crazy pills because I relaunched my client nodes with log_level = "DEBUG" enabled and my job started without issues using Nomad service discovery. I then stopped the job to remove the debug logging on one of the nodes and started it again. The job then failed on a different node that still had debug logging enabled, and I got some good logs.
You can see at the start of the logs that it is accessing the API and the task started just fine. Then the working task stops at 17:41:22:
Jun 09 17:41:19 nomad-ephemeral-production-1 nomad[1439]: 2022-06-09T17:41:19.692Z [DEBUG] http: request complete: method=GET path="/v1/service/fluentd?index=10191&namespace=default&stale=&wait=60000ms" duration=1m2.9762357s
Jun 09 17:41:22 nomad-ephemeral-production-1 nomad[1439]: 2022-06-09T17:41:22.240Z [DEBUG] http: request complete: method=GET path="/v1/service/fluentd?index=10191&namespace=default&stale=&wait=60000ms" duration=1m0.290887505s
Jun 09 17:41:26 nomad-ephemeral-production-1 nomad[1439]: 2022-06-09T17:41:26.058Z [DEBUG] http: request complete: method=GET path="/v1/service/fluentd?index=10191&namespace=default&stale=&wait=60000ms" duration=1m2.841628849s
Jun 09 17:41:26 nomad-ephemeral-production-1 nomad[1439]: 2022-06-09T17:41:26.095Z [DEBUG] http: request complete: method=GET path="/v1/service/fluentd?index=10191&namespace=default&stale=&wait=60000ms" duration=1m2.791039609s
Jun 09 17:41:30 nomad-ephemeral-production-1 nomad[1439]: 2022-06-09T17:41:30.705Z [DEBUG] http: request complete: method=GET path="/v1/service/fluentd?index=10191&namespace=default&stale=&wait=60000ms" duration=1m3.21155867s
Jun 09 17:41:30 nomad-ephemeral-production-1 nomad[1439]: 2022-06-09T17:41:30.928Z [DEBUG] http: request complete: method=GET path="/v1/service/fluentd?index=10191&namespace=default&stale=&wait=60000ms" duration=1m1.972011845s
Jun 09 17:42:09 nomad-ephemeral-production-1 nomad[1439]: 2022-06-09T17:42:09.368Z [DEBUG] client: updated allocations: index=11114 total=6 pulled=6 filtered=0
Jun 09 17:42:09 nomad-ephemeral-production-1 nomad[1439]: 2022-06-09T17:42:09.368Z [DEBUG] client: allocation updates: added=0 removed=0 updated=6 ignored=0
Jun 09 17:42:09 nomad-ephemeral-production-1 nomad[1439]: 2022-06-09T17:42:09.371Z [INFO] client.driver_mgr.nomad-driver-podman: Stopping task: driver=podman @module=podman signal="" taskID=63d785bd-37c0-bcb9-11d5-e800ffe2e820/ssh/05a836de timestamp=2022-06-09T17:42:09.371Z
Jun 09 17:42:09 nomad-ephemeral-production-1 nomad[1439]: 2022-06-09T17:42:09.373Z [INFO] client.driver_mgr.nomad-driver-podman: Stopping task: driver=podman signal="" taskID=7fa33f82-02b0-95af-88a4-1ffeb6d1a23f/ssh/f8a0c62c @module=podman timestamp=2022-06-09T17:42:09.373Z
Jun 09 17:42:09 nomad-ephemeral-production-1 nomad[1439]: 2022-06-09T17:42:09.375Z [INFO] client.driver_mgr.nomad-driver-podman: Stopping task: driver=podman @module=podman signal="" taskID=dd408cea-c1a8-9af0-6582-1a3773ca34ab/ssh/b093b975 timestamp=2022-06-09T17:42:09.375Z
Jun 09 17:42:09 nomad-ephemeral-production-1 nomad[1439]: 2022-06-09T17:42:09.381Z [INFO] client.driver_mgr.nomad-driver-podman: Stopping task: driver=podman @module=podman signal="" taskID=f35e2208-9e0c-a216-aad4-e935193caec4/ssh/1d5c7bfe timestamp=2022-06-09T17:42:09.381Z
Jun 09 17:42:09 nomad-ephemeral-production-1 nomad[1439]: 2022-06-09T17:42:09.389Z [INFO] client.driver_mgr.nomad-driver-podman: Stopping task: driver=podman @module=podman signal="" taskID=21a070ba-d041-8ede-f7d8-44ce6ba7e2bd/ssh/009131d9 timestamp=2022-06-09T17:42:09.389Z
I have automation in place that regenerates the TLS certs from my CA on a timer and then reloads Nomad. You can see the reload here:
Jun 09 17:45:21 nomad-ephemeral-production-1 systemd[1]: Reloading nomad.service - Nomad...
Jun 09 17:45:21 nomad-ephemeral-production-1 nomad[1439]: ==> Caught signal: hangup
Jun 09 17:45:21 nomad-ephemeral-production-1 nomad[1439]: ==> Reloading configuration...
Jun 09 17:45:21 nomad-ephemeral-production-1 systemd[1]: Reloaded nomad.service - Nomad.
Jun 09 17:45:21 nomad-ephemeral-production-1 nomad[1439]: 2022-06-09T17:45:21.868Z [DEBUG] agent: starting reload of agent config
Jun 09 17:45:21 nomad-ephemeral-production-1 nomad[1439]: 2022-06-09T17:45:21.869Z [DEBUG] agent: starting reload of client config
Jun 09 17:45:21 nomad-ephemeral-production-1 nomad[1439]: 2022-06-09T17:45:21.869Z [INFO] agent: reloading HTTP server with new TLS configuration
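Such renewal automation might look roughly like the following pair of systemd units. The unit names and the renewal command are assumptions inferred from the `step` CLI prompt and the reload logs above, not the actual automation used here:

```ini
# /etc/systemd/system/nomad-cert-renew.service (hypothetical)
[Unit]
Description=Renew Nomad TLS certificates

[Service]
Type=oneshot
# Assumed step-ca renewal invocation; --force overwrites in place
ExecStart=/usr/bin/step ca renew --force /etc/nomad/tls/nomad.crt /etc/nomad/tls/nomad.key
# SIGHUP via reload makes the agent pick up the new TLS configuration
ExecStartPost=/usr/bin/systemctl reload nomad.service

# /etc/systemd/system/nomad-cert-renew.timer (hypothetical)
[Unit]
Description=Renew Nomad TLS certificates every 12 hours

[Timer]
OnUnitActiveSec=12h

[Install]
WantedBy=timers.target
```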
and then at 17:51:26 I start the job and it fails because it cannot access the Nomad Service API:
Jun 09 17:51:27 nomad-ephemeral-production-1 nomad[1439]: 2022-06-09T17:51:27.139Z [DEBUG] agent: (runner) rendering "(dynamic)" => "/opt/nomad/data/alloc/0b04b5cf-f7c3-fb16-2042-ef8a06fb3159/ssh/local/fluent-bit.conf"
Jun 09 17:51:27 nomad-ephemeral-production-1 nomad[1439]: 2022-06-09T17:51:27.145Z [INFO] agent: (runner) rendered "(dynamic)" => "/opt/nomad/data/alloc/4715790d-e447-e464-aa3d-232be4a04e6f/ssh/local/fluent-bit.conf"
Jun 09 17:51:27 nomad-ephemeral-production-1 nomad[1439]: 2022-06-09T17:51:27.145Z [DEBUG] agent: (runner) checking template e410d2764b6e6460f1f327d1fe83d5b3
Jun 09 17:51:27 nomad-ephemeral-production-1 nomad[1439]: 2022-06-09T17:51:27.145Z [DEBUG] agent: (runner) missing data for 1 dependencies
Jun 09 17:51:27 nomad-ephemeral-production-1 nomad[1439]: 2022-06-09T17:51:27.146Z [DEBUG] agent: (runner) missing dependency: nomad.service(fluentd)
Jun 09 17:51:27 nomad-ephemeral-production-1 nomad[1439]: 2022-06-09T17:51:27.146Z [DEBUG] agent: (runner) add used dependency nomad.service(fluentd) to missing since isLeader but do not have a watcher
Jun 09 17:51:27 nomad-ephemeral-production-1 nomad[1439]: 2022-06-09T17:51:27.146Z [DEBUG] agent: (runner) was not watching 1 dependencies
Jun 09 17:51:27 nomad-ephemeral-production-1 nomad[1439]: 2022-06-09T17:51:27.146Z [DEBUG] agent: (watcher) adding nomad.service(fluentd)
Jun 09 17:51:27 nomad-ephemeral-production-1 nomad[1439]: 2022-06-09T17:51:27.146Z [DEBUG] agent: (runner) diffing and updating dependencies
Jun 09 17:51:27 nomad-ephemeral-production-1 nomad[1439]: 2022-06-09T17:51:27.146Z [DEBUG] agent: (runner) watching 1 dependencies
Jun 09 17:51:27 nomad-ephemeral-production-1 nomad[1439]: 2022-06-09T17:51:27.147Z [ERROR] agent: (runner) watcher reported error: nomad.service(fluentd): Get "http://127.0.0.1:4646/v1/service/fluentd?namespace=default&stale=&wait=60000ms": closed
Maybe the service discovery feature is not reloading correctly to use the new TLS certs?
Hi @jdoss, and thanks for the extra detail. There is a lot to unpack and comment on, so I'll try to do this in order, quoting from your comments.
Does your e2e tests use client auth with TLS?
Yes, the configuration file is available within this repo and matches your setup.
I expect it is saying closed in the error message because the port 4646 isn't speaking http and nomad is trying to access the API over http?
As you show in your example, a well-behaved HTTP API responds with Client sent an HTTP request to an HTTPS server when a client makes a non-HTTPS request to an HTTPS-enabled endpoint. The closed message comes from net.ErrClosed, which indicates the template process is making a network call on an already closed connection.
Maybe the service discovery feature is not reloading correctly to use the new TLS certs?
In my previous comment I mentioned how the template runner uses a custom listener and dialer which tie into the RPC layer and are unrelated to the public API where the TLS certificates are used. I therefore still believe the TLS aspect to be a red herring.
That being said, when reloading the Nomad agent a number of actions occur, sometimes including connection pool updates for the client -> server RPC connection. Along with the "closed" message being seen, I believe this is the area where the bug exists. My hypothesis is that we are reloading the RPC connection pool without informing the custom dialer of the change, which means it is using a closed connection.
I'll mark this issue as needing a little further investigation to solidify the cause and put it onto our backlog board for roadmapping. Thanks again for your detailed troubleshooting.
No problem @jrasell! I am glad I was able to help and thanks for responding in such detail as well. 😃
Hey @jrasell any update on getting this sorted out? I'd like to move away from Consul service discovery over to Nomad service discovery so I can use it with traefik's new Nomad integration. https://traefik.io/blog/traefik-proxy-fully-integrates-with-hashicorp-nomad/
Hi @jdoss, the team have unfortunately been busy on other prioritised work. We are doing some roadmapping in the coming weeks though, so I will keep this issue in mind and add any updates as they are known.
Hey @jrasell @tgross @lgfa29, I was just submitting a panic issue I found on Nomad v1.5.0-beta.1 and I remembered this issue, which is related to using short-lived TLS certs. I will do some testing on v1.5.0-beta.1 to see if the issues with Nomad service discovery still exist and report back.
Any chance this issue can get some eyes on it after the 1.5.0 GA dust settles if things are still wonky?
I just wanted to follow up on this issue. I believe this is still a problem, and it impacts other parts of Nomad such as the Vault integration.
I am no longer seeing the issue within the Nomad service discovery feature. The original issue seems to be fixed, as the internal RPC connection pool for Nomad service discovery now appears to reload with the new TLS cert.
This is, however, still an issue with the Vault integration. I use the same CA for my Nomad and Vault TLS. I use short-lived TLS certs (valid for 24h) and renew them on a systemd timer every 12h. My Vault cluster currently has a valid TLS cert, and I can access Vault from the server that is running Nomad.
Valid TLS cert on Vault
# step certificate inspect https://xxx.xxx.xxx.xxx:8200
Certificate:
Data:
Version: 3 (0x2)
Serial Number: 211686295302191930181129435020294053738 (0x9f414c288f3c39b68dccb35fc0135f6a)
Signature Algorithm: ECDSA-SHA256
Issuer: CN=Internal CA Intermediate CA
Validity
Not Before: Apr 14 09:07:44 2023 UTC
Not After : Apr 15 09:08:44 2023 UTC
I can use mTLS to access Vault from the Nomad server using the Nomad certs:
# curl --key /etc/nomad/tls/nomad.key --cert /etc/nomad/tls/nomad.crt --cacert /etc/nomad/tls/nomad-ca.crt https://xxx.xxx.xxx.xxx:8200/
<a href="/ui/">Temporary Redirect</a>.
Vault is enabled in Nomad
vault {
enabled = true
ca_file = "/etc/nomad/tls/nomad-ca.crt"
cert_file = "/etc/nomad/tls/nomad.crt"
key_file = "/etc/nomad/tls/nomad.key"
address = "https://xxx.xxx.xxx.xxx:8200"
token = "tokentokentoken"
create_from_role = "nomad-cluster"
}
I am seeing this in my Nomad logs
# journalctl -u nomad.service -f
Apr 14 17:58:04 nomad[41606]: 2023-04-14T17:58:04.527Z [WARN] nomad.vault: got error or bad auth, so backing off: error="failed to renew the vault token: Put \"https://xxx.xxx.xxx.xxx:8200/v1/auth/token/renew-self\": remote error: tls: bad certificate" recoverable=true
Apr 14 17:58:04 nomad[41606]: 2023-04-14T17:58:04.527Z [INFO] nomad.vault: backing off renewal: retry=1m54s
Apr 14 18:00:02 nomad[41606]: 2023-04-14T18:00:02.124Z [WARN] nomad.vault: got error or bad auth, so backing off: error="failed to renew the vault token: Put \"https://xxx.xxx.xxx.xxx:8200/v1/auth/token/renew-self\": remote error: tls: bad certificate" recoverable=true
Apr 14 18:00:02 nomad[41606]: 2023-04-14T18:00:02.125Z [INFO] nomad.vault: backing off renewal: retry=1m58s
Apr 14 18:02:04 nomad[41606]: 2023-04-14T18:02:04.294Z [WARN] nomad.vault: got error or bad auth, so backing off: error="failed to renew the vault token: Put \"https://xxx.xxx.xxx.xxx:8200/v1/auth/token/renew-self\": remote error: tls: bad certificate" recoverable=true
Apr 14 18:02:04 nomad[41606]: 2023-04-14T18:02:04.294Z [INFO] nomad.vault: backing off renewal: retry=1m55s
Apr 14 18:04:02 nomad[41606]: 2023-04-14T18:04:02.944Z [WARN] nomad.vault: got error or bad auth, so backing off: error="failed to renew the vault token: Put \"https://xxx.xxx.xxx.xxx:8200/v1/auth/token/renew-self\": remote error: tls: bad certificate" recoverable=true
Apr 14 18:04:02 nomad[41606]: 2023-04-14T18:04:02.944Z [INFO] nomad.vault: backing off renewal: retry=1m39s
Apr 14 18:05:45 nomad[41606]: 2023-04-14T18:05:45.413Z [WARN] nomad.vault: got error or bad auth, so backing off: error="failed to renew the vault token: Put \"https://xxx.xxx.xxx.xxx:8200/v1/auth/token/renew-self\": remote error: tls: bad certificate" recoverable=true
Apr 14 18:05:45 nomad[41606]: 2023-04-14T18:05:45.413Z [INFO] nomad.vault: backing off renewal: retry=1m39s
You can see here that Nomad gets reloaded after the TLS cert renews (agent: reloading HTTP server with new TLS configuration), but the old expired cert is still being used, and mTLS to the Vault endpoint fails.
Apr 14 09:13:53 systemd[1]: Reloading nomad.service - Nomad...
Apr 14 09:13:53 nomad[41606]: ==> Caught signal: hangup
Apr 14 09:13:53 nomad[41606]: ==> Reloading configuration...
Apr 14 09:13:53 systemd[1]: Reloaded nomad.service - Nomad.
Apr 14 09:13:53 nomad[41606]: ==> WARNING: Number of bootstrap servers should ideally be set to an odd number.
Apr 14 09:13:53 nomad[41606]: 2023-04-14T09:13:53.286Z [INFO] nomad: reloading server connections due to configuration changes
Apr 14 09:13:53 nomad[41606]: 2023-04-14T09:13:53.287Z [ERROR] nomad.rpc: failed to accept RPC conn: error="accept tcp xxx.xxx.xxx.xxx:4647: use of closed network connection" delay=5ms
Apr 14 09:13:53 nomad[41606]: 2023-04-14T09:13:53.287Z [INFO] nomad.rpc: closing server RPC connection
Apr 14 09:13:53 nomad[41606]: 2023-04-14T09:13:53.288Z [INFO] client.fingerprint_mgr: reloading fingerprinter: fingerprinter=cni
Apr 14 09:13:53 nomad[41606]: 2023-04-14T09:13:53.288Z [ERROR] client.rpc: error performing RPC to server: error="rpc error: EOF" rpc=Node.GetClientAllocs server=xxx.xxx.xxx.xxx:4647
Apr 14 09:13:53 nomad[41606]: 2023-04-14T09:13:53.288Z [INFO] agent: reloading HTTP server with new TLS configuration
Apr 14 09:13:53 nomad[41606]: 2023-04-14T09:13:53.515Z [INFO] nomad.rpc: closing server RPC connection
Apr 14 09:13:53 nomad[41606]: 2023-04-14T09:13:53.613Z [INFO] nomad.rpc: closing server RPC connection
Apr 14 09:13:53 nomad[41606]: 2023-04-14T09:13:53.615Z [INFO] nomad.rpc: closing server RPC connection
Apr 14 09:14:09 nomad[41606]: 2023-04-14T09:14:09.106Z [INFO] nomad.rpc: closing server RPC connection
Apr 14 09:14:58 nomad[41606]: 2023-04-14T09:14:58.198Z [WARN] nomad.vault: got error or bad auth, so backing off: error="failed to renew the vault token: Put \"https://xxx.xxx.xxx.xxx:8200/v1/auth/token/renew-self\": remote error: tls: bad certificate" recoverable=true
Apr 14 09:14:58 nomad[41606]: 2023-04-14T09:14:58.198Z [INFO] nomad.vault: backing off renewal: retry=1m48s
Apr 14 09:16:50 nomad[41606]: 2023-04-14T09:16:50.265Z [WARN] nomad.vault: got error or bad auth, so backing off: error="failed to renew the vault token: Put \"https://xxx.xxx.xxx.xxx:8200/v1/auth/token/renew-self\": remote error: tls: bad certificate" recoverable=true
Apr 14 09:16:50 nomad[41606]: 2023-04-14T09:16:50.265Z [INFO] nomad.vault: backing off renewal: retry=1m34s
Apr 14 09:18:27 nomad[41606]: 2023-04-14T09:18:27.729Z [WARN] nomad.vault: got error or bad auth, so backing off: error="failed to renew the vault token: Put \"https://xxx.xxx.xxx.xxx:8200/v1/auth/token/renew-self\": remote error: tls: bad certificate" recoverable=true
Apr 14 09:18:27 nomad[41606]: 2023-04-14T09:18:27.729Z [INFO] nomad.vault: backing off renewal: retry=1m2s
Apr 14 09:19:33 nomad[41606]: 2023-04-14T09:19:33.818Z [WARN] nomad.vault: got error or bad auth, so backing off: error="failed to renew the vault token: Put \"https://xxx.xxx.xxx.xxx:8200/v1/auth/token/renew-self\": remote error: tls: bad certificate" recoverable=true
Apr 14 09:19:33 nomad[41606]: 2023-04-14T09:19:33.819Z [INFO] nomad.vault: backing off renewal: retry=1m20s
Apr 14 09:20:57 nomad[41606]: 2023-04-14T09:20:57.561Z [WARN] nomad.vault: got error or bad auth, so backing off: error="failed to renew the vault token: Put \"https://xxx.xxx.xxx.xxx:8200/v1/auth/token/renew-self\": remote error: tls: bad certificate" recoverable=true
Apr 14 09:20:57 nomad[41606]: 2023-04-14T09:20:57.561Z [INFO] nomad.vault: backing off renewal: retry=1m44s
If I restart Nomad (not reload) with systemctl restart nomad.service, it no longer gets a remote error: tls: bad certificate error, but the Vault token has expired since it could not be renewed.
Apr 14 18:21:43 systemd[1]: Stopped nomad.service - Nomad.
Apr 14 18:21:43 systemd[1]: nomad.service: Consumed 11h 35min 26.463s CPU time.
Apr 14 18:21:44 systemd[1]: Starting nomad.service - Nomad...
Apr 14 18:21:44 systemd[1]: Started nomad.service - Nomad.
Apr 14 18:21:44 nomad[119017]: ==> WARNING: Number of bootstrap servers should ideally be set to an odd number.
Apr 14 18:21:44 nomad[119017]: ==> Loaded configuration from /etc/nomad/config/nomad.hcl
Apr 14 18:21:44 nomad[119017]: ==> Starting Nomad agent...
Apr 14 18:21:49 nomad[119017]: ==> Nomad agent configuration:
Apr 14 18:21:49 nomad[119017]: Advertise Addrs: HTTP: xxx.xxx.xxx.xxx:4646; RPC: xxx.xxx.xxx.xxx:4647; Serf: xxx.xxx.xxx.xxx:4648
Apr 14 18:21:49 nomad[119017]: Bind Addrs: HTTP: [xxx.xxx.xxx.xxx:4646 127.0.0.1:4646]; RPC: xxx.xxx.xxx.xxx:4647; Serf: xxx.xxx.xxx.xxx:4648
Apr 14 18:21:49 nomad[119017]: Client: true
Apr 14 18:21:49 nomad[119017]: Log Level: INFO
Apr 14 18:21:49 nomad[119017]: Region: us-central (DC: central)
Apr 14 18:21:49 nomad[119017]: Server: true
Apr 14 18:21:49 nomad[119017]: Version: 1.5.2
Apr 14 18:21:49 nomad[119017]: ==> Nomad agent started! Log data will stream in below:
Apr 14 18:21:49 nomad[119017]: 2023-04-14T18:21:44.629Z [INFO] nomad: setting up raft bolt store: no_freelist_sync=false
Apr 14 18:21:49 nomad[119017]: 2023-04-14T18:21:44.632Z [INFO] nomad.raft: starting restore from snapshot: id=45-89569-1681240368076 last-index=89569 last-term=45 size-in-bytes=610130
Apr 14 18:21:49 nomad[119017]: 2023-04-14T18:21:44.653Z [INFO] nomad.raft: snapshot restore progress: id=45-89569-1681240368076 last-index=89569 last-term=45 size-in-bytes=610130 read-bytes=610130 percent-complete="100.00%"
Apr 14 18:21:49 nomad[119017]: 2023-04-14T18:21:44.653Z [INFO] nomad.raft: restored from snapshot: id=45-89569-1681240368076 last-index=89569 last-term=45 size-in-bytes=610130
Apr 14 18:21:49 nomad[119017]: 2023-04-14T18:21:44.719Z [ERROR] nomad.vault: failed to validate self token/role: retry=30s
Apr 14 18:21:49 nomad[119017]: error=
Apr 14 18:21:49 nomad[119017]: | failed to lookup Vault periodic token: Error making API request.
Apr 14 18:21:49 nomad[119017]: |
Apr 14 18:21:49 nomad[119017]: | URL: POST https://xxx.xxx.xxx.xxx:8200/v1/auth/token/lookup
Apr 14 18:21:49 nomad[119017]: | Code: 403. Errors:
Apr 14 18:21:49 nomad[119017]: |
Apr 14 18:21:49 nomad[119017]: | * permission denied
Apr 14 18:21:49 nomad[119017]:
This issue is pretty glaring with short-lived TLS certs, but folks using longer-lived TLS certs will most certainly run into it as well when they renew their certs and reload Nomad, causing issues and downtime for Nomad jobs that depend on Vault.
Nomad version
Operating system and Environment details
Fedora CoreOS Stable
Issue
I have run the Nomad TLS gauntlet and have my cluster running with full TLS for clients and agents. I cannot use Nomad service discovery because it does not use https in the Nomad API URL and does not use TLS to authenticate to the Nomad API. Thankfully, I also ran the Consul TLS gauntlet, so I can fall back to Consul for service discovery.
Here is the configuration from one of the nodes:
Reproduction steps
My jobs render templates like this to get the service IPs and ports of the registered job below. The job fails, because I require TLS, with this error:
Expected Result
A template that renders and a job that starts.
Actual Result
Tears and sorrow.
Job file (if appropriate)
The job that is registering services with Nomad: