hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

mTLS/reload support in Nomad Service Discovery #13303

Open jdoss opened 2 years ago

jdoss commented 2 years ago

Nomad version

$ nomad version
Nomad v1.3.1 (2b054e38e91af964d1235faa98c286ca3f527e56)

Operating system and Environment details

Fedora CoreOS Stable

Issue

I have run the Nomad TLS gauntlet and have my cluster running with full TLS for clients and servers. I cannot use Nomad Service Discovery because it is not using https for the Nomad API URL and it is not using TLS to authenticate to the Nomad API. Thankfully, I also ran the Consul TLS gauntlet, so I can fall back to Consul for service discovery.

Here is the configuration from one of the nodes:

datacenter = "us-central1"
region = "us"
data_dir = "/opt/nomad/data"
bind_addr = "0.0.0.0"

tls {
  http = true
  rpc = true

  ca_file = "/etc/nomad/tls/nomad-ca.crt"
  cert_file = "/etc/nomad/tls/nomad.crt"
  key_file = "/etc/nomad/tls/nomad.key"

  verify_server_hostname = true
  verify_https_client    = true
}

consul {
  address = "127.0.0.1:8501"
  ca_file    = "/etc/consul/tls/consul-ca.crt"
  cert_file  = "/etc/consul/tls/consul.crt"
  key_file   = "/etc/consul/tls/consul.key"
  ssl        = true
  verify_ssl = true

  token   = "shhhhhhhhh"
}
client {
  enabled = true
  meta {
    tags = "nomad-client"
  }
  options = {
    "driver.denylist" = "docker"
  }
  server_join {
    retry_join = ["provider=gce project_name=noooomaddddd zone_pattern=us-central1-.*"]
    retry_max = 3
    retry_interval = "30s"
  }
}
plugin "qemu" {
  config {
    image_paths = ["/opt/nomad/images"]
  }
}
plugin "nomad-driver-podman" {
  config {
    socket_path = "unix://var/run/podman/podman.sock"
    volumes {
      enabled      = true
      selinuxlabel = "z"
    }
  }
}

Reproduction steps

I have other jobs that render templates like this one

template {
        change_mode = "noop"
        destination = "local/fluent-bit.conf"
        data = <<EOF
[SERVICE]
    flush        5
    daemon       Off
    log_level    info

[INPUT]
    Name        tcp
    Listen      127.0.0.1
    Port        5170
    Chunk_Size  32
    Buffer_Size 64
    Format      json

[FILTER]
    Name record_modifier
    Match *
    Record hostname ${HOSTNAME}

[UPSTREAM]
    name       forward-balancing

{{ range nomadService "fluentd" }}
[NODE]
    name       {{ env "NOMAD_SHORT_ALLOC_ID" }}
    host       {{ .Address }}
    port       {{ .Port }}
{{ end }}

[OUTPUT]
    Name          forward
    Match         *
    Upstream      forward-balancing
        EOF
      }

to get the service IPs and ports of the job registered below. The job fails with this error because I require TLS:

Jun 08, '22 21:08:24 -0500  Killing Template failed: nomad.service(fluentd): Get "http://127.0.0.1:4646/v1/service/fluentd?namespace=default&stale=&wait=60000ms": closed

Expected Result

A template that renders and a job that starts.

Actual Result

Tears and sorrow.

Job file (if appropriate)

The job that is registering services with Nomad:

job "logging" {
  datacenters = ["us-central1"]
  type = "service"

  constraint {
    attribute    = "${meta.tags}"
    set_contains = "nomad-client"
  }

  update {
    stagger = "10s"
    max_parallel = 1
  }

  group "aggregation" {
    count = 3
    restart {
      attempts = 10
      interval = "5m"
      delay = "25s"
      mode = "delay"
    }
    network {
      port "fluentd" { to =  24224 }
      port "fluentd-monitor" { to =  24220 }
    }
    service {
      name = "fluentd"
      tags = ["logging", "fluentd"]
      provider = "nomad"
      port = "fluentd"
    }
    task "fluentd" {

      driver = "podman"

      config {
        image = "docker.io/fluent/fluentd:latest"
        volumes = [
          "local/fluent.conf:/fluentd/etc/fluent.conf"
        ]
        ports = ["fluentd", "fluentd-monitor"]
      }
      resources {
        cpu    = 1024
        memory = 1024
        memory_max = 2048
      }

      template {
        change_mode = "restart"
        destination = "local/fluent.conf"
        data =  <<EOF
<source>
  @type forward
  port 24224
</source>

<match *.**>
  @type stdout
</match>

<source>
  @type monitor_agent
  port 24220
</source>
        EOF
      }
    }
  }
}
jrasell commented 2 years ago

Hi @jdoss; when running templates that use Nomad-based functions, the HTTP request doesn't use the external API and instead uses a private internal HTTP interface and a custom dialer.
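
To illustrate the general shape of that (a minimal Go sketch, not Nomad's actual code; the handler, JSON payload, and addresses are invented), the http://127.0.0.1:4646 URL in such an error is essentially a placeholder, because the custom dialer decides where the bytes actually go:

package main

import (
	"context"
	"fmt"
	"io"
	"net"
	"net/http"
	"net/http/httptest"
)

func main() {
	// Stand-in for a private internal HTTP interface (invented response body).
	internal := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, `[{"ServiceName":"fluentd","Address":"10.0.0.5","Port":24224}]`)
	}))
	defer internal.Close()

	// An http.Client with a custom dialer: whatever host/port appears in the
	// request URL, the connection is made to the internal listener instead.
	client := &http.Client{
		Transport: &http.Transport{
			DialContext: func(ctx context.Context, network, addr string) (net.Conn, error) {
				return net.Dial("tcp", internal.Listener.Addr().String())
			},
		},
	}

	// The URL below never reaches the public, TLS-enabled API; only its path matters.
	resp, err := client.Get("http://127.0.0.1:4646/v1/service/fluentd")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status, string(body))
}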

Our e2e test suite that runs nightly includes Nomad service template lookups and has TLS enabled, which leads me to believe this issue isn't related to TLS. I would also expect the closed error message on the template failure to mention https/http if it were TLS related.

Do you have logs from the client running the template that fails to render, ideally at debug level? These would include some useful template logging as well as other client-related details. Is it possible there is some form of proxy between the clients and servers, or local host firewalls, that might be impacting connectivity?

jdoss commented 2 years ago

Hey @jrasell, thanks for getting back to me.

There isn't any kind of proxy between my clients and servers or any localhost firewalls. Do your e2e tests use client auth with TLS? Also, I expect it says closed in the error message because port 4646 isn't speaking plain HTTP and Nomad is trying to access the API over http?

[step@nomad-ephemeral-production-7 ~]$ curl "http://127.0.0.1:4646/v1/service/fluentd?namespace=default&stale=&wait=60000ms"
Client sent an HTTP request to an HTTPS server.
[step@nomad-ephemeral-production-7 ~]$ curl -I "http://127.0.0.1:4646/v1/service/fluentd?namespace=default&stale=&wait=60000ms"
HTTP/1.0 400 Bad Request

I will set up some test jobs that register a service with Nomad again and see if I can get better info out of the logs.

jdoss commented 2 years ago

Here is output from nomad monitor -log-level=DEBUG on the cluster. I don't think it says much. I will try to figure out how to kick one of the clients into debug logging and pin the job to that client.

2022-06-09T14:50:48.119Z [DEBUG] worker: dequeued evaluation: worker_id=634555b9-89e3-5b06-e987-c608fdda30c0 eval_id=887b7552-86a4-ddf3-5e77-0f62a061ad98 type=service namespace=default job_id=leosc node_id="" triggered_by=job-register
2022-06-09T14:50:48.120Z [DEBUG] worker.service_sched: reconciled current state with desired state: eval_id=887b7552-86a4-ddf3-5e77-0f62a061ad98 job_id=leosc namespace=default worker_id=634555b9-89e3-5b06-e987-c608fdda30c0
  results=
  | Total changes: (place 1) (destructive 0) (inplace 0) (stop 0) (disconnect 0) (reconnect 0)
  | Created Deployment: "91302608-fcb9-fbdf-68ed-8a538d12603c"
  | Desired Changes for "ssh": (place 1) (inplace 0) (destructive 0) (stop 0) (migrate 0) (ignore 0) (canary 0)

2022-06-09T14:50:48.133Z [DEBUG] worker: submitted plan for evaluation: worker_id=634555b9-89e3-5b06-e987-c608fdda30c0 eval_id=887b7552-86a4-ddf3-5e77-0f62a061ad98
2022-06-09T14:50:48.133Z [DEBUG] worker.service_sched: setting eval status: eval_id=887b7552-86a4-ddf3-5e77-0f62a061ad98 job_id=leosc namespace=default worker_id=634555b9-89e3-5b06-e987-c608fdda30c0 status=complete
2022-06-09T14:50:48.149Z [DEBUG] worker: updated evaluation: worker_id=634555b9-89e3-5b06-e987-c608fdda30c0 eval="<Eval \"887b7552-86a4-ddf3-5e77-0f62a061ad98\" JobID: \"leosc\" Namespace: \"default\">"
2022-06-09T14:50:48.149Z [DEBUG] worker: ack evaluation: worker_id=634555b9-89e3-5b06-e987-c608fdda30c0 eval_id=887b7552-86a4-ddf3-5e77-0f62a061ad98 type=service namespace=default job_id=leosc node_id="" triggered_by=job-register
2022-06-09T14:50:48.155Z [DEBUG] http: request complete: method=GET path=/v1/evaluation/887b7552-86a4-ddf3-5e77-0f62a061ad98 duration="248.858µs"
2022-06-09T14:50:48.194Z [DEBUG] http: request complete: method=GET path=/v1/evaluation/887b7552-86a4-ddf3-5e77-0f62a061ad98/allocations duration="272.923µs"
2022-06-09T14:50:48.232Z [DEBUG] http: request complete: method=GET path="/v1/deployment/91302608-fcb9-fbdf-68ed-8a538d12603c?stale=&wait=2000ms" duration="438.06µs"
2022-06-09T14:50:48.501Z [DEBUG] nomad.client: adding evaluations for rescheduling failed allocations: num_evals=1
jdoss commented 2 years ago

Well this is wild. I picked a random client node, added debug logging (log_level = "DEBUG") to its config, and restarted Nomad. Now things work.

I then rebooted all of the other client nodes and now they are allocating the job just fine and rendering the template using {{ range nomadService "fluentd" }}. 😕

I am going to drain all my clients and relaunch them with log_level = "DEBUG" from the start and see if I can get anything useful.

jdoss commented 2 years ago

Ok I got something useful. I thought I was taking crazy pills because I relaunched my client nodes with log_level = "DEBUG" enabled and my job started without issues using Nomad Service discovery. I then stopped the job to remove the debug logging on one of the nodes and started it again. The job then failed on a different node that still had debug logging enabled, and I got some good logs.

You can see at the start of the logs it is accessing the API and the task started just fine. Then the working task stops at 17:41:22:

Jun 09 17:41:19 nomad-ephemeral-production-1 nomad[1439]:     2022-06-09T17:41:19.692Z [DEBUG] http: request complete: method=GET path="/v1/service/fluentd?index=10191&namespace=default&stale=&wait=60000ms" duration=1m2.9762357s
Jun 09 17:41:22 nomad-ephemeral-production-1 nomad[1439]:     2022-06-09T17:41:22.240Z [DEBUG] http: request complete: method=GET path="/v1/service/fluentd?index=10191&namespace=default&stale=&wait=60000ms" duration=1m0.290887505s
Jun 09 17:41:26 nomad-ephemeral-production-1 nomad[1439]:     2022-06-09T17:41:26.058Z [DEBUG] http: request complete: method=GET path="/v1/service/fluentd?index=10191&namespace=default&stale=&wait=60000ms" duration=1m2.841628849s
Jun 09 17:41:26 nomad-ephemeral-production-1 nomad[1439]:     2022-06-09T17:41:26.095Z [DEBUG] http: request complete: method=GET path="/v1/service/fluentd?index=10191&namespace=default&stale=&wait=60000ms" duration=1m2.791039609s
Jun 09 17:41:30 nomad-ephemeral-production-1 nomad[1439]:     2022-06-09T17:41:30.705Z [DEBUG] http: request complete: method=GET path="/v1/service/fluentd?index=10191&namespace=default&stale=&wait=60000ms" duration=1m3.21155867s
Jun 09 17:41:30 nomad-ephemeral-production-1 nomad[1439]:     2022-06-09T17:41:30.928Z [DEBUG] http: request complete: method=GET path="/v1/service/fluentd?index=10191&namespace=default&stale=&wait=60000ms" duration=1m1.972011845s
Jun 09 17:42:09 nomad-ephemeral-production-1 nomad[1439]:     2022-06-09T17:42:09.368Z [DEBUG] client: updated allocations: index=11114 total=6 pulled=6 filtered=0
Jun 09 17:42:09 nomad-ephemeral-production-1 nomad[1439]:     2022-06-09T17:42:09.368Z [DEBUG] client: allocation updates: added=0 removed=0 updated=6 ignored=0
Jun 09 17:42:09 nomad-ephemeral-production-1 nomad[1439]:     2022-06-09T17:42:09.371Z [INFO]  client.driver_mgr.nomad-driver-podman: Stopping task: driver=podman @module=podman signal="" taskID=63d785bd-37c0-bcb9-11d5-e800ffe2e820/ssh/05a836de timestamp=2022-06-09T17:42:09.371Z
Jun 09 17:42:09 nomad-ephemeral-production-1 nomad[1439]:     2022-06-09T17:42:09.373Z [INFO]  client.driver_mgr.nomad-driver-podman: Stopping task: driver=podman signal="" taskID=7fa33f82-02b0-95af-88a4-1ffeb6d1a23f/ssh/f8a0c62c @module=podman timestamp=2022-06-09T17:42:09.373Z
Jun 09 17:42:09 nomad-ephemeral-production-1 nomad[1439]:     2022-06-09T17:42:09.375Z [INFO]  client.driver_mgr.nomad-driver-podman: Stopping task: driver=podman @module=podman signal="" taskID=dd408cea-c1a8-9af0-6582-1a3773ca34ab/ssh/b093b975 timestamp=2022-06-09T17:42:09.375Z
Jun 09 17:42:09 nomad-ephemeral-production-1 nomad[1439]:     2022-06-09T17:42:09.381Z [INFO]  client.driver_mgr.nomad-driver-podman: Stopping task: driver=podman @module=podman signal="" taskID=f35e2208-9e0c-a216-aad4-e935193caec4/ssh/1d5c7bfe timestamp=2022-06-09T17:42:09.381Z
Jun 09 17:42:09 nomad-ephemeral-production-1 nomad[1439]:     2022-06-09T17:42:09.389Z [INFO]  client.driver_mgr.nomad-driver-podman: Stopping task: driver=podman @module=podman signal="" taskID=21a070ba-d041-8ede-f7d8-44ce6ba7e2bd/ssh/009131d9 timestamp=2022-06-09T17:42:09.389Z

I have automation in place that regenerates the TLS certs from my CA and then reloads Nomad on a timer. You can see the reload here:

Jun 09 17:45:21 nomad-ephemeral-production-1 systemd[1]: Reloading nomad.service - Nomad...
Jun 09 17:45:21 nomad-ephemeral-production-1 nomad[1439]: ==> Caught signal: hangup
Jun 09 17:45:21 nomad-ephemeral-production-1 nomad[1439]: ==> Reloading configuration...
Jun 09 17:45:21 nomad-ephemeral-production-1 systemd[1]: Reloaded nomad.service - Nomad.
Jun 09 17:45:21 nomad-ephemeral-production-1 nomad[1439]:     2022-06-09T17:45:21.868Z [DEBUG] agent: starting reload of agent config
Jun 09 17:45:21 nomad-ephemeral-production-1 nomad[1439]:     2022-06-09T17:45:21.869Z [DEBUG] agent: starting reload of client config
Jun 09 17:45:21 nomad-ephemeral-production-1 nomad[1439]:     2022-06-09T17:45:21.869Z [INFO]  agent: reloading HTTP server with new TLS configuration

and then at 17:51:26 I start the job and it fails because it cannot access the Nomad Service API:

Jun 09 17:51:27 nomad-ephemeral-production-1 nomad[1439]:     2022-06-09T17:51:27.139Z [DEBUG] agent: (runner) rendering "(dynamic)" => "/opt/nomad/data/alloc/0b04b5cf-f7c3-fb16-2042-ef8a06fb3159/ssh/local/fluent-bit.conf"
Jun 09 17:51:27 nomad-ephemeral-production-1 nomad[1439]:     2022-06-09T17:51:27.145Z [INFO]  agent: (runner) rendered "(dynamic)" => "/opt/nomad/data/alloc/4715790d-e447-e464-aa3d-232be4a04e6f/ssh/local/fluent-bit.conf"
Jun 09 17:51:27 nomad-ephemeral-production-1 nomad[1439]:     2022-06-09T17:51:27.145Z [DEBUG] agent: (runner) checking template e410d2764b6e6460f1f327d1fe83d5b3
Jun 09 17:51:27 nomad-ephemeral-production-1 nomad[1439]:     2022-06-09T17:51:27.145Z [DEBUG] agent: (runner) missing data for 1 dependencies
Jun 09 17:51:27 nomad-ephemeral-production-1 nomad[1439]:     2022-06-09T17:51:27.146Z [DEBUG] agent: (runner) missing dependency: nomad.service(fluentd)
Jun 09 17:51:27 nomad-ephemeral-production-1 nomad[1439]:     2022-06-09T17:51:27.146Z [DEBUG] agent: (runner) add used dependency nomad.service(fluentd) to missing since isLeader but do not have a watcher
Jun 09 17:51:27 nomad-ephemeral-production-1 nomad[1439]:     2022-06-09T17:51:27.146Z [DEBUG] agent: (runner) was not watching 1 dependencies
Jun 09 17:51:27 nomad-ephemeral-production-1 nomad[1439]:     2022-06-09T17:51:27.146Z [DEBUG] agent: (watcher) adding nomad.service(fluentd)
Jun 09 17:51:27 nomad-ephemeral-production-1 nomad[1439]:     2022-06-09T17:51:27.146Z [DEBUG] agent: (runner) diffing and updating dependencies
Jun 09 17:51:27 nomad-ephemeral-production-1 nomad[1439]:     2022-06-09T17:51:27.146Z [DEBUG] agent: (runner) watching 1 dependencies
Jun 09 17:51:27 nomad-ephemeral-production-1 nomad[1439]:     2022-06-09T17:51:27.147Z [ERROR] agent: (runner) watcher reported error: nomad.service(fluentd): Get "http://127.0.0.1:4646/v1/service/fluentd?namespace=default&stale=&wait=60000ms": closed

Maybe the service discovery feature is not reloading correctly to use the new TLS certs?

nomad_GH_13303.txt

jrasell commented 2 years ago

Hi @jdoss, and thanks for the extra detail. There is a lot to unpack and comment on, so I'll try to go through it in order, quoting from your comments as markers.

Does your e2e tests use client auth with TLS?

Yes, the configuration file is available within this repo and matches your setup.

I expect it is saying closed in the error message because the port 4646 isn't speaking http and nomad is trying to access the API over http?

As you show in your example, a well-behaved HTTPS server responds with Client sent an HTTP request to an HTTPS server when a client makes a non-HTTPS request to an HTTPS-enabled endpoint. The closed message comes from net.ErrClosed, which indicates the template process is making a network call on an already closed connection.
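
As a minimal Go sketch of where that error text originates (illustrative only, not Nomad code): using a connection after it has been closed returns an error wrapping net.ErrClosed, i.e. "use of closed network connection", which is quite different from the HTTP 400 an HTTPS listener returns to a plain-HTTP client:

package main

import (
	"errors"
	"fmt"
	"net"
)

func main() {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	defer ln.Close()

	conn, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		panic(err)
	}

	// Something tears the connection down underneath the caller...
	conn.Close()

	// ...and the next use of it fails with net.ErrClosed.
	_, err = conn.Write([]byte("GET /v1/service/fluentd HTTP/1.0\r\n\r\n"))
	fmt.Println(err)                           // "... use of closed network connection"
	fmt.Println(errors.Is(err, net.ErrClosed)) // true
}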

Maybe the service discovery feature is not reloading correctly to use the new TLS certs?

In my previous comment I mentioned how the template runner uses a custom listener and dialer, which ties into the RPC layer and is unrelated to the public API where the TLS certificates are used. I therefore still believe the TLS aspect to be a red herring.

That being said, when reloading the Nomad agent a number of actions occur, which sometimes includes connection pool updates for the client -> server RPC connection. Along with the "closed" message being seen, I believe this is the area in which the bug exists. My hypothesis is that we are reloading the RPC connection pool without then informing the custom dialer of this change, which means it is using a closed connection.
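
A hand-wavy Go sketch of that suspected failure mode (the pool and consumer here are invented for illustration, not Nomad's actual implementation): the reload closes and replaces the pooled connection, but a consumer that captured the old connection keeps using the closed one:

package main

import (
	"fmt"
	"net"
	"sync"
)

// connPool is an invented stand-in for an RPC connection pool.
type connPool struct {
	mu   sync.Mutex
	conn net.Conn
}

// Reload closes the old connection and installs a new one, roughly what a
// config reload might trigger.
func (p *connPool) Reload(newConn net.Conn) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.conn != nil {
		p.conn.Close()
	}
	p.conn = newConn
}

// Conn returns whichever connection is current right now.
func (p *connPool) Conn() net.Conn {
	p.mu.Lock()
	defer p.mu.Unlock()
	return p.conn
}

func main() {
	ln, _ := net.Listen("tcp", "127.0.0.1:0")
	defer ln.Close()
	go func() {
		for {
			if _, err := ln.Accept(); err != nil {
				return
			}
		}
	}()

	dial := func() net.Conn {
		c, _ := net.Dial("tcp", ln.Addr().String())
		return c
	}

	pool := &connPool{conn: dial()}

	// Buggy consumer: grabs the connection once and never asks the pool again.
	stale := pool.Conn()

	// A reload swaps in a fresh connection and closes the old one.
	pool.Reload(dial())

	if _, err := stale.Write([]byte("ping")); err != nil {
		fmt.Println("stale consumer:", err) // use of closed network connection
	}
	if _, err := pool.Conn().Write([]byte("ping")); err == nil {
		fmt.Println("fresh lookup: ok")
	}
}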

I'll mark this issue as needing a little further investigation to solidify the cause and put it onto our backlog board for roadmapping. Thanks again for your detailed troubleshooting.

jdoss commented 2 years ago

No problem @jrasell! I am glad I was able to help and thanks for responding in such detail as well. 😃

jdoss commented 1 year ago

Hey @jrasell any update on getting this sorted out? I'd like to move away from Consul service discovery over to Nomad service discovery so I can use it with traefik's new Nomad integration. https://traefik.io/blog/traefik-proxy-fully-integrates-with-hashicorp-nomad/

jrasell commented 1 year ago

Hi @jdoss, the team have unfortunately been busy on other prioritised work. We are doing some roadmapping in the coming weeks though, so I will keep this issue in mind and add any updates as they are known.

jdoss commented 1 year ago

Hey @jrasell @tgross @lgfa29 I was just submitting a panic issue I found on Nomad v1.5.0-beta.1 and I remembered this issue, which is related to using short-lived TLS certs. I will do some testing on v1.5.0-beta.1 to see if the issues with Nomad service discovery still exist and I will report back.

Any chance this issue can get some eyes on it after the 1.5.0 GA dust settles if things are still wonky?

jdoss commented 1 year ago

I just wanted to follow up on this issue. I believe this is still a problem, and it impacts other parts of Nomad such as the Vault integration.

I am no longer seeing the issue within the Nomad Service discovery feature. The original issue seems to be fixed, as the internal RPC connection pool for Nomad Service discovery now appears to reload with the new TLS cert.

This is, however, still an issue with the Vault integration. I use the same CA for my Nomad and Vault TLS. I use short-lived TLS certs (valid for 24 hours) and renew them on a systemd timer every 12 hours. My Vault cluster currently has a valid TLS cert, and I can reach it from the server that is running Nomad.

Valid TLS cert on Vault

# step certificate inspect https://xxx.xxx.xxx.xxx:8200
Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number: 211686295302191930181129435020294053738 (0x9f414c288f3c39b68dccb35fc0135f6a)
    Signature Algorithm: ECDSA-SHA256
        Issuer: CN=Internal CA Intermediate CA
        Validity
            Not Before: Apr 14 09:07:44 2023 UTC
            Not After : Apr 15 09:08:44 2023 UTC

I can use mTLS to access Vault from the nomad server using the nomad certs

# curl --key /etc/nomad/tls/nomad.key  --cert /etc/nomad/tls/nomad.crt --cacert /etc/nomad/tls/nomad-ca.crt https://xxx.xxx.xxx.xxx:8200/
<a href="/ui/">Temporary Redirect</a>.

Vault is enabled in Nomad

vault {
  enabled     = true

  ca_file = "/etc/nomad/tls/nomad-ca.crt"
  cert_file = "/etc/nomad/tls/nomad.crt"
  key_file = "/etc/nomad/tls/nomad.key"

  address     = "https://xxx.xxx.xxx.xxx:8200"
  token       = "tokentokentoken"
  create_from_role = "nomad-cluster"
}

I am seeing this in my Nomad logs

# journalctl -u nomad.service -f
Apr 14 17:58:04 nomad[41606]:     2023-04-14T17:58:04.527Z [WARN]  nomad.vault: got error or bad auth, so backing off: error="failed to renew the vault token: Put \"https://xxx.xxx.xxx.xxx:8200/v1/auth/token/renew-self\": remote error: tls: bad certificate" recoverable=true
Apr 14 17:58:04 nomad[41606]:     2023-04-14T17:58:04.527Z [INFO]  nomad.vault: backing off renewal: retry=1m54s
Apr 14 18:00:02 nomad[41606]:     2023-04-14T18:00:02.124Z [WARN]  nomad.vault: got error or bad auth, so backing off: error="failed to renew the vault token: Put \"https://xxx.xxx.xxx.xxx:8200/v1/auth/token/renew-self\": remote error: tls: bad certificate" recoverable=true
Apr 14 18:00:02 nomad[41606]:     2023-04-14T18:00:02.125Z [INFO]  nomad.vault: backing off renewal: retry=1m58s
Apr 14 18:02:04 nomad[41606]:     2023-04-14T18:02:04.294Z [WARN]  nomad.vault: got error or bad auth, so backing off: error="failed to renew the vault token: Put \"https://xxx.xxx.xxx.xxx:8200/v1/auth/token/renew-self\": remote error: tls: bad certificate" recoverable=true
Apr 14 18:02:04 nomad[41606]:     2023-04-14T18:02:04.294Z [INFO]  nomad.vault: backing off renewal: retry=1m55s
Apr 14 18:04:02 nomad[41606]:     2023-04-14T18:04:02.944Z [WARN]  nomad.vault: got error or bad auth, so backing off: error="failed to renew the vault token: Put \"https://xxx.xxx.xxx.xxx:8200/v1/auth/token/renew-self\": remote error: tls: bad certificate" recoverable=true
Apr 14 18:04:02 nomad[41606]:     2023-04-14T18:04:02.944Z [INFO]  nomad.vault: backing off renewal: retry=1m39s
Apr 14 18:05:45 nomad[41606]:     2023-04-14T18:05:45.413Z [WARN]  nomad.vault: got error or bad auth, so backing off: error="failed to renew the vault token: Put \"https://xxx.xxx.xxx.xxx:8200/v1/auth/token/renew-self\": remote error: tls: bad certificate" recoverable=true
Apr 14 18:05:45 nomad[41606]:     2023-04-14T18:05:45.413Z [INFO]  nomad.vault: backing off renewal: retry=1m39s

You can see here that Nomad gets reloaded after the TLS cert renews (agent: reloading HTTP server with new TLS configuration), but the old expired cert is still being used and mTLS to the Vault endpoint fails.

Apr 14 09:13:53 systemd[1]: Reloading nomad.service - Nomad...
Apr 14 09:13:53 nomad[41606]: ==> Caught signal: hangup
Apr 14 09:13:53 nomad[41606]: ==> Reloading configuration...
Apr 14 09:13:53 systemd[1]: Reloaded nomad.service - Nomad.
Apr 14 09:13:53 nomad[41606]: ==> WARNING: Number of bootstrap servers should ideally be set to an odd number.
Apr 14 09:13:53 nomad[41606]:     2023-04-14T09:13:53.286Z [INFO]  nomad: reloading server connections due to configuration changes
Apr 14 09:13:53 nomad[41606]:     2023-04-14T09:13:53.287Z [ERROR] nomad.rpc: failed to accept RPC conn: error="accept tcp xxx.xxx.xxx.xxx:4647: use of closed network connection" delay=5ms
Apr 14 09:13:53 nomad[41606]:     2023-04-14T09:13:53.287Z [INFO]  nomad.rpc: closing server RPC connection
Apr 14 09:13:53 nomad[41606]:     2023-04-14T09:13:53.288Z [INFO]  client.fingerprint_mgr: reloading fingerprinter: fingerprinter=cni
Apr 14 09:13:53 nomad[41606]:     2023-04-14T09:13:53.288Z [ERROR] client.rpc: error performing RPC to server: error="rpc error: EOF" rpc=Node.GetClientAllocs server=xxx.xxx.xxx.xxx:4647
Apr 14 09:13:53 nomad[41606]:     2023-04-14T09:13:53.288Z [INFO]  agent: reloading HTTP server with new TLS configuration
Apr 14 09:13:53 nomad[41606]:     2023-04-14T09:13:53.515Z [INFO]  nomad.rpc: closing server RPC connection
Apr 14 09:13:53 nomad[41606]:     2023-04-14T09:13:53.613Z [INFO]  nomad.rpc: closing server RPC connection
Apr 14 09:13:53 nomad[41606]:     2023-04-14T09:13:53.615Z [INFO]  nomad.rpc: closing server RPC connection
Apr 14 09:14:09 nomad[41606]:     2023-04-14T09:14:09.106Z [INFO]  nomad.rpc: closing server RPC connection
Apr 14 09:14:58 nomad[41606]:     2023-04-14T09:14:58.198Z [WARN]  nomad.vault: got error or bad auth, so backing off: error="failed to renew the vault token: Put \"https://xxx.xxx.xxx.xxx:8200/v1/auth/token/renew-self\": remote error: tls: bad certificate" recoverable=true
Apr 14 09:14:58 nomad[41606]:     2023-04-14T09:14:58.198Z [INFO]  nomad.vault: backing off renewal: retry=1m48s
Apr 14 09:16:50 nomad[41606]:     2023-04-14T09:16:50.265Z [WARN]  nomad.vault: got error or bad auth, so backing off: error="failed to renew the vault token: Put \"https://xxx.xxx.xxx.xxx:8200/v1/auth/token/renew-self\": remote error: tls: bad certificate" recoverable=true
Apr 14 09:16:50 nomad[41606]:     2023-04-14T09:16:50.265Z [INFO]  nomad.vault: backing off renewal: retry=1m34s
Apr 14 09:18:27 nomad[41606]:     2023-04-14T09:18:27.729Z [WARN]  nomad.vault: got error or bad auth, so backing off: error="failed to renew the vault token: Put \"https://xxx.xxx.xxx.xxx:8200/v1/auth/token/renew-self\": remote error: tls: bad certificate" recoverable=true
Apr 14 09:18:27 nomad[41606]:     2023-04-14T09:18:27.729Z [INFO]  nomad.vault: backing off renewal: retry=1m2s
Apr 14 09:19:33 nomad[41606]:     2023-04-14T09:19:33.818Z [WARN]  nomad.vault: got error or bad auth, so backing off: error="failed to renew the vault token: Put \"https://xxx.xxx.xxx.xxx:8200/v1/auth/token/renew-self\": remote error: tls: bad certificate" recoverable=true
Apr 14 09:19:33 nomad[41606]:     2023-04-14T09:19:33.819Z [INFO]  nomad.vault: backing off renewal: retry=1m20s
Apr 14 09:20:57 nomad[41606]:     2023-04-14T09:20:57.561Z [WARN]  nomad.vault: got error or bad auth, so backing off: error="failed to renew the vault token: Put \"https://xxx.xxx.xxx.xxx:8200/v1/auth/token/renew-self\": remote error: tls: bad certificate" recoverable=true
Apr 14 09:20:57 nomad[41606]:     2023-04-14T09:20:57.561Z [INFO]  nomad.vault: backing off renewal: retry=1m44s

If I restart Nomad (not reload) with systemctl restart nomad.service, it no longer gets a remote error: tls: bad certificate error, but the Vault token has expired since it could not be renewed.

Apr 14 18:21:43 systemd[1]: Stopped nomad.service - Nomad.
Apr 14 18:21:43 systemd[1]: nomad.service: Consumed 11h 35min 26.463s CPU time.
Apr 14 18:21:44 systemd[1]: Starting nomad.service - Nomad...
Apr 14 18:21:44 systemd[1]: Started nomad.service - Nomad.
Apr 14 18:21:44 nomad[119017]: ==> WARNING: Number of bootstrap servers should ideally be set to an odd number.
Apr 14 18:21:44 nomad[119017]: ==> Loaded configuration from /etc/nomad/config/nomad.hcl
Apr 14 18:21:44 nomad[119017]: ==> Starting Nomad agent...
Apr 14 18:21:49 nomad[119017]: ==> Nomad agent configuration:
Apr 14 18:21:49 nomad[119017]:        Advertise Addrs: HTTP: xxx.xxx.xxx.xxx:4646; RPC: xxx.xxx.xxx.xxx:4647; Serf: xxx.xxx.xxx.xxx:4648
Apr 14 18:21:49 nomad[119017]:             Bind Addrs: HTTP: [xxx.xxx.xxx.xxx:4646 127.0.0.1:4646]; RPC: xxx.xxx.xxx.xxx:4647; Serf: xxx.xxx.xxx.xxx:4648
Apr 14 18:21:49 nomad[119017]:                 Client: true
Apr 14 18:21:49 nomad[119017]:              Log Level: INFO
Apr 14 18:21:49 nomad[119017]:                 Region: us-central (DC: central)
Apr 14 18:21:49 nomad[119017]:                 Server: true
Apr 14 18:21:49 nomad[119017]:                Version: 1.5.2
Apr 14 18:21:49 nomad[119017]: ==> Nomad agent started! Log data will stream in below:
Apr 14 18:21:49 nomad[119017]:     2023-04-14T18:21:44.629Z [INFO]  nomad: setting up raft bolt store: no_freelist_sync=false
Apr 14 18:21:49 nomad[119017]:     2023-04-14T18:21:44.632Z [INFO]  nomad.raft: starting restore from snapshot: id=45-89569-1681240368076 last-index=89569 last-term=45 size-in-bytes=610130
Apr 14 18:21:49 nomad[119017]:     2023-04-14T18:21:44.653Z [INFO]  nomad.raft: snapshot restore progress: id=45-89569-1681240368076 last-index=89569 last-term=45 size-in-bytes=610130 read-bytes=610130 percent-complete="100.00%"
Apr 14 18:21:49 nomad[119017]:     2023-04-14T18:21:44.653Z [INFO]  nomad.raft: restored from snapshot: id=45-89569-1681240368076 last-index=89569 last-term=45 size-in-bytes=610130
Apr 14 18:21:49 nomad[119017]:     2023-04-14T18:21:44.719Z [ERROR] nomad.vault: failed to validate self token/role: retry=30s
Apr 14 18:21:49 nomad[119017]:   error=
Apr 14 18:21:49 nomad[119017]:   | failed to lookup Vault periodic token: Error making API request.
Apr 14 18:21:49 nomad[119017]:   |
Apr 14 18:21:49 nomad[119017]:   | URL: POST https://xxx.xxx.xxx.xxx:8200/v1/auth/token/lookup
Apr 14 18:21:49 nomad[119017]:   | Code: 403. Errors:
Apr 14 18:21:49 nomad[119017]:   |
Apr 14 18:21:49 nomad[119017]:   | * permission denied
Apr 14 18:21:49 nomad[119017]:  

This issue is pretty glaring with short-lived TLS certs, but folks using longer-lived certs will most certainly run into it as well when they renew their certs and reload Nomad, which will cause issues and downtime for Nomad jobs that depend on Vault.
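
For reference, one common Go-level pattern that sidesteps this class of problem is to load the client certificate lazily on every TLS handshake via tls.Config.GetClientCertificate, so a cert renewed on disk is picked up on the next connection without any reload at all. Here is a minimal sketch of that technique, reusing the cert paths from the config above (this is not how Nomad's Vault integration is implemented, just the general approach):

package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"net/http"
	"os"
)

func main() {
	// Paths taken from the config above; adjust as needed.
	const (
		caFile   = "/etc/nomad/tls/nomad-ca.crt"
		certFile = "/etc/nomad/tls/nomad.crt"
		keyFile  = "/etc/nomad/tls/nomad.key"
	)

	caPEM, err := os.ReadFile(caFile)
	if err != nil {
		panic(err)
	}
	caPool := x509.NewCertPool()
	caPool.AppendCertsFromPEM(caPEM)

	tlsCfg := &tls.Config{
		RootCAs: caPool,
		// Re-read the keypair on every handshake, so a cert renewed on disk is
		// used for the next connection without reloading the process.
		GetClientCertificate: func(*tls.CertificateRequestInfo) (*tls.Certificate, error) {
			cert, err := tls.LoadX509KeyPair(certFile, keyFile)
			if err != nil {
				return nil, err
			}
			return &cert, nil
		},
	}

	client := &http.Client{Transport: &http.Transport{TLSClientConfig: tlsCfg}}

	// Example call against a Vault endpoint (address elided as above).
	resp, err := client.Get("https://xxx.xxx.xxx.xxx:8200/v1/sys/health")
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println(resp.Status)
}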