hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

Consul connect service mesh connections may fail after 72 hours #16779

Closed mmeier86 closed 1 year ago

mmeier86 commented 1 year ago

Overview of the Issue

I'm running a Nomad cluster using Consul Connect service mesh for almost all jobs. Since updating Consul from 1.13.5 to 1.15.1, the Consul Connect mesh stops working every three days. This seems to happen reliably 72 hours after a job (and its accompanying Envoy sidecar proxy) has been started.

I have also tried out the "consul troubleshoot proxy" command, and finally found a potential error source:

nsenter -t 691115 -n consul troubleshoot proxy -envoy-admin-endpoint=127.0.0.2:19001 -upstream-ip=127.0.0.1
==> Validation                                             
 ! Certificate chain is expired
  -> Check the logs of the Consul agent configuring the local proxy and ensure XDS updates are being sent to the proxy
 ! Certificate chain is expired
  -> Check the logs of the Consul agent configuring the local proxy and ensure XDS updates are being sent to the proxy
 ! Certificate chain is expired
  -> Check the logs of the Consul agent configuring the local proxy and ensure XDS updates are being sent to the proxy
 ✓ Envoy has 0 rejected configurations
 ✓ Envoy has detected 1464 connection failure(s)
 ! No listener for upstream "127.0.0.1" 
  -> Check that your upstream service is registered with Consul
  -> Make sure your upstream exists by running the `consul[-k8s] troubleshoot upstreams` command
  -> If you are using transparent proxy for this upstream, ensure you have set up allow intentions to the upstream
  -> Check the logs of the Consul agent configuring the local proxy to ensure XDS resources were sent by Consul
 ! No clusters found on route or listener

This seems to fit the "failing 72 hours after job start" symptoms, if indeed the problem is that the mTLS certs are not getting renewed.

Also note that I upgraded directly from 1.13.5 to 1.15.1, including the necessary changes for the breaking grpc_tls change. I did this update relatively late because of the Nomad config issues I experienced with gRPC and the Consul Connect Envoy proxy setup. Is it possible that I got the grpc_tls config wrong somewhere, and Consul is not able to update the Envoy proxy with new certificates?

One last note: Taking down the downstream job and restarting it (nomad job stop and nomad job start) seems to completely fix the issue.

Reproduction Steps

Steps to reproduce this issue:

  1. Setup a Consul and Nomad cluster with three servers and two nodes
  2. Enable Consul Service mesh, using Consul's internal CA as the Connect CA
  3. Launch two jobs, with one using the other as an upstream
  4. Wait for about 72 hours
  5. Observe that the downstream job can no longer access the upstream job through the service mesh (one way to check this is sketched below)
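
One way to observe the failure in step 5 without restarting anything is to check the certificate that the Envoy sidecar is actually serving on its exposed mesh port, as several commenters below do. This is only a sketch: the address and port are placeholders for the upstream sidecar's public listener, and it assumes openssl is available on the node.

```
# Placeholder address/port: point this at the upstream's Envoy public listener.
# Once the issue hits (~72h after job start), the certificate shown here is
# expired even though Consul has already issued a fresh leaf certificate.
openssl s_client -showcerts -connect 10.0.0.5:22180 </dev/null 2>/dev/null \
  | openssl x509 -noout -subject -enddate
```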

Consul info for both Client and Server

Client info

```
agent:
    check_monitors = 0
    check_ttls = 0
    checks = 11
    services = 16
build:
    prerelease =
    revision = 7c04b6a0
    version = 1.15.1
    version_metadata =
consul:
    acl = enabled
    known_servers = 3
    server = false
runtime:
    arch = amd64
    cpu_count = 4
    goroutines = 160
    max_procs = 4
    os = linux
    version = go1.20.1
serf_lan:
    coordinate_resets = 0
    encrypted = true
    event_queue = 0
    event_time = 147
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 67106
    members = 16
    query_queue = 0
    query_time = 1
```

```
retry_join = ["server1", "server2", "server3"]
server = false
data_dir = "/path/to/data"
addresses {
  https = "127.0.0.1"
}
advertise_addr = "10.***"
datacenter = "homenet"
log_file = "/var/log/consul/"
log_rotate_max_files = 10
ports {
  http = -1
  https = 8501
  grpc_tls = 8502
}
tls {
  defaults {
    cert_file = "/etc/homenet-certs/homenet.crt"
    key_file = "/etc/homenet-certs/homenet.priv"
    ca_file = "/etc/ssl/certs/homenet-ca.pem"
  }
}
acl = {
  enabled = true
  default_policy = "deny"
  enable_token_persistence = true
  tokens {
    agent = "agent-token"
  }
}
encrypt = "encryption-key"
dns_config {
  node_ttl = "120s"
  service_ttl {
    "*" = "10s"
  }
  soa {
    min_ttl = 5
  }
}
log_level = "info"
connect {
  enabled = true
}
peering {
  enabled = false
}
```
Server info

```
agent:
    check_monitors = 0
    check_ttls = 1
    checks = 4
    services = 4
build:
    prerelease =
    revision = 7c04b6a0
    version = 1.15.1
    version_metadata =
consul:
    acl = enabled
    bootstrap = false
    known_datacenters = 1
    leader = false
    leader_addr = 10.***:8300
    server = true
raft:
    applied_index = 40094594
    commit_index = 40094594
    fsm_pending = 0
    last_contact = 58.855966ms
    last_log_index = 40094594
    last_log_term = 356
    last_snapshot_index = 40086560
    last_snapshot_term = 356
    latest_configuration = [{Suffrage:Voter ID:5d47ca91-6753-e044-4635-8fab8015d58c Address:server1:8300} {Suffrage:Voter ID:a412ed8e-3da1-a8c0-b968-b2d8e0327614 Address:server2:8300} {Suffrage:Voter ID:193eb5bd-3ddc-ca51-7b29-9a1031a8b433 Address:server3:8300}]
    latest_configuration_index = 0
    num_peers = 2
    protocol_version = 3
    protocol_version_max = 3
    protocol_version_min = 0
    snapshot_version_max = 1
    snapshot_version_min = 0
    state = Follower
    term = 356
runtime:
    arch = arm64
    cpu_count = 4
    goroutines = 386
    max_procs = 4
    os = linux
    version = go1.20.1
serf_lan:
    coordinate_resets = 0
    encrypted = true
    event_queue = 0
    event_time = 147
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 67106
    members = 16
    query_queue = 0
    query_time = 1
serf_wan:
    coordinate_resets = 0
    encrypted = true
    event_queue = 0
    event_time = 1
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 2019
    members = 3
    query_queue = 0
    query_time = 1
```

```
retry_join = ["server1", "server2", "server3"]
server = true
data_dir = "/data/dir"
bootstrap_expect = 3
ui_config {
  enabled = true
}
addresses {
  dns = "0.0.0.0"
  https = "0.0.0.0"
}
advertise_addr = "10.***"
datacenter = "homenet"
log_file = "/var/log/consul/"
log_rotate_max_files = 10
ports {
  http = -1
  https = 8501
  grpc_tls = 8502
}
tls {
  defaults {
    cert_file = "/etc/homenet-certs/homenet.crt"
    key_file = "/etc/homenet-certs/homenet.priv"
    ca_file = "/etc/ssl/certs/homenet-ca.pem"
  }
}
acl = {
  enabled = true
  default_policy = "deny"
  enable_token_persistence = true
  tokens {
    agent = "****"
    # Special DNS token so that the Consul server can read all services and nodes
    # and answer DNS queries.
    default = "***"
  }
}
encrypt = "***"
dns_config {
  node_ttl = "120s"
  service_ttl {
    "*" = "10s"
  }
  soa {
    min_ttl = 5
  }
}
log_level = "info"
connect {
  enabled = true
}
peering {
  enabled = false
}
```

Operating system and Environment details

Both the three servers and the clients are mostly running on Raspberry Pis, with a couple of x86 nodes in the mix as well.

Log Fragments

This is the big problem I'm seeing: There is nothing in any of the logs. Both the Consul servers and the Consul clients are only showing the standard "service check synced" messages. Just going by the logs, there is absolutely nothing wrong with the system. The same goes for the Envoy proxy sidecars in the Nomad jobs. They do not have any log output at all after initial startup.

Hm, looking further back in the logs, I saw this line in the Consul client log multiple times:

[WARN]  agent.cache: handling error in Cache.Notify: cache-type=trust-bundles error="rpc error: code = Unavailable desc = the connection is draining" index=1

radykal-com commented 1 year ago

We are facing the same problem just with consul 1.15.1 (not using nomad)

consul troubleshoot proxy -upstream-ip=127.0.0.1
==> Validation
 ! Certificate chain is expired
  -> Check the logs of the Consul agent configuring the local proxy and ensure XDS updates are being sent to the proxy
 ! Certificate chain is expired
  -> Check the logs of the Consul agent configuring the local proxy and ensure XDS updates are being sent to the proxy
 ! Certificate chain is expired
  -> Check the logs of the Consul agent configuring the local proxy and ensure XDS updates are being sent to the proxy
 ! Certificate chain is expired
  -> Check the logs of the Consul agent configuring the local proxy and ensure XDS updates are being sent to the proxy
 ! Certificate chain is expired
  -> Check the logs of the Consul agent configuring the local proxy and ensure XDS updates are being sent to the proxy
 ! Certificate chain is expired
  -> Check the logs of the Consul agent configuring the local proxy and ensure XDS updates are being sent to the proxy
 ! Certificate chain is expired
  -> Check the logs of the Consul agent configuring the local proxy and ensure XDS updates are being sent to the proxy
 ! Certificate chain is expired
  -> Check the logs of the Consul agent configuring the local proxy and ensure XDS updates are being sent to the proxy
 ✓ Envoy has 0 rejected configurations
 ✓ Envoy has detected 128536 connection failure(s)

This wasn't happening with consul 1.14.4

kisunji commented 1 year ago

Thank you for the report. Could you share what envoy command you are using to run the sidecars?

And perhaps run a separate consul monitor -log-level debug process to see if any debug logs could help.
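
A minimal way to capture those debug logs while waiting for the next expiry window could look like the sketch below; it assumes CLI access to the affected client agent, and you may need -http-addr/-token flags depending on your setup.

```
# Stream agent debug logs and keep a copy on disk for later inspection
consul monitor -log-level=debug | tee consul-debug.log
```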

The trust-bundles warn log relates to peering-related caching so I think it may be a red herring.

radykal-com commented 1 year ago

From our side we run it as consul connect envoy -sidecar-for=service-name-here.

As a temporary fix we just changed the leaf certificate TTL to several months until it's fixed. With the new TTL it's possible that no certificate rotation will happen under normal circumstances, so we're not going to see anything in the logs.

magnetarnix commented 1 year ago

We are also facing the same problem with Nomad + Consul + Vault. The problem occurred for us when upgrading from Consul 1.14.3 to Consul 1.15.0. During the upgrade process we also enabled ACLs but I don't think it's related to this issue.

We are now running: Nomad v1.4.4, Consul v1.15.0, Envoy v1.24.0 (the default version started by Nomad).

For us the sidecars are run by Nomad for Consul Connect with the default options.

Note that we found out that restarting the Consul clients actually results in Envoy losing connection to Consul and then reconnecting, with certificates being renewed. However just restarting the Consul Connect sidecar task doesn't result in the certificate being renewed. Envoy logs when restarting the Consul client:

[2023-03-24 15:21:57.745][1][info][upstream] [source/server/lds_api.cc:82] lds: add/update listener 'public_listener:0.0.0.0:22180'
[2023-03-24 15:21:57.745][1][info][config] [source/server/listener_manager_impl.cc:831] all dependencies initialized. starting workers
[2023-03-24 15:36:57.739][1][info][main] [source/server/drain_manager_impl.cc:171] shutting down parent after drain
[2023-03-27 14:34:21.191][1][warning][config] [./source/common/config/grpc_stream.h:163] DeltaAggregatedResources gRPC config stream to local_agent closed: 13, 
[2023-03-27 14:34:26.482][1][info][upstream] [source/common/upstream/cds_api_helper.cc:35] cds: add 0 cluster(s), remove 0 cluster(s)
[2023-03-27 14:34:26.482][1][info][upstream] [source/common/upstream/cds_api_helper.cc:72] cds: added/updated 0 cluster(s), skipped 0 unmodified cluster(s)
[2023-03-27 14:34:26.487][1][warning][misc] [source/common/protobuf/message_validator_impl.cc:21] Deprecated field: type envoy.type.matcher.v3.RegexMatcher Using deprecated option 'envoy.type.matcher.v3.RegexMatcher.google_re2' from file regex.proto. This configuration will be removed from Envoy soon. Please see https://www.envoyproxy.io/docs/envoy/latest/version_history/version_history for details. If continued use of this field is absolutely necessary, see https://www.envoyproxy.io/docs/envoy/latest/configuration/operations/runtime#using-runtime-overrides-for-deprecated-features for how to apply a temporary and highly discouraged override.
[2023-03-27 14:34:26.487][1][warning][misc] [source/common/protobuf/message_validator_impl.cc:21] Deprecated field: type envoy.type.matcher.v3.RegexMatcher Using deprecated option 'envoy.type.matcher.v3.RegexMatcher.google_re2' from file regex.proto. This configuration will be removed from Envoy soon. Please see https://www.envoyproxy.io/docs/envoy/latest/version_history/version_history for details. If continued use of this field is absolutely necessary, see https://www.envoyproxy.io/docs/envoy/latest/configuration/operations/runtime#using-runtime-overrides-for-deprecated-features for how to apply a temporary and highly discouraged override.
[2023-03-27 14:34:26.487][1][warning][misc] [source/common/protobuf/message_validator_impl.cc:21] Deprecated field: type envoy.type.matcher.v3.RegexMatcher Using deprecated option 'envoy.type.matcher.v3.RegexMatcher.google_re2' from file regex.proto. This configuration will be removed from Envoy soon. Please see https://www.envoyproxy.io/docs/envoy/latest/version_history/version_history for details. If continued use of this field is absolutely necessary, see https://www.envoyproxy.io/docs/envoy/latest/configuration/operations/runtime#using-runtime-overrides-for-deprecated-features for how to apply a temporary and highly discouraged override.
[2023-03-27 14:34:26.487][1][warning][misc] [source/common/protobuf/message_validator_impl.cc:21] Deprecated field: type envoy.type.matcher.v3.RegexMatcher Using deprecated option 'envoy.type.matcher.v3.RegexMatcher.google_re2' from file regex.proto. This configuration will be removed from Envoy soon. Please see https://www.envoyproxy.io/docs/envoy/latest/version_history/version_history for details. If continued use of this field is absolutely necessary, see https://www.envoyproxy.io/docs/envoy/latest/configuration/operations/runtime#using-runtime-overrides-for-deprecated-features for how to apply a temporary and highly discouraged override.
[2023-03-27 14:34:26.487][1][warning][misc] [source/common/protobuf/message_validator_impl.cc:21] Deprecated field: type envoy.type.matcher.v3.RegexMatcher Using deprecated option 'envoy.type.matcher.v3.RegexMatcher.google_re2' from file regex.proto. This configuration will be removed from Envoy soon. Please see https://www.envoyproxy.io/docs/envoy/latest/version_history/version_history for details. If continued use of this field is absolutely necessary, see https://www.envoyproxy.io/docs/envoy/latest/configuration/operations/runtime#using-runtime-overrides-for-deprecated-features for how to apply a temporary and highly discouraged override.
[2023-03-27 14:34:26.487][1][warning][misc] [source/common/protobuf/message_validator_impl.cc:21] Deprecated field: type envoy.type.matcher.v3.RegexMatcher Using deprecated option 'envoy.type.matcher.v3.RegexMatcher.google_re2' from file regex.proto. This configuration will be removed from Envoy soon. Please see https://www.envoyproxy.io/docs/envoy/latest/version_history/version_history for details. If continued use of this field is absolutely necessary, see https://www.envoyproxy.io/docs/envoy/latest/configuration/operations/runtime#using-runtime-overrides-for-deprecated-features for how to apply a temporary and highly discouraged override.
[2023-03-27 14:34:26.487][1][warning][misc] [source/common/protobuf/message_validator_impl.cc:21] Deprecated field: type envoy.type.matcher.v3.RegexMatcher Using deprecated option 'envoy.type.matcher.v3.RegexMatcher.google_re2' from file regex.proto. This configuration will be removed from Envoy soon. Please see https://www.envoyproxy.io/docs/envoy/latest/version_history/version_history for details. If continued use of this field is absolutely necessary, see https://www.envoyproxy.io/docs/envoy/latest/configuration/operations/runtime#using-runtime-overrides-for-deprecated-features for how to apply a temporary and highly discouraged override.
[2023-03-27 14:34:26.487][1][warning][misc] [source/common/protobuf/message_validator_impl.cc:21] Deprecated field: type envoy.type.matcher.v3.RegexMatcher Using deprecated option 'envoy.type.matcher.v3.RegexMatcher.google_re2' from file regex.proto. This configuration will be removed from Envoy soon. Please see https://www.envoyproxy.io/docs/envoy/latest/version_history/version_history for details. If continued use of this field is absolutely necessary, see https://www.envoyproxy.io/docs/envoy/latest/configuration/operations/runtime#using-runtime-overrides-for-deprecated-features for how to apply a temporary and highly discouraged override.
[2023-03-27 14:34:26.487][1][warning][misc] [source/common/protobuf/message_validator_impl.cc:21] Deprecated field: type envoy.type.matcher.v3.RegexMatcher Using deprecated option 'envoy.type.matcher.v3.RegexMatcher.google_re2' from file regex.proto. This configuration will be removed from Envoy soon. Please see https://www.envoyproxy.io/docs/envoy/latest/version_history/version_history for details. If continued use of this field is absolutely necessary, see https://www.envoyproxy.io/docs/envoy/latest/configuration/operations/runtime#using-runtime-overrides-for-deprecated-features for how to apply a temporary and highly discouraged override.
[2023-03-27 14:34:26.487][1][warning][misc] [source/common/protobuf/message_validator_impl.cc:21] Deprecated field: type envoy.type.matcher.v3.RegexMatcher Using deprecated option 'envoy.type.matcher.v3.RegexMatcher.google_re2' from file regex.proto. This configuration will be removed from Envoy soon. Please see https://www.envoyproxy.io/docs/envoy/latest/version_history/version_history for details. If continued use of this field is absolutely necessary, see https://www.envoyproxy.io/docs/envoy/latest/configuration/operations/runtime#using-runtime-overrides-for-deprecated-features for how to apply a temporary and highly discouraged override.
[2023-03-27 14:34:26.488][1][info][upstream] [source/server/lds_api.cc:82] lds: add/update listener 'public_listener:0.0.0.0:22180'

After a quick look we don't see any obvious issue in the debug logs, but they are so verbose that we might be missing them.

kisunji commented 1 year ago

If anyone encounters this bug where mesh connectivity fails for a service, could they try calling <http_address>/v1/agent/connect/ca/leaf/<service_name>?index=1&wait=1s and paste the results with private key PEM removed?
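
As a concrete sketch of that request (the agent address and service name are placeholders; jq is optional and only used here to strip the private key before pasting):

```
# Fetch the agent's cached leaf certificate for a service and drop the private key PEM
curl -s -H "X-Consul-Token: $CONSUL_HTTP_TOKEN" \
  "http://127.0.0.1:8500/v1/agent/connect/ca/leaf/my-service?index=1&wait=1s" \
  | jq 'del(.PrivateKeyPEM)'
```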

magnetarnix commented 1 year ago

When I first got the issue I tried to curl 127.0.0.1:8500/v1/agent/connect/ca/leaf/<service_name>, and the certificate returned by Consul was not expired (it was actually very recent), while the certificate I got from openssl s_client -showcerts -connect <exposed IP:Port of Envoy sidecar for service_name> was expired. I just renewed all my certificates (by restarting the Consul clients), so I can't check yet with ?index=1&wait=1s added.

mmeier86 commented 1 year ago

I'm also running the Envoy sidecar through Nomad with the default options.

I will set up a consul monitor later tonight. I will also provide the output of the curl on the leaf cert. I expect this to happen again in my cluster only on Wednesday, as I restarted all of my services yesterday night after I opened the issue, and it happens about 72 hours after the service starts.

jcpreston26 commented 1 year ago

We have also been experiencing this over the past couple of weeks. We just had a failure about ten or fifteen minutes ago. Our output of the leaf URL for the failed service is below (our security team requested I redact the certs as well; please let me know if you need that data).

{"SerialNumber":"03:58:c3","CertPEM":"-----BEGIN CERTIFICATE-----\nCERT1-----END CERTIFICATE-----\n-----BEGIN CERTIFICATE-----\nCERT2\n-----END CERTIFICATE-----\n-----BEGIN CERTIFICATE-----\nCERT3\n-----END CERTIFICATE-----\n-----BEGIN CERTIFICATE-----\nCERT4-----END CERTIFICATE-----\n-----BEGIN CERTIFICATE-----\nCERT5\n-----END CERTIFICATE-----\n-----BEGIN CERTIFICATE-----\nCERT6\n-----END CERTIFICATE-----\n-----BEGIN CERTIFICATE-----\nCERT7\n-----END CERTIFICATE-----\n","PrivateKeyPEM":"-----BEGIN EC PRIVATE KEY-----\nREDACTEDREDACTED\n-----END EC PRIVATE KEY-----\n","Service":"REDACTED","ServiceURI":"spiffe://be1573cd-4b58-c1c1-c9ed-098960a0b4d3.consul/ns/default/dc/si/svc/REDACTED","ValidAfter":"2023-03-24T19:58:49Z","ValidBefore":"2023-03-27T19:58:49Z","CreateIndex":122138967,"ModifyIndex":122138967}

Consul: 1.15.1, Nomad: 1.5.2, Envoy: 1.25.1

jcpreston26 commented 1 year ago

Behavior we have seen is exactly the same as from the OP. Stopping and restarting the Nomad job resolves the situation every single time. This has happened occasionally in the past, but ever since we upgraded to Consul 1.15, it has been almost every single time the certificates need to be renewed. We've been trying to get some logging out of it to debug the situation, but we're just not seeing anything indicative of failure in Consul, Nomad, or Envoy logging until we get to the failure point and we start seeing the certificate expirations within the Envoy logs.

If there's any additional logging we can provide, I would be glad to.

jcpreston26 commented 1 year ago

Further examination shows that the response when accessing the Consul client instance on that Nomad machine is not the same as when I hit the Consul servers. The Consul server cluster has the new, updated certificate (expiring March 30), but the Consul client still has the March 27 expiration.

When stopping the job in Nomad, the Consul client still returns the March 27 expiration. As soon as I restart the job in Nomad, the Consul client returns the March 30 expiration.

kisunji commented 1 year ago

Thank you @jcpreston26 for the clarification. That seems to fit @magnetarnix 's findings and our internal testing as well: the leaf cert rotation is happening correctly in the server but the client is not seeing the updates. We will continue to investigate and keep everyone updated.

magnetarnix commented 1 year ago

I seem to observe a different behavior: it has been about 24h since I last renewed certificates for my cluster, and I get the following:

So it seems like for me the certificate returned by Consul is up to date, including on the Consul clients, but the Envoy certificate is not updated.

Note that, IIRC, when I ran curl 127.0.0.1:8500/v1/agent/connect/ca/leaf/service_name after I first noticed my certificates were expired, I did it on the Consul client the service was running on, not on a control plane server, and the certificate returned by Consul was up to date. So it seems to me like a communication issue between Envoy and Consul, not between Consul servers and clients.

FelipeEmerim commented 1 year ago

We seem to be getting this on Kubernetes (Azure AKS) with consul (1.15.0) + consul-k8s (1.1.0) as well. Querying the consul server <http_address>/v1/agent/connect/ca/leaf/<service_name>?index=1&wait=1s returns a valid certificate. However, the new certificate is not propagated to the service, which is still using a certificate that expired weeks ago.

After restarting the pods, we saw that they got the updated certificate. It seems whatever is responsible for propagating mTLS certificates is not able to do so on Consul >= 1.15. This was not an issue on Consul 1.14.x, as we have had that version running on another cluster for about a month and have not seen this issue.

It is also worth mentioning that performing a rollout on the consul-server statefulset also updates the certificates in the service pods, without even restarting the service pods.
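
For anyone who wants to try the same workaround on consul-k8s, a rough sketch follows; the namespace and StatefulSet name are assumptions, so check them with kubectl get statefulsets first.

```
# Rolling restart of the Consul server StatefulSet, then wait for it to finish
kubectl -n consul rollout restart statefulset/consul-server
kubectl -n consul rollout status statefulset/consul-server
```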

mmeier86 commented 1 year ago

I can confirm the previous findings here as well. The certs returned by the Consul Server and Client are both new, the certs returned when directly querying the Envoy port are going to expire in about 1.5 hours. I'm looking through the debug logs at the moment, but I'm not seeing anything.

I will leave it running through the full service restart after the certs expire, perhaps something interesting shows up in the Consul logs during Envoy startup.

mmeier86 commented 1 year ago

Nothing interesting during the service restart either. But I saw this line appear multiple times:

2023-03-30T00:19:26.874+0200 [ERROR] agent.http: Request error: method=GET url=/v1/config/proxy-defaults/global from=127.0.0.1:60878 error="Config entry not found for \"proxy-defaults\" / \"global\""

jkirschner-hashicorp commented 1 year ago

@mmeier86 : Given that at least one poster has experienced this without Nomad being involved, I'm going to remove "in Nomad cluster" from the title to make this Github issue the central discussion point for this topic.

Lord-Y commented 1 year ago

I'm having the same issue in production, but it takes at most about an hour before failing.

hashi-derek commented 1 year ago

We have submitted a PR to revert some behavior that was changed between 1.14 and 1.15, which we believe is the problematic area of code, and should be releasing a patch before Monday. We thank you all for your patience and assistance on this issue. I also notice that some of the users in this thread have helped us in the past with other issues, and to that I must say that I greatly appreciate your continued support of Consul and the open source community.

After lots of testing, the issue appears to be a race-condition, which makes it tricky to isolate and consistently reproduce. While we have been able to locate a timing problem in the cache, it may not necessarily be the same that you all are experiencing. We will continue to investigate this issue after the patch to ensure it is resolved properly. Because of this, it would be beneficial if you could provide extra details of your environment such as:

  1. When the issue occurs, does it only affect sidecars for a single agent?
  2. Are all instances of a particular service affected simultaneously?
  3. Roughly how many service instances do you have deployed?
  4. What is the configured leaf cert TTL for your environment?
  5. Are ACLs enabled for your system? If so, are multiple service instances using the same token?
  6. Are you using the Vault CA or the built-in Connect CA?

I understand that due to privacy concerns, you may not be able to provide all of the above answers. However, any info that you can provide will be greatly appreciated and allow us to correlate and reproduce the issue more consistently.
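
For questions 4 and 6, one quick way to check is the sketch below (it assumes CLI access to the cluster and, if ACLs are enabled, a token with operator permissions):

```
# Shows the active Connect CA provider and its config, including LeafCertTTL
consul connect ca get-config
```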

FelipeEmerim commented 1 year ago

Thanks for the update!

We can deploy the fix in our environment to test it once it is released. As for the details you asked:

  1. When the issue occurs, does it only affect sidecars for a single agent?

I did not check this when the issue occurred. It seems very likely as we've had this issue for many different sidecars. I am not sure if this question applies to us as we only have server agents.

  2. Are all instances of a particular service affected simultaneously?

Yes. All instances of the services we checked were using expired certificates. If we increase the replica count of a service however, the new instances are able to get the updated certificates while the existing instances will continue to use the expired certs.

  3. Roughly how many service instances do you have deployed?

We've had this in two environments. One has more than 50 services with 2-5 instances each while the other has only 5 services with 2-5 instances each. Both environments have 3 Consul servers.

  4. What is the configured leaf cert TTL for your environment?

We use the default TTL of 72h.

  5. Are ACLs enabled for your system? If so, are multiple service instances using the same token?

We don't have ACLs enabled.

  6. Are you using the Vault CA or the built-in Connect CA?

We are using the built-in Connect CA.

Additional details:

jcpreston26 commented 1 year ago

1 and 2. All agents, eventually. Not all service instances have the same certificate at once; this is due to restarts, I assume. But all agents and all services eventually have expired certificates.

  3. A couple of different environments with approximately 10 Nomad jobs each. Each job is running between 1 and 4 allocations. Each Nomad client node has its own Consul client; the Consul server cluster has 3 servers.
  4. Default 72h.
  5. Consul ACLs are enabled. Each Nomad client is configured with the same ACL.
  6. We are using the built-in Connect CA.

mmeier86 commented 1 year ago

For what it's worth, for me the issue is completely reproducible, every 72h after a service restart happens.

1. When the issue occurs, does it only affect sidecars for a single agent?

No, I've got 9 Consul agents on 9 physical machines, it happens for all of them.

  2. Are all instances of a particular service affected simultaneously? Sadly I can't answer this, as I don't have any multi-instance services.
  3. Roughly how many service instances do you have deployed? It differs. I've got a total of 40 services registered with my Consul cluster. 21 of them have Consul Connect sidecars, one (Traefik) is Consul Connect Native. All of these are launched and registered via Nomad. The rest are either also Nomad jobs but outside the mesh, or manually registered services.
  4. What is the configured leaf cert TTL for your environment? I have the default value, which seems to be 72 hours, going by the cert TTLs I'm seeing.
  5. Are ACLs enabled for your system? If so, are multiple service instances using the same token? Yes, I have ACLs enabled in my cluster. And I believe all Consul Connect services use the same token, as they are registered by Nomad, so they would all use the Consul token I have configured in Nomad.
  6. Are you using the Vault CA or the built-in Connect CA? The built-in Connect CA.

On the topic of "thanks for the assistance": I get to use a pretty nice service discovery and mesh networking tool for free in my Homelab - the least I can do is write useful bug reports and assist in the resolution. :-)

radykal-com commented 1 year ago

  1. When the issue occurs, does it only affect sidecars for a single agent?

No, it happens to all agents at the time their leaf certificates expire.

  2. Are all instances of a particular service affected simultaneously?

Yes if all instances are launched at the same time. If a new instance is launched once the leaf certificate is in rotation period, then it grabs a new one.

  3. Roughly how many service instances do you have deployed?

~70 client instances + 3 consul servers

  4. What is the configured leaf cert TTL for your environment?

Default 72h

  5. Are ACLs enabled for your system? If so, are multiple service instances using the same token?

No ACL enabled

  6. Are you using the Vault CA or the built-in Connect CA?

built-in Connect CA

hashi-derek commented 1 year ago

Version 1.15.2 was published last night and should have the revert for those who are interested in trying it out. Thank you again for all of your support, and please let us know if it resolved your problem.

magnetarnix commented 1 year ago

1. When the issue occurs, does it only affect sidecars for a single agent?

All agents (30+) are affected and show the same behavior

2. Are all instances of a particular service affected simultaneously?

Not necessarily: if they are launched at the same time then yes, but more generally it happens 72h after each instance is started.

3. Roughly how many service instances do you have deployed?

1000+ in one cluster, 250+ in another, both are impacted

4. What is the configured leaf cert TTL for your environment?

The default 72h

5. Are ACLs enabled for your system? If so, are multiple service instances using the same token?

Yes ACLs are enabled, and they got enabled during our upgrade to Consul 1.15.0.

6. Are you using the Vault CA or the built-in Connect CA?

Vault CA

mmeier86 commented 1 year ago

I believe I have tentative good news: I updated on Friday evening to 1.15.2, without restarting my Nomad jobs, just the Consul servers and agents. All connect certificates got renewed during the Consul restart, but I wasn't sure whether that was due to the issue being fixed, or simply due to the Consul restart.

The expiration date of the new certs was April 3rd, 19:56 UTC. I just checked the certs of all services again, and the issue may be fixed now, as they all show an expiration date of April 5th, 17:44 UTC. No service or Consul restarts happened since the initial expiration date check on Friday.

I verified all expiration dates with openssl s_client against the Envoy proxy's exposed ports.

jcpreston26 commented 1 year ago

I can confirm the same behavior as being reported by @mmeier86. Did the patch on Friday evening, saw certs due on April 3rd, now seeing a portion of them beginning to be renewed (for April 5th). No restarts of those services/Nomad jobs were performed today.

radykal-com commented 1 year ago

@jcpreston26 not all certificates expiring on 3rd have been rotated to the new ones? only some of them?

magnetarnix commented 1 year ago

In case it's useful to someone, I use this (ugly) bash one-liner to check the expiration date of all of the Envoy sidecars in the cluster:

consul catalog services | while read service ; do echo "* $service" ; curl -XGET -sS --cert "$CONSUL_CLIENT_CERT" --key "$CONSUL_CLIENT_KEY" --cacert "$CONSUL_CACERT" "$CONSUL_HTTP_ADDR/v1/catalog/connect/$service" 2>/dev/null | jq -r '.[] | [.ServiceAddress,.ServicePort] | join(":")' | xargs -rn1 timeout --signal=9 5 openssl s_client -showcerts -connect 2>/dev/null | grep -oE 'NotAfter:.*2023.*' ; done

(the 2023 in grep -oE 'NotAfter:.*2023.*' is to filter out the CA from the certificate chain, it expires in 2024 in my case)

It doesn't cover everything (like terminating gateways), but it's good enough for a quick check of which services are going to expire when. You can grep 'NotAfter' | sort -n the result to get an ~ordered list of the expiration dates for a quick visual check of if you have services that are going to expire soon.

jcpreston26 commented 1 year ago

@radykal-com: no, not all. But more were renewed this morning (some now showing April 6th expirations). The remaining April 3 certs are due to expire this evening, so I can't yet confirm that they updated properly around that timeframe, but I would expect these to be complete by sometime late this afternoon.

FelipeEmerim commented 1 year ago

We also deployed version 1.15.2 two days ago. Even though the certs expire tomorrow, we saw that a few services are already using a newer certificate. We will keep monitoring until tomorrow to see if every service rotates its certificates correctly.

hashi-derek commented 1 year ago

Thank you all for trying out the new build and keeping us informed. I'm glad to hear that things are going smoothly so far for you.

Lord-Y commented 1 year ago

Hello guys,

Here is our stack: Nomad 1.5.1+ent, Consul 1.15.2+ent, Envoy 1.25.1.

I installed Consul 1.15.2 in production on the Consul servers and on Nomad (servers and clients) on Monday in order to fix the issue that we are all having. I stopped/started all our deployments on Nomad, and everything was back up before 12:00pm CEST.

Today, at 12:00pm CEST I checked the status of the platform on nomad clients with:

export CONSUL_CLIENT_CERT=xxx
export CONSUL_CACERT=xxx
export CONSUL_CLIENT_KEY=xxx
export CONSUL_HTTP_ADDR=http://127.0.0.1:8500
export CONSUL_HTTP_TOKEN=xxx
export namespace=production
for service in $(consul catalog services -namespace $namespace)
do
echo namespace $namespace service $service
curl -sH "X-Consul-Token: $CONSUL_HTTP_TOKEN" "127.0.0.1:8500/v1/agent/connect/ca/leaf/$service?index=1&wait=1s&ns=$namespace" | jq -r .ValidAfter,.ValidBefore
done
date
# result:

app1
2023-04-06T09:47:21Z
2023-04-09T09:47:21Z
app1-sidecar-proxy
2023-04-06T09:47:21Z
2023-04-09T09:47:21Z
app2-ssr
2023-04-06T09:47:21Z
2023-04-09T09:47:21Z
app2-sidecar-proxy
2023-04-06T09:47:21Z
2023-04-09T09:47:21Z
Thu Apr  6 09:48:30 UTC 2023

Somehow, we were still having upstream errors, so I went into some Envoy sidecar proxies to check the validity of the certificates, and here is the result:

root@4f157f92fbac:/# ./consul troubleshoot proxy -envoy-admin-endpoint=127.0.0.2:19003 -upstream-ip=127.0.0.1
==> Validation
 ! Certificate chain is expired
  -> Check the logs of the Consul agent configuring the local proxy and ensure XDS updates are being sent to the proxy
 ! Certificate chain is expired
  -> Check the logs of the Consul agent configuring the local proxy and ensure XDS updates are being sent to the proxy
 ! Certificate chain is expired
  -> Check the logs of the Consul agent configuring the local proxy and ensure XDS updates are being sent to the proxy
 ! Certificate chain is expired
  -> Check the logs of the Consul agent configuring the local proxy and ensure XDS updates are being sent to the proxy
 ! Certificate chain is expired
  -> Check the logs of the Consul agent configuring the local proxy and ensure XDS updates are being sent to the proxy
 ! Certificate chain is expired
  -> Check the logs of the Consul agent configuring the local proxy and ensure XDS updates are being sent to the proxy
 ! Certificate chain is expired
  -> Check the logs of the Consul agent configuring the local proxy and ensure XDS updates are being sent to the proxy
 ! Certificate chain is expired
  -> Check the logs of the Consul agent configuring the local proxy and ensure XDS updates are being sent to the proxy
 ✓ Envoy has 0 rejected configurations
 ✓ Envoy has detected 124 connection failure(s)
 ! No listener for upstream "127.0.0.1"
  -> Check that your upstream service is registered with Consul
  -> Make sure your upstream exists by running the `consul[-k8s] troubleshoot upstreams` command
  -> If you are using transparent proxy for this upstream, ensure you have set up allow intentions to the upstream
  -> Check the logs of the Consul agent configuring the local proxy to ensure XDS resources were sent by Consul
 ! No clusters found on route or listener

Even though the certs were good on the Consul side, somehow they were not on the Envoy side.

So I decided to change the leaf certificate ttl with:

 connect {
   ca_config {
     leaf_cert_ttl = 2190h
   }
 }

or

  connect {
   ca_config = {
     leaf_cert_ttl = 2190h
   }
 }

Consul validated both configs, but consul connect ca get-config was still showing 72h. I needed to override the config with:

 cat >config<<EOF
 {
  "Provider": "consul",
  "Config": {
    "IntermediateCertTTL": "8760h",
    "LeafCertTTL": "2190h",
    "RootCertTTL": "87600h"
  },
  "ForceWithoutCrossSigning": false
}
EOF

consul connect ca set-config -config-file=config
consul connect ca get-config

I stopped/started applications in Nomad, but curl -sH "X-Consul-Token: $CONSUL_HTTP_TOKEN" "127.0.0.1:8500/v1/agent/connect/ca/leaf/$service?index=1&wait=1s&ns=$namespace" | jq -r .ValidAfter,.ValidBefore was still showing me certs with a 72h TTL. So I needed to restart the Consul client on both the Nomad servers and clients, and then stop/start the applications. Certificates are now:

2023-04-06T13:00:57Z
2023-07-06T19:00:57Z

So on our platform, we will have to wait to see whether the issue is truly fixed.

FelipeEmerim commented 1 year ago

In our case it seems all certs were updated. Running the command from @magnetarnix's comment on a Consul server, we see that all certs were renewed. We manually checked some services by opening a shell in their pods and using openssl, and found that they were indeed using the updated certs. We also have this version in our dev environment and have not received any reports from our dev teams.

So far, at least on AKS, things seem to be working. We will continue monitoring this for a few more cert renewal cycles to be sure, especially since some people here reported that the fix did not work for them.

jkirschner-hashicorp commented 1 year ago

The remaining April 3 certs are due to expire this evening, so I can't yet confirm that they updated properly around that timeframe, but I would expect these to be complete by sometime late this afternoon.

@jcpreston26 : Did those remaining certs renew themselves before their Apr 3 evening expiration? Does renewal seem to be fully working as expected in your environment now?

jcpreston26 commented 1 year ago

@jkirschner-hashicorp Yes. Everything appears to be working as expected with 1.15.2. A couple of the services that were previously affected now have expiration dates of the 8th and 9th and they have been updating as they go.

jkirschner-hashicorp commented 1 year ago

I decided to change the leaf certificate ttl with: connect { ca_config { leaf_cert_ttl = 2190h } }

@Lord-Y : It seems like you changed the leaf_cert_ttl in a server agent config file. What steps did you take to apply that config? (e.g., Restarting or reloading any of the server agents? Did the leader have this new config?) I want to understand whether there's a separate issue there.

And just to double-check, are/were you still experiencing the leaf cert renewal issue on 1.15.2? Or because your leaf cert ttl is now set to 2190h, do you not really have a means to check quickly (because the lifetime is now much longer than 3 days)?

Lord-Y commented 1 year ago

@jkirschner-hashicorp It seems like you changed the leaf_cert_ttl in a server agent config file. What steps did you take to apply that config? After changing the config, I restarted Consul on all agents in rolling-restart mode.

Did the leader have this new config? No, that's why I used consul connect ca set-config -config-file=config.

And just to double-check, are/were you still experiencing the leaf cert renewal issue on 1.15.2? Without changing the leaf_cert_ttl config, the Consul certs were renewed, but the Envoy sidecar certs were still expired, as shown above in the consul troubleshoot proxy output (==> Validation ! Certificate chain is expired).

After setting the leaf_cert_ttl to 3 months, production has been fine so far.

We are trying to set up a Google Meet with the support team and will add you to the loop.

kisunji commented 1 year ago

Unfortunately, consul troubleshoot proxy may be misleading because it naively checks all certificate chains that Envoy has received. So the "Certificate chain is expired" messages you are seeing may be older expired chains that haven't been garbage collected and are not by themselves proof that Envoy does not have the latest certs. We are aiming to improve this in a future release.

In the meantime, we recommend using Envoy's admin API directly to troubleshoot certificate issues.
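
As a rough example of what that can look like (the admin bind address matches the one used earlier in this thread, but it varies per sidecar; jq is optional):

```
# Ask Envoy directly which certificates it currently holds and when they expire
curl -s http://127.0.0.2:19001/certs \
  | jq '.certificates[].cert_chain[] | {serial_number, valid_from, expiration_time}'
```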

kisunji commented 1 year ago

@Lord-Y given my previous comment, it's difficult to confirm if your initial upstream errors while using 1.15.2 were due to this particular Github issue. Please do keep us updated if anything happens near your new leaf cert expiry dates.

I'd like to clarify that changing the leaf_cert_ttl on server agents will take effect on new leaf certificate signings. Existing services will continue to have the original TTL until they near their expiry, which is when we generate a new leaf cert with the updated TTL.

magnetarnix commented 1 year ago

I can confirm that upgrading from Consul 1.15.0 to Consul 1.15.2 seems to have fixed the problem for us. So far after 72 hours Envoy certificates are renewed as they should.

david-yu commented 1 year ago

Hi everyone, thank you for your patience and also for helping validate that this issue no longer exists on Consul 1.15.2. We have updated our docs to reflect this: https://github.com/hashicorp/consul/pull/17020 and will go ahead and close this issue now that we have received multiple validations that 1.15.2 seems to resolve it.