hashicorp / consul-k8s

First-class support for Consul Service Mesh on Kubernetes
https://www.consul.io/docs/k8s
Mozilla Public License 2.0

TLS error 268435612 when communicating between pods in same cluster with connect/transparent proxy enabled #3914

Closed. craigbehnke closed this issue 7 months ago.

craigbehnke commented 7 months ago

Question

I am getting a TLS error when two of my services (router -> initial-context) communicate over the mesh with sidecars/transparent proxy enabled. The router makes the request to the initial-context service using the URL http://initial-context.service.consul/graph.

Relevant log output from the initial context service's consul-dataplane:

```txt
2024-04-14T18:45:42.598Z+00:00 [debug] envoy.conn_handler(24) [Tags: "ConnectionId":"20791"] new connection from 10.244.1.241:47238
2024-04-14T18:45:42.599Z+00:00 [debug] envoy.connection(24) [Tags: "ConnectionId":"20791"] remote address:10.244.1.241:47238,TLS_error:|268435612:SSL routines:OPENSSL_internal:HTTP_REQUEST:TLS_error_end
2024-04-14T18:45:42.599Z+00:00 [debug] envoy.connection(24) [Tags: "ConnectionId":"20791"] closing socket: 0
2024-04-14T18:45:42.599Z+00:00 [debug] envoy.connection(24) [Tags: "ConnectionId":"20791"] remote address:10.244.1.241:47238,TLS_error:|268435612:SSL routines:OPENSSL_internal:HTTP_REQUEST:TLS_error_end:TLS_error_end
2024-04-14T18:45:42.599Z+00:00 [debug] envoy.conn_handler(24) [Tags: "ConnectionId":"20791"] adding to cleanup list
2024-04-14T18:45:46.272Z+00:00 [debug] envoy.main(15) flushing stats
```

Nothing I have tried so far has worked, and searching for the error turns up remarkably few results.

How can I fix this to get the mesh (with TLS verification) working?

CLI Commands (consul-k8s, consul-k8s-control-plane, helm)

consul validate:

```txt
consul validate /consul/config
"autopilot.disable_upgrade_migration" is a Consul Enterprise configuration and will have no effect
BootstrapExpect is set to 1; this is the same as Bootstrap mode.
bootstrap = true: do not enable unless necessary
if auto_encrypt.allow_tls is turned on, tls.internal_rpc.verify_incoming should be enabled (either explicitly or via tls.defaults.verify_incoming). It is necessary to turn it off during a migration to TLS, but it should definitely be turned on afterwards.
Configuration is valid!
```
consul-k8s status:

```txt
consul-k8s status
==> Consul Status Summary
Name    Namespace  Status    Chart Version  AppVersion  Revision  Last Updated
consul  consul     deployed  1.4.1          1.18.1      6         2024/04/12 22:41:42 UTC

==> Config:
(see config below in Helm Configuration)

==> Status Of Helm Hooks:
consul-gossip-encryption-autogenerate ServiceAccount: Succeeded
consul-tls-init ServiceAccount: Succeeded
consul-gossip-encryption-autogenerate Role: Succeeded
consul-tls-init Role: Succeeded
consul-gossip-encryption-autogenerate RoleBinding: Succeeded
consul-tls-init RoleBinding: Succeeded
consul-gossip-encryption-autogenerate Job: Succeeded
consul-tls-init Job: Succeeded
Consul servers healthy 1/1
```
consul-k8s troubleshoot upstreams (router):

```txt
consul-k8s troubleshoot upstreams -pod router-1234abcd -n my-namespace
==> Upstreams (explicit upstreams only) (0)

==> Upstream IPs (transparent proxy only) (6)
IPs                          Virtual  Cluster Names
10.245.103.70, 240.0.0.3     true     service1.default.(...).consul
10.245.187.44, 240.0.0.5     true     service2.default.(...).consul
10.245.228.123, 240.0.0.2    true     initial-context.default.(...).consul
10.245.45.217, 240.0.0.6     true     service3.default.(...).consul
10.245.6.148, 240.0.0.1      true     service4.default.(...).consul
10.245.63.230, 240.0.0.4     true     service5.default.(...).consul

If you cannot find the upstream address or cluster for a transparent proxy upstream:
-> Check intentions: Transparent proxy upstreams are configured based on intentions. Make sure you have configured intentions to allow traffic to your upstream.
-> To check that the right cluster is being dialed, run a DNS lookup for the upstream you are dialing. For example, run `dig backend.svc.consul` to return the IP address for the `backend` service. If the address you get from that is missing from the upstream IPs, it means that your proxy may be misconfigured.
```
consul-k8s troubleshoot proxy (router --> initial-context):

```txt
consul-k8s troubleshoot proxy -upstream-ip 10.245.228.123 -pod router-1234abcd -n my-namespace
==> Validation
✓ Certificates are valid
✓ Envoy has 0 rejected configurations
✓ Envoy has detected 0 connection failure(s)
✓ Listener for upstream "10.245.228.123" found
✓ Cluster "initial-context.default.(...).consul" for upstream "10.245.228.123" found
✓ Healthy endpoints for cluster "initial-context.default.(...).consul" for upstream "10.245.228.123" found
✓ Upstream resources are valid
```

If there are more command outputs you would like to see, let me know and I will add them here.

Helm Configuration

values.yaml:

```yaml
global:
  name: consul
  logLevel: "debug"
  recursors:
    - "8.8.8.8"
    - "8.8.4.4"
  metrics:
    enabled: true
    enableAgentMetrics: true
    enableHostMetrics: true
    enableGatewayMetrics: true
    enableTelemetryCollector: false
  gossipEncryption:
    autoGenerate: true
  tls:
    enabled: true
    enableAutoEncrypt: true
    verify: true
  acls:
    manageSystemACLs: true
syncCatalog:
  enabled: true
  default: false
  toConsul: true
  toK8s: false
server:
  replicas: 1
  connect: true
  exposeService:
    enabled: false
client:
  enable: true
dns:
  enabled: true
  enableRedirection: true
ui:
  enabled: true
connectInject:
  enabled: true
  default: true
  aclBindingRuleSelector: ""
  transparentProxy:
    defaultEnabled: true
    defaultOverwriteProbes: true
  apiGateway:
    manageExternalCRDs: true
    manageNonStandardCRDs: false
  sidecarProxy:
    resources:
      requests:
        cpu: 25m
        memory: 60Mi
      limits:
        cpu: 50m
        memory: 60Mi
  namespaceSelector: |
    matchLabels:
      consul-enabled: enabled
  metrics:
    defaultEnabled: true
    defaultEnableMerging: true
    defaultPrometheusScrapePort: 20200
    defaultPrometheusScrapePath: "/metrics"
ingressGateways:
  enabled: false
terminatingGateways:
  defaults:
    resources:
      requests:
        memory: "70Mi"
        cpu: "50m"
      limits:
        memory: "70Mi"
        cpu: "50m"
  enabled: true
  logLevel: "trace"
telemetryCollector:
  enabled: false
```

Logs

Relevant logs from router (consul-dataplane). I have verified that 10.244.0.66 is the initial-context pod IP.

```txt
2024-04-15T02:44:21.237Z [DEBUG] consul-dataplane.dns-proxy.udp: dns messaged received from consul: length=151
2024-04-15T02:44:21.248Z [DEBUG] consul-dataplane.dns-proxy.udp: dns messaged received from consul: length=151
2024-04-15T02:44:21.259Z [DEBUG] consul-dataplane.dns-proxy.udp: dns messaged received from consul: length=141
2024-04-15T02:44:21.270Z [DEBUG] consul-dataplane.dns-proxy.udp: dns messaged received from consul: length=141
2024-04-15T02:44:21.281Z [DEBUG] consul-dataplane.dns-proxy.udp: dns messaged received from consul: length=137
2024-04-15T02:44:21.335Z [DEBUG] consul-dataplane.dns-proxy.udp: dns messaged received from consul: length=137
2024-04-15T02:44:21.338Z [DEBUG] consul-dataplane.dns-proxy.udp: dns messaged received from consul: length=64
2024-04-15T02:44:21.339Z+00:00 [debug] envoy.filter(24) original_dst: set destination to 10.244.0.66:80
2024-04-15T02:44:21.339Z+00:00 [debug] envoy.filter(24) [Tags: "ConnectionId":"17831"] new tcp proxy session
2024-04-15T02:44:21.339Z+00:00 [debug] envoy.filter(24) [Tags: "ConnectionId":"17831"] Creating connection to cluster original-destination
2024-04-15T02:44:21.340Z+00:00 [debug] envoy.upstream(24) transport socket match, socket default selected for host with address 10.244.0.66:80
2024-04-15T02:44:21.340Z+00:00 [debug] envoy.upstream(24) Created host original-destination10.244.0.66:80 10.244.0.66:80.
2024-04-15T02:44:21.340Z+00:00 [debug] envoy.misc(24) Allocating TCP conn pool
2024-04-15T02:44:21.340Z+00:00 [debug] envoy.upstream(15) addHost() adding original-destination10.244.0.66:80 10.244.0.66:80.
2024-04-15T02:44:21.340Z+00:00 [debug] envoy.upstream(15) membership update for TLS cluster original-destination added 1 removed 0
2024-04-15T02:44:21.340Z+00:00 [debug] envoy.upstream(15) re-creating local LB for TLS cluster original-destination
2024-04-15T02:44:21.340Z+00:00 [debug] envoy.upstream(23) membership update for TLS cluster original-destination added 1 removed 0
2024-04-15T02:44:21.340Z+00:00 [debug] envoy.upstream(23) re-creating local LB for TLS cluster original-destination
2024-04-15T02:44:21.340Z+00:00 [debug] envoy.pool(24) trying to create new connection
2024-04-15T02:44:21.340Z+00:00 [debug] envoy.pool(24) creating a new connection (connecting=0)
2024-04-15T02:44:21.341Z+00:00 [debug] envoy.connection(24) [Tags: "ConnectionId":"17832"] connecting to 10.244.0.66:80
2024-04-15T02:44:21.341Z+00:00 [debug] envoy.connection(24) [Tags: "ConnectionId":"17832"] connection in progress
2024-04-15T02:44:21.341Z+00:00 [debug] envoy.conn_handler(24) [Tags: "ConnectionId":"17831"] new connection from 10.244.1.241:52984
2024-04-15T02:44:21.341Z+00:00 [debug] envoy.upstream(24) membership update for TLS cluster original-destination added 1 removed 0
2024-04-15T02:44:21.342Z+00:00 [debug] envoy.upstream(24) re-creating local LB for TLS cluster original-destination
2024-04-15T02:44:21.345Z+00:00 [debug] envoy.connection(24) [Tags: "ConnectionId":"17832"] connected
2024-04-15T02:44:21.345Z+00:00 [debug] envoy.pool(24) [Tags: "ConnectionId":"17832"] attaching to next stream
2024-04-15T02:44:21.345Z+00:00 [debug] envoy.pool(24) [Tags: "ConnectionId":"17832"] creating stream
2024-04-15T02:44:21.345Z+00:00 [debug] envoy.router(24) Attached upstream connection [C17832] to downstream connection [C17831]
2024-04-15T02:44:21.346Z+00:00 [debug] envoy.filter(24) [Tags: "ConnectionId":"17831"] TCP:onUpstreamEvent(), requestedServerName:
2024-04-15T02:44:21.349Z+00:00 [debug] envoy.router(23) [Tags: "ConnectionId":"17816","StreamId":"11384318054642815300"] upstream headers complete: end_stream=false
2024-04-15T02:44:21.349Z+00:00 [debug] envoy.http(23) [Tags: "ConnectionId":"17816","StreamId":"11384318054642815300"] encoding headers via codec (end_stream=false):
  ':status', '200'
  'content-type', 'application/json'
  'vary', 'origin'
  'content-encoding', 'gzip'
  'access-control-allow-origin', '*'
  'date', 'Mon, 15 Apr 2024 02:44:21 GMT'
  'x-envoy-upstream-service-time', '151'
  'server', 'envoy'
2024-04-15T02:44:21.349Z+00:00 [debug] envoy.client(23) [Tags: "ConnectionId":"17817"] response complete
2024-04-15T02:44:21.349Z+00:00 [debug] envoy.http(23) [Tags: "ConnectionId":"17816","StreamId":"11384318054642815300"] Codec completed encoding stream.
2024-04-15T02:44:21.349Z+00:00 [debug] envoy.connection(24) [Tags: "ConnectionId":"17831"] remote close
2024-04-15T02:44:21.349Z+00:00 [debug] envoy.connection(24) [Tags: "ConnectionId":"17831"] closing socket: 0
2024-04-15T02:44:21.350Z+00:00 [debug] envoy.connection(24) [Tags: "ConnectionId":"17832"] closing data_to_write=0 type=0
2024-04-15T02:44:21.350Z+00:00 [debug] envoy.connection(24) [Tags: "ConnectionId":"17832"] closing socket: 1
2024-04-15T02:44:21.350Z+00:00 [debug] envoy.pool(24) [Tags: "ConnectionId":"17832"] client disconnected, failure reason:
2024-04-15T02:44:21.350Z+00:00 [debug] envoy.pool(24) invoking idle callbacks - is_draining_for_deletion_=false
2024-04-15T02:44:21.350Z+00:00 [debug] envoy.pool(24) [Tags: "ConnectionId":"17832"] destroying stream: 0 remaining
2024-04-15T02:44:21.350Z+00:00 [debug] envoy.pool(24) invoking idle callbacks - is_draining_for_deletion_=false
```
Relevant logs from initial-context (consul-dataplane). I have verified that 10.244.1.241 is the router pod IP.

```txt
2024-04-15T02:50:25.545Z+00:00 [debug] envoy.conn_handler(23) [Tags: "ConnectionId":"24941"] new connection from 10.244.1.241:52194
2024-04-15T02:50:25.632Z+00:00 [debug] envoy.connection(23) [Tags: "ConnectionId":"24941"] remote address:10.244.1.241:52194,TLS_error:|268435612:SSL routines:OPENSSL_internal:HTTP_REQUEST:TLS_error_end
2024-04-15T02:50:25.632Z+00:00 [debug] envoy.connection(23) [Tags: "ConnectionId":"24941"] closing socket: 0
2024-04-15T02:50:25.632Z+00:00 [debug] envoy.connection(23) [Tags: "ConnectionId":"24941"] remote address:10.244.1.241:52194,TLS_error:|268435612:SSL routines:OPENSSL_internal:HTTP_REQUEST:TLS_error_end:TLS_error_end
2024-04-15T02:50:25.632Z+00:00 [debug] envoy.conn_handler(23) [Tags: "ConnectionId":"24941"] adding to cleanup list
```
Complete logs from initial-context (consul-connect-inject-init). I have verified that 10.244.1.234 is the consul-server pod IP.

```txt
2024-04-13T02:18:19.129Z [INFO]  consul-server-connection-manager: trying to connect to a Consul server
2024-04-13T02:18:19.231Z [DEBUG] consul-server-connection-manager: Resolved DNS name: name=consul-server.consul.svc ip-addrs=["{10.244.1.234 }"]
2024-04-13T02:18:19.231Z [INFO]  consul-server-connection-manager: discovered Consul servers: addresses=[10.244.1.234:8502]
2024-04-13T02:18:19.231Z [INFO]  consul-server-connection-manager: current prioritized list of known Consul servers: addresses=[10.244.1.234:8502]
2024-04-13T02:18:19.231Z [DEBUG] consul-server-connection-manager: switching to Consul server: address=10.244.1.234:8502
2024-04-13T02:18:19.828Z [INFO]  consul-server-connection-manager: ACL auth method login succeeded: accessorID=559ce7f4-8331-4462-8caa-220cbfecea6d
2024-04-13T02:18:19.925Z [DEBUG] consul-server-connection-manager: feature: supported=true name=DATAPLANE_FEATURES_EDGE_CERTIFICATE_MANAGEMENT
2024-04-13T02:18:19.925Z [DEBUG] consul-server-connection-manager: feature: supported=true name=DATAPLANE_FEATURES_ENVOY_BOOTSTRAP_CONFIGURATION
2024-04-13T02:18:19.925Z [DEBUG] consul-server-connection-manager: feature: supported=false name=DATAPLANE_FEATURES_FIPS
2024-04-13T02:18:19.925Z [DEBUG] consul-server-connection-manager: feature: supported=true name=DATAPLANE_FEATURES_WATCH_SERVERS
2024-04-13T02:18:19.925Z [INFO]  consul-server-connection-manager: connected to Consul server: address=10.244.1.234:8502
2024-04-13T02:18:20.036Z [INFO]  Registered service has been detected: service=initial-context
2024-04-13T02:18:20.036Z [INFO]  Registered service has been detected: service=initial-context-sidecar-proxy
2024-04-13T02:18:20.036Z [INFO]  consul-server-connection-manager: stopping
2024-04-13T02:18:20.132Z [INFO]  consul-server-connection-manager: ACL auth method logout succeeded
2024-04-13T02:18:20.132Z [DEBUG] consul-server-connection-manager: backoff: retry after=638.791208ms
2024-04-13T02:18:20.132Z [DEBUG] consul-server-connection-manager: aborting: error="context canceled"
2024-04-13T02:18:21.029Z [INFO]  Successfully applied traffic redirection rules
2024-04-13T02:18:21.029Z [INFO]  Connect initialization completed
2024-04-13T02:18:21.029Z [INFO]  consul-server-connection-manager: stopping
```
Complete logs from router (consul-connect-inject-init). I have verified that 10.244.1.234 is the consul-server pod IP.

```txt
2024-04-13T02:18:12.926Z [INFO]  consul-server-connection-manager: trying to connect to a Consul server
2024-04-13T02:18:12.934Z [DEBUG] consul-server-connection-manager: Resolved DNS name: name=consul-server.consul.svc ip-addrs=["{10.244.1.234 }"]
2024-04-13T02:18:12.934Z [INFO]  consul-server-connection-manager: discovered Consul servers: addresses=[10.244.1.234:8502]
2024-04-13T02:18:12.934Z [INFO]  consul-server-connection-manager: current prioritized list of known Consul servers: addresses=[10.244.1.234:8502]
2024-04-13T02:18:12.934Z [DEBUG] consul-server-connection-manager: switching to Consul server: address=10.244.1.234:8502
2024-04-13T02:18:13.329Z [INFO]  consul-server-connection-manager: ACL auth method login succeeded: accessorID=a7565bec-cfd9-c3a7-5b8d-4b5b72210b38
2024-04-13T02:18:13.331Z [DEBUG] consul-server-connection-manager: feature: supported=true name=DATAPLANE_FEATURES_WATCH_SERVERS
2024-04-13T02:18:13.331Z [DEBUG] consul-server-connection-manager: feature: supported=true name=DATAPLANE_FEATURES_EDGE_CERTIFICATE_MANAGEMENT
2024-04-13T02:18:13.331Z [DEBUG] consul-server-connection-manager: feature: supported=true name=DATAPLANE_FEATURES_ENVOY_BOOTSTRAP_CONFIGURATION
2024-04-13T02:18:13.332Z [DEBUG] consul-server-connection-manager: feature: supported=false name=DATAPLANE_FEATURES_FIPS
2024-04-13T02:18:13.332Z [INFO]  consul-server-connection-manager: connected to Consul server: address=10.244.1.234:8502
2024-04-13T02:18:13.340Z [INFO]  Registered service has been detected: service=router
2024-04-13T02:18:13.341Z [INFO]  Registered service has been detected: service=router-sidecar-proxy
2024-04-13T02:18:13.341Z [INFO]  consul-server-connection-manager: stopping
2024-04-13T02:18:13.347Z [INFO]  consul-server-connection-manager: ACL auth method logout succeeded
2024-04-13T02:18:13.347Z [DEBUG] consul-server-connection-manager: backoff: retry after=383.794392ms
2024-04-13T02:18:13.347Z [DEBUG] consul-server-connection-manager: aborting: error="context canceled"
2024-04-13T02:18:14.726Z [INFO]  Successfully applied traffic redirection rules
2024-04-13T02:18:14.925Z [INFO]  Connect initialization completed
2024-04-13T02:18:14.925Z [INFO]  consul-server-connection-manager: stopping
```

Current understanding and expected behavior

My understanding is that Consul should automatically create and manage the TLS certificates and distribute them to the Envoy sidecars, and that this should work without any intervention on my part beyond the Helm configuration.

Environment details

Values.yaml used is included above.

consul-k8s: v1.4.0

consul:

Consul v1.18.1
Revision 98cb473c
Build Date 2024-03-26T21:59:08Z
Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)

Kubernetes is DigitalOcean (DOKS) v1.29.1-do.0.

Cilium, CoreDNS, CSI, Hubble, and Konnectivity are installed and managed by DigitalOcean.

I have deployed Traefik as the ingress controller.

All infrastructure has been deployed using Terraform.

Additional Context

Router -> Initial Context ServiceIntentions:

```yaml
apiVersion: consul.hashicorp.com/v1alpha1
kind: ServiceIntentions
metadata:
  name: initial-context
  namespace: my-namespace
spec:
  destination:
    name: initial-context
  sources:
    - name: router
      permissions:
        - action: allow
          http:
            methods:
              - GET
              - POST
            pathExact: /graph
```
blake commented 7 months ago

@craigbehnke Your application is trying to connect directly to the pod IP instead of the cluster IP of the K8s Service (10.245.228.123).

The traffic from the downstream proxy is being routed through the original-destination cluster, which passes the connection straight through as plain TCP without encryption. The upstream then throws an error because it expects TLS traffic but instead receives an unencrypted connection from the downstream.
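
To illustrate the difference (purely illustrative commands, assuming Consul DNS is resolvable from inside the router pod), a `.service.consul` lookup returns the registered instance (pod) IPs, while a `.virtual.consul` lookup returns an address the transparent proxy recognizes as a mesh upstream:

```txt
# Returns the instance (pod) IP, e.g. 10.244.0.66 - connections to this IP
# are passed through the original-destination cluster without mTLS:
dig +short initial-context.service.consul

# Returns a virtual/cluster IP that Envoy matches to the upstream cluster
# and routes over mTLS:
dig +short initial-context.virtual.consul
```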

There are two potential solutions to this.

  1. Configure your application to connect to the upstream using the cluster IP address from the K8s Service.
  2. Enable transparentProxy.dialedDirectly=true in the ServiceDefaults config for the upstream service. This will allow downstream pods to access individual service instances by connecting to the pod IPs instead of the Service address (see the sketch below).
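
For option 2, a minimal ServiceDefaults sketch might look like the following (the name and namespace are taken from this thread; `protocol: http` is assumed here because the intention above uses HTTP-level permissions):

```yaml
apiVersion: consul.hashicorp.com/v1alpha1
kind: ServiceDefaults
metadata:
  name: initial-context
  namespace: my-namespace
spec:
  # Assumed: HTTP-level intentions require the service protocol to be http.
  protocol: http
  transparentProxy:
    # Allow downstreams to dial this service's pod IPs directly;
    # traffic is still routed through the sidecar over mTLS.
    dialedDirectly: true
```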
craigbehnke commented 7 months ago

@blake You are a lifesaver! Thank you very much!

As an aside, is it better to target the service IP instead of the pod IP? If so, what should I change to make that work? (I thought the {svc-name}.service.consul address would resolve to the service-level resource.)

blake commented 7 months ago

@craigbehnke You can either look up the service using the Kubernetes Service DNS name, or the Consul virtual IP address using the <name>.virtual.consul hostname. Either lookup will return an IP that allows Envoy to correctly match the incoming connection to the correct upstream service and route it appropriately.
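
For example (illustrative URLs, assuming the service listens on port 80 and lives in my-namespace):

```txt
# Kubernetes Service DNS name (resolves to the ClusterIP, 10.245.228.123 here):
curl http://initial-context.my-namespace.svc.cluster.local/graph

# Consul virtual address lookup:
curl http://initial-context.virtual.consul/graph
```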