hashicorp / consul-k8s

First-class support for Consul Service Mesh on Kubernetes
https://www.consul.io/docs/k8s
Mozilla Public License 2.0

consul-dataplane container restarts #1814

Open liad5h opened 1 year ago

liad5h commented 1 year ago


Overview of the Issue

I am using Consul on Kubernetes with Connect enabled. I am trying to use Connect to secure communication between my two apps, static-server and static-client. Connect works great for the first pod I start, regardless of which one it is. The second pod's consul-dataplane container always restarts because its readiness and liveness probes fail.

I did not experience this issue with Consul 1.12.x and chart versions below 1.0.

Consul version: 1.14.2
Chart version: 1.0.2

Reproduction Steps

Steps to reproduce this issue:

Helm Values

```yaml
global:
  name: consul
  datacenter: eu-central-1-qa
  enabled: false
  gossipEncryption:
    secretName: consul-gossip
    secretKey: key
  acls:
    manageSystemACLs: true
    bootstrapToken:
      secretName: consul-bootstrap-acl
      secretKey: token
  metrics:
    enabled: true
    enableAgentMetrics: true
    agentMetricsRetentionTime: "1m"
    defaultPrometheusScrapePath: "/metrics"
    enableGatewayMetrics: false
client:
  enabled: false
server:
  enabled: true
  replicas: 1
  exposeGossipAndRPCPorts: true
  connect: true
  extraConfig: |
    {
      "performance": {
        "raft_multiplier": 1
      },
      "telemetry": {
        "disable_hostname": true
      }
    }
  # resources:
  #   requests:
  #     memory: "1Gi"
  #     cpu: "250m"
  #   limits:
  #     memory: "2Gi"
  #     cpu: "1000m"
terminatingGateways:
  enabled: true
prometheus:
  enabled: false
connectInject:
  enabled: true
  replicas: 1
  default: false
  cni:
    enabled: true
    logLevel: info
    cniBinDir: "/opt/cni/bin"
    cniNetDir: "/etc/cni/net.d"
  transparentProxy:
    defaultEnabled: false
  metrics:
    defaultEnabled: true
    defaultEnableMerging: true
    defaultPrometheusScrapePort: 20200
    defaultPrometheusScrapePath: "/metrics"
  resources:
    requests:
      memory: "50Mi"
      cpu: "50m"
    limits:
      memory: "250Mi"
      cpu: "300m"
ui:
  enabled: true
  metrics:
    enabled: true
    provider: "prometheus"
  service:
    type: ClusterIP
```

Run:

helm upgrade --install --values values.yaml consul hashicorp/consul --namespace consul --version "1.0.2"

cat > /tmp/backend.yaml <<EOF
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: static-server
---
apiVersion: v1
kind: Service
metadata:
  name: static-server
spec:
  selector:
    app: static-server
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: static-server
  name: static-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: static-server
  template:
    metadata:
      annotations:
        consul.hashicorp.com/connect-inject: 'true'
      labels:
        app: static-server
    spec:
      serviceAccountName: static-server
      containers:
        - name: static-server
          image: hashicorp/http-echo:latest
          args:
            - -text="hello world"
            - -listen=:8080
          ports:
            - containerPort: 8080
EOF

cat > /tmp/frontend.yaml <<EOF
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: static-client
---
apiVersion: v1
kind: Service
metadata:
  name: static-client
spec:
  selector:
    app: static-client
  ports:
    - port: 80
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: static-client
  name: static-client
spec:
  replicas: 1
  selector:
    matchLabels:
      app: static-client
  template:
    metadata:
      annotations:
        consul.hashicorp.com/connect-inject: 'true'
      labels:
        app: static-client
    spec:
      serviceAccountName: static-client
      containers:
        - name: static-client
          image: rancher/curlimages-curl:7.73.0
          command: ['/bin/sh', '-c', '--']
          args: ['while true; do sleep 30; done;']
EOF
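To complete the repro, the two manifests above still need to be applied; a sketch of the apply-and-observe steps, assuming kubectl is pointed at the same cluster and namespace the chart was installed into:

```shell
# Apply both workloads; whichever starts second exhibits the
# consul-dataplane restarts described above.
kubectl apply -f /tmp/backend.yaml
kubectl apply -f /tmp/frontend.yaml

# Observe the restart counter and probe failures on the second pod
# (label selectors match the Deployments above):
kubectl get pods -l app=static-client
kubectl describe pod -l app=static-client
```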
Server info

```
agent:
    check_monitors = 0
    check_ttls = 0
    checks = 0
    services = 0
build:
    prerelease =
    revision = 0ba7a401
    version = 1.14.2
    version_metadata =
consul:
    acl = enabled
    bootstrap = true
    known_datacenters = 1
    leader = true
    leader_addr = 10.209.55.131:8300
    server = true
raft:
    applied_index = 8756
    commit_index = 8756
    fsm_pending = 0
    last_contact = 0
    last_log_index = 8756
    last_log_term = 22
    last_snapshot_index = 0
    last_snapshot_term = 0
    latest_configuration = [{Suffrage:Voter ID:66d802f6-563e-1223-da38-5a907b19f317 Address:10.209.55.131:8300} {Suffrage:Voter ID:1e7eaab7-6d3a-273c-38e0-136b669a3555 Address:10.209.16.195:8300}]
    latest_configuration_index = 0
    num_peers = 1
    protocol_version = 3
    protocol_version_max = 3
    protocol_version_min = 0
    snapshot_version_max = 1
    snapshot_version_min = 0
    state = Leader
    term = 22
runtime:
    arch = amd64
    cpu_count = 4
    goroutines = 202
    max_procs = 4
    os = linux
    version = go1.19.2
serf_lan:
    coordinate_resets = 0
    encrypted = true
    event_queue = 0
    event_time = 4
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 2
    members = 2
    query_queue = 0
    query_time = 1
serf_wan:
    coordinate_resets = 0
    encrypted = true
    event_queue = 0
    event_time = 1
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 3
    members = 2
    query_queue = 0
    query_time = 1
```

Operating system and Environment details

Chart version 1.0.2 (Consul 1.14.x)
AWS EKS 1.21

Log Fragments

2022-12-16T16:42:38.990Z+00:00 [info] envoy.admin(15) admin address: 127.0.0.1:19000
2022-12-16T16:42:38.990Z+00:00 [info] envoy.config(15) loading tracing configuration
2022-12-16T16:42:38.991Z+00:00 [info] envoy.config(15) loading 0 static secret(s)
2022-12-16T16:42:38.991Z+00:00 [info] envoy.config(15) loading 2 cluster(s)
2022-12-16T16:42:39.031Z+00:00 [info] envoy.config(15) loading 1 listener(s)
2022-12-16T16:42:39.035Z+00:00 [info] envoy.config(15) loading stats configuration
2022-12-16T16:42:39.035Z+00:00 [info] envoy.runtime(15) RTDS has finished initialization
2022-12-16T16:42:39.035Z+00:00 [info] envoy.upstream(15) cm init: initializing cds
2022-12-16T16:42:39.036Z+00:00 [warning] envoy.main(15) there is no configured limit to the number of allowed active connections. Set a limit via the runtime key overload.global_downstream_max_connections
2022-12-16T16:42:39.036Z+00:00 [info] envoy.main(15) starting main dispatch loop
2022-12-16T16:42:39.591Z [INFO]  consul-dataplane.server-connection-manager: trying to connect to a Consul server
2022-12-16T16:42:39.591Z [INFO]  consul-dataplane.server-connection-manager: current prioritized list of known Consul servers: addresses=[10.209.16.195:8502, 10.209.55.131:8502]
2022-12-16T16:42:39.792Z [ERROR] consul-dataplane.server-connection-manager: connection error: error="fetching supported dataplane features: rpc error: code = Unimplemented desc = unknown service hashicorp.consul.dataplane.DataplaneService"
2022-12-16T16:42:40.017Z+00:00 [warning] envoy.config(15) DeltaAggregatedResources gRPC config stream to consul-dataplane closed: 14, last resolver error: produced zero addresses (previously 8, this server has too many xDS streams open, please try another since 0s ago)
2022-12-16T16:42:40.874Z [INFO]  consul-dataplane.server-connection-manager: trying to connect to a Consul server
2022-12-16T16:42:40.874Z [INFO]  consul-dataplane.server-connection-manager: current prioritized list of known Consul servers: addresses=[10.209.55.131:8502, 10.209.16.195:8502]
2022-12-16T16:42:41.075Z [INFO]  consul-dataplane.server-connection-manager: connected to Consul server: address=10.209.55.131:8502
2022-12-16T16:42:41.075Z [INFO]  consul-dataplane.server-connection-manager: updated known Consul servers from watch stream: addresses=[10.209.16.195:8502, 10.209.55.131:8502]
2022-12-16T16:42:42.180Z+00:00 [warning] envoy.config(15) DeltaAggregatedResources gRPC config stream to consul-dataplane closed: 8, this server has too many xDS streams open, please try another (previously 14, last resolver error: produced zero addresses since 2s ago)
2022-12-16T16:42:43.490Z [INFO]  consul-dataplane.server-connection-manager: trying to connect to a Consul server
2022-12-16T16:42:43.490Z [INFO]  consul-dataplane.server-connection-manager: current prioritized list of known Consul servers: addresses=[10.209.16.195:8502, 10.209.55.131:8502]
2022-12-16T16:42:43.693Z [ERROR] consul-dataplane.server-connection-manager: connection error: error="fetching supported dataplane features: rpc error: code = Unimplemented desc = unknown service hashicorp.consul.dataplane.DataplaneService"
2022-12-16T16:42:45.275Z [INFO]  consul-dataplane.server-connection-manager: trying to connect to a Consul server
2022-12-16T16:42:45.275Z [INFO]  consul-dataplane.server-connection-manager: current prioritized list of known Consul servers: addresses=[10.209.55.131:8502, 10.209.16.195:8502]
2022-12-16T16:42:45.477Z [INFO]  consul-dataplane.server-connection-manager: connected to Consul server: address=10.209.55.131:8502
2022-12-16T16:42:45.477Z [INFO]  consul-dataplane.server-connection-manager: updated known Consul servers from watch stream: addresses=[10.209.16.195:8502, 10.209.55.131:8502]
2022-12-16T16:42:48.048Z [INFO]  consul-dataplane.server-connection-manager: trying to connect to a Consul server
2022-12-16T16:42:48.048Z [INFO]  consul-dataplane.server-connection-manager: current prioritized list of known Consul servers: addresses=[10.209.16.195:8502, 10.209.55.131:8502]
2022-12-16T16:42:48.249Z [ERROR] consul-dataplane.server-connection-manager: connection error: error="fetching supported dataplane features: rpc error: code = Unimplemented desc = unknown service hashicorp.consul.dataplane.DataplaneService"
2022-12-16T16:42:52.756Z [INFO]  consul-dataplane.server-connection-manager: trying to connect to a Consul server
2022-12-16T16:42:52.756Z [INFO]  consul-dataplane.server-connection-manager: current prioritized list of known Consul servers: addresses=[10.209.55.131:8502, 10.209.16.195:8502]
2022-12-16T16:42:52.957Z [INFO]  consul-dataplane.server-connection-manager: connected to Consul server: address=10.209.55.131:8502
2022-12-16T16:42:52.958Z [INFO]  consul-dataplane.server-connection-manager: updated known Consul servers from watch stream: addresses=[10.209.16.195:8502, 10.209.55.131:8502]
2022-12-16T16:42:54.035Z+00:00 [warning] envoy.config(15) gRPC config: initial fetch timed out for type.googleapis.com/envoy.config.cluster.v3.Cluster
2022-12-16T16:42:54.035Z+00:00 [info] envoy.upstream(15) cm init: all clusters initialized
2022-12-16T16:42:54.035Z+00:00 [info] envoy.main(15) all clusters initialized. initializing init manager

2022-12-16T16:43:05.786Z+00:00 [warning] envoy.main(15) caught ENVOY_SIGTERM
2022-12-16T16:43:05.786Z+00:00 [info] envoy.main(15) shutting down server instance
2022-12-16T16:43:05.786Z+00:00 [info] envoy.main(15) main dispatch loop exited
2022-12-16T16:43:05.786Z [INFO]  consul-dataplane.metrics: stopping the merged  server
2022-12-16T16:43:05.786Z [INFO]  consul-dataplane.metrics: stopping consul dp promtheus server
2022-12-16T16:43:05.786Z [INFO]  consul-dataplane.server-connection-manager: stopping
2022-12-16T16:43:05.786Z [INFO]  consul-dataplane: context done stopping xds server
2022-12-16T16:43:05.787Z [INFO]  consul-dataplane: envoy process exited: error="signal: killed"
2022-12-16T16:43:05.792Z [INFO]  consul-dataplane.server-connection-manager: ACL auth method logout succeeded

kubernetes events from the pod:

Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  4m59s                  default-scheduler  Successfully assigned default/static-client-564d47f975-k9zh6 to ip-10-209-55-131.eu-central-1.compute.internal
  Normal   Pulled     4m58s                  kubelet            Container image "hashicorp/consul-k8s-control-plane:1.0.2" already present on machine
  Normal   Created    4m58s                  kubelet            Created container consul-connect-inject-init
  Normal   Started    4m58s                  kubelet            Started container consul-connect-inject-init
  Normal   Pulled     4m56s                  kubelet            Container image "rancher/curlimages-curl:7.73.0" already present on machine
  Normal   Created    4m56s                  kubelet            Created container static-client
  Normal   Started    4m56s                  kubelet            Started container static-client
  Normal   Pulled     4m29s (x2 over 4m56s)  kubelet            Container image "hashicorp/consul-dataplane:1.0.0" already present on machine
  Normal   Created    4m29s (x2 over 4m56s)  kubelet            Created container consul-dataplane
  Normal   Started    4m29s (x2 over 4m56s)  kubelet            Started container consul-dataplane
  Normal   Killing    4m29s                  kubelet            Container consul-dataplane failed liveness probe, will be restarted
  Warning  Unhealthy  4m19s (x7 over 4m55s)  kubelet            Readiness probe failed: dial tcp 10.209.40.109:20000: connect: connection refused
  Warning  Unhealthy  4m9s (x5 over 4m49s)   kubelet            Liveness probe failed: dial tcp 10.209.40.109:20000: connect: connection refused
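The failing readiness and liveness probes in these events are plain TCP checks against the dataplane's proxy port (20000, per the probe messages). A minimal stand-in for that check, just a bare TCP dial like the kubelet performs, can be sketched as:

```shell
# Mimic a kubelet tcpSocket probe: succeed iff a TCP connect completes
# within 1 second. Prints "ready" or "connection refused".
tcp_probe() {
  local host=$1 port=$2
  if timeout 1 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "ready"
  else
    echo "connection refused"
  fi
}

# e.g. against the pod IP and port from the events above (hypothetical
# values, copied from this report; will only resolve inside the cluster):
# tcp_probe 10.209.40.109 20000
```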
chris93111 commented 1 year ago

Hi, same issue with the Helm chart and Consul 1.14.2.

Any update?

KESNERO commented 1 year ago

I'm facing the same issue. Is there any solution yet?

christophermichaeljohnston commented 11 months ago

Been having the same issue for a while now and opened #2509 before seeing this one. On another read, it's a different issue but similar behavior... the dataplane just self-destructs. It seems like consul-dataplane never reaches a healthy state, so it exits.