hashicorp / consul-k8s

First-class support for Consul Service Mesh on Kubernetes
https://www.consul.io/docs/k8s
Mozilla Public License 2.0

consul-dataplane container restarts #1814

Open liad5h opened 1 year ago

liad5h commented 1 year ago


Overview of the Issue

I am using Consul on Kubernetes with Connect enabled. I am trying to use Connect to secure communication between my two apps, static-server and static-client. Connect works great for the first pod I start, regardless of which one it is. The second pod's consul-dataplane container always restarts because its readiness and liveness probes fail.

I did not experience this issue with Consul 1.12.x and chart versions below 1.0.

Consul version: 1.14.2
Chart version: 1.0.2

Reproduction Steps

Steps to reproduce this issue:

Helm Values

```yaml
global:
  name: consul
  datacenter: eu-central-1-qa
  enabled: false
  gossipEncryption:
    secretName: consul-gossip
    secretKey: key
  acls:
    manageSystemACLs: true
    bootstrapToken:
      secretName: consul-bootstrap-acl
      secretKey: token
  metrics:
    enabled: true
    enableAgentMetrics: true
    agentMetricsRetentionTime: "1m"
    defaultPrometheusScrapePath: "/metrics"
    enableGatewayMetrics: false
client:
  enabled: false
server:
  enabled: true
  replicas: 1
  exposeGossipAndRPCPorts: true
  connect: true
  extraConfig: |
    {
      "performance": {
        "raft_multiplier": 1
      },
      "telemetry": {
        "disable_hostname": true
      }
    }
  # resources:
  #   requests:
  #     memory: "1Gi"
  #     cpu: "250m"
  #   limits:
  #     memory: "2Gi"
  #     cpu: "1000m"
terminatingGateways:
  enabled: true
prometheus:
  enabled: false
connectInject:
  enabled: true
  replicas: 1
  default: false
  cni:
    enabled: true
    logLevel: info
    cniBinDir: "/opt/cni/bin"
    cniNetDir: "/etc/cni/net.d"
  transparentProxy:
    defaultEnabled: false
  metrics:
    defaultEnabled: true
    defaultEnableMerging: true
    defaultPrometheusScrapePort: 20200
    defaultPrometheusScrapePath: "/metrics"
  resources:
    requests:
      memory: "50Mi"
      cpu: "50m"
    limits:
      memory: "250Mi"
      cpu: "300m"
ui:
  enabled: true
  metrics:
    enabled: true
    provider: "prometheus"
  service:
    type: ClusterIP
```

Run:

helm upgrade --install --values values.yaml consul hashicorp/consul --namespace consul --version "1.0.2"

cat > /tmp/backend.yaml <<EOF
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: static-server
---
apiVersion: v1
kind: Service
metadata:
  name: static-server
spec:
  selector:
    app: static-server
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: static-server
  name: static-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: static-server
  template:
    metadata:
      annotations:
        consul.hashicorp.com/connect-inject: 'true'
      labels:
        app: static-server
    spec:
      serviceAccountName: static-server
      containers:
        - name: static-server
          image: hashicorp/http-echo:latest
          args:
            - -text="hello world"
            - -listen=:8080
          ports:
            - containerPort: 8080
EOF

cat > /tmp/frontend.yaml <<EOF
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: static-client
---
apiVersion: v1
kind: Service
metadata:
  name: static-client
spec:
  selector:
    app: static-client
  ports:
    - port: 80
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: static-client
  name: static-client
spec:
  replicas: 1
  selector:
    matchLabels:
      app: static-client
  template:
    metadata:
      annotations:
        consul.hashicorp.com/connect-inject: 'true'
      labels:
        app: static-client
    spec:
      serviceAccountName: static-client
      containers:
        - name: static-client
          image: rancher/curlimages-curl:7.73.0
          command: ['/bin/sh', '-c', '--']
          args: ['while true; do sleep 30; done;']
EOF
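To complete the repro, the two manifests above still need to be applied; a sketch of the apply-and-observe steps, assuming kubectl is pointed at the same cluster and namespace the chart was installed into:

```shell
# Apply both workloads; whichever starts second exhibits the
# consul-dataplane restarts described above.
kubectl apply -f /tmp/backend.yaml
kubectl apply -f /tmp/frontend.yaml

# Observe the restart counter and probe failures on the second pod
# (label selectors match the Deployments above):
kubectl get pods -l app=static-client
kubectl describe pod -l app=static-client
```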
Server info

```
agent:
    check_monitors = 0
    check_ttls = 0
    checks = 0
    services = 0
build:
    prerelease =
    revision = 0ba7a401
    version = 1.14.2
    version_metadata =
consul:
    acl = enabled
    bootstrap = true
    known_datacenters = 1
    leader = true
    leader_addr = 10.209.55.131:8300
    server = true
raft:
    applied_index = 8756
    commit_index = 8756
    fsm_pending = 0
    last_contact = 0
    last_log_index = 8756
    last_log_term = 22
    last_snapshot_index = 0
    last_snapshot_term = 0
    latest_configuration = [{Suffrage:Voter ID:66d802f6-563e-1223-da38-5a907b19f317 Address:10.209.55.131:8300} {Suffrage:Voter ID:1e7eaab7-6d3a-273c-38e0-136b669a3555 Address:10.209.16.195:8300}]
    latest_configuration_index = 0
    num_peers = 1
    protocol_version = 3
    protocol_version_max = 3
    protocol_version_min = 0
    snapshot_version_max = 1
    snapshot_version_min = 0
    state = Leader
    term = 22
runtime:
    arch = amd64
    cpu_count = 4
    goroutines = 202
    max_procs = 4
    os = linux
    version = go1.19.2
serf_lan:
    coordinate_resets = 0
    encrypted = true
    event_queue = 0
    event_time = 4
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 2
    members = 2
    query_queue = 0
    query_time = 1
serf_wan:
    coordinate_resets = 0
    encrypted = true
    event_queue = 0
    event_time = 1
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 3
    members = 2
    query_queue = 0
    query_time = 1
```

Operating system and Environment details

Chart version 1.0.2 (Consul 1.14.x)
AWS EKS 1.21

Log Fragments

2022-12-16T16:42:38.990Z+00:00 [info] envoy.admin(15) admin address: 127.0.0.1:19000
2022-12-16T16:42:38.990Z+00:00 [info] envoy.config(15) loading tracing configuration
2022-12-16T16:42:38.991Z+00:00 [info] envoy.config(15) loading 0 static secret(s)
2022-12-16T16:42:38.991Z+00:00 [info] envoy.config(15) loading 2 cluster(s)
2022-12-16T16:42:39.031Z+00:00 [info] envoy.config(15) loading 1 listener(s)
2022-12-16T16:42:39.035Z+00:00 [info] envoy.config(15) loading stats configuration
2022-12-16T16:42:39.035Z+00:00 [info] envoy.runtime(15) RTDS has finished initialization
2022-12-16T16:42:39.035Z+00:00 [info] envoy.upstream(15) cm init: initializing cds
2022-12-16T16:42:39.036Z+00:00 [warning] envoy.main(15) there is no configured limit to the number of allowed active connections. Set a limit via the runtime key overload.global_downstream_max_connections
2022-12-16T16:42:39.036Z+00:00 [info] envoy.main(15) starting main dispatch loop
2022-12-16T16:42:39.591Z [INFO]  consul-dataplane.server-connection-manager: trying to connect to a Consul server
2022-12-16T16:42:39.591Z [INFO]  consul-dataplane.server-connection-manager: current prioritized list of known Consul servers: addresses=[10.209.16.195:8502, 10.209.55.131:8502]
2022-12-16T16:42:39.792Z [ERROR] consul-dataplane.server-connection-manager: connection error: error="fetching supported dataplane features: rpc error: code = Unimplemented desc = unknown service hashicorp.consul.dataplane.DataplaneService"
2022-12-16T16:42:40.017Z+00:00 [warning] envoy.config(15) DeltaAggregatedResources gRPC config stream to consul-dataplane closed: 14, last resolver error: produced zero addresses (previously 8, this server has too many xDS streams open, please try another since 0s ago)
2022-12-16T16:42:40.874Z [INFO]  consul-dataplane.server-connection-manager: trying to connect to a Consul server
2022-12-16T16:42:40.874Z [INFO]  consul-dataplane.server-connection-manager: current prioritized list of known Consul servers: addresses=[10.209.55.131:8502, 10.209.16.195:8502]
2022-12-16T16:42:41.075Z [INFO]  consul-dataplane.server-connection-manager: connected to Consul server: address=10.209.55.131:8502
2022-12-16T16:42:41.075Z [INFO]  consul-dataplane.server-connection-manager: updated known Consul servers from watch stream: addresses=[10.209.16.195:8502, 10.209.55.131:8502]
2022-12-16T16:42:42.180Z+00:00 [warning] envoy.config(15) DeltaAggregatedResources gRPC config stream to consul-dataplane closed: 8, this server has too many xDS streams open, please try another (previously 14, last resolver error: produced zero addresses since 2s ago)
2022-12-16T16:42:43.490Z [INFO]  consul-dataplane.server-connection-manager: trying to connect to a Consul server
2022-12-16T16:42:43.490Z [INFO]  consul-dataplane.server-connection-manager: current prioritized list of known Consul servers: addresses=[10.209.16.195:8502, 10.209.55.131:8502]
2022-12-16T16:42:43.693Z [ERROR] consul-dataplane.server-connection-manager: connection error: error="fetching supported dataplane features: rpc error: code = Unimplemented desc = unknown service hashicorp.consul.dataplane.DataplaneService"
2022-12-16T16:42:45.275Z [INFO]  consul-dataplane.server-connection-manager: trying to connect to a Consul server
2022-12-16T16:42:45.275Z [INFO]  consul-dataplane.server-connection-manager: current prioritized list of known Consul servers: addresses=[10.209.55.131:8502, 10.209.16.195:8502]
2022-12-16T16:42:45.477Z [INFO]  consul-dataplane.server-connection-manager: connected to Consul server: address=10.209.55.131:8502
2022-12-16T16:42:45.477Z [INFO]  consul-dataplane.server-connection-manager: updated known Consul servers from watch stream: addresses=[10.209.16.195:8502, 10.209.55.131:8502]
2022-12-16T16:42:48.048Z [INFO]  consul-dataplane.server-connection-manager: trying to connect to a Consul server
2022-12-16T16:42:48.048Z [INFO]  consul-dataplane.server-connection-manager: current prioritized list of known Consul servers: addresses=[10.209.16.195:8502, 10.209.55.131:8502]
2022-12-16T16:42:48.249Z [ERROR] consul-dataplane.server-connection-manager: connection error: error="fetching supported dataplane features: rpc error: code = Unimplemented desc = unknown service hashicorp.consul.dataplane.DataplaneService"
2022-12-16T16:42:52.756Z [INFO]  consul-dataplane.server-connection-manager: trying to connect to a Consul server
2022-12-16T16:42:52.756Z [INFO]  consul-dataplane.server-connection-manager: current prioritized list of known Consul servers: addresses=[10.209.55.131:8502, 10.209.16.195:8502]
2022-12-16T16:42:52.957Z [INFO]  consul-dataplane.server-connection-manager: connected to Consul server: address=10.209.55.131:8502
2022-12-16T16:42:52.958Z [INFO]  consul-dataplane.server-connection-manager: updated known Consul servers from watch stream: addresses=[10.209.16.195:8502, 10.209.55.131:8502]
2022-12-16T16:42:54.035Z+00:00 [warning] envoy.config(15) gRPC config: initial fetch timed out for type.googleapis.com/envoy.config.cluster.v3.Cluster
2022-12-16T16:42:54.035Z+00:00 [info] envoy.upstream(15) cm init: all clusters initialized
2022-12-16T16:42:54.035Z+00:00 [info] envoy.main(15) all clusters initialized. initializing init manager

2022-12-16T16:43:05.786Z+00:00 [warning] envoy.main(15) caught ENVOY_SIGTERM
2022-12-16T16:43:05.786Z+00:00 [info] envoy.main(15) shutting down server instance
2022-12-16T16:43:05.786Z+00:00 [info] envoy.main(15) main dispatch loop exited
2022-12-16T16:43:05.786Z [INFO]  consul-dataplane.metrics: stopping the merged  server
2022-12-16T16:43:05.786Z [INFO]  consul-dataplane.metrics: stopping consul dp promtheus server
2022-12-16T16:43:05.786Z [INFO]  consul-dataplane.server-connection-manager: stopping
2022-12-16T16:43:05.786Z [INFO]  consul-dataplane: context done stopping xds server
2022-12-16T16:43:05.787Z [INFO]  consul-dataplane: envoy process exited: error="signal: killed"
2022-12-16T16:43:05.792Z [INFO]  consul-dataplane.server-connection-manager: ACL auth method logout succeeded

kubernetes events from the pod:

Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  4m59s                  default-scheduler  Successfully assigned default/static-client-564d47f975-k9zh6 to ip-10-209-55-131.eu-central-1.compute.internal
  Normal   Pulled     4m58s                  kubelet            Container image "hashicorp/consul-k8s-control-plane:1.0.2" already present on machine
  Normal   Created    4m58s                  kubelet            Created container consul-connect-inject-init
  Normal   Started    4m58s                  kubelet            Started container consul-connect-inject-init
  Normal   Pulled     4m56s                  kubelet            Container image "rancher/curlimages-curl:7.73.0" already present on machine
  Normal   Created    4m56s                  kubelet            Created container static-client
  Normal   Started    4m56s                  kubelet            Started container static-client
  Normal   Pulled     4m29s (x2 over 4m56s)  kubelet            Container image "hashicorp/consul-dataplane:1.0.0" already present on machine
  Normal   Created    4m29s (x2 over 4m56s)  kubelet            Created container consul-dataplane
  Normal   Started    4m29s (x2 over 4m56s)  kubelet            Started container consul-dataplane
  Normal   Killing    4m29s                  kubelet            Container consul-dataplane failed liveness probe, will be restarted
  Warning  Unhealthy  4m19s (x7 over 4m55s)  kubelet            Readiness probe failed: dial tcp 10.209.40.109:20000: connect: connection refused
  Warning  Unhealthy  4m9s (x5 over 4m49s)   kubelet            Liveness probe failed: dial tcp 10.209.40.109:20000: connect: connection refused
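The failing readiness and liveness probes in these events are plain TCP checks against the dataplane's proxy port (20000, per the probe messages). A minimal stand-in for that check, just a bare TCP dial like the kubelet performs, can be sketched as:

```shell
# Mimic a kubelet tcpSocket probe: succeed iff a TCP connect completes
# within 1 second. Prints "ready" or "connection refused".
tcp_probe() {
  local host=$1 port=$2
  if timeout 1 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "ready"
  else
    echo "connection refused"
  fi
}

# e.g. against the pod IP and port from the events above (hypothetical
# values, copied from this report; will only resolve inside the cluster):
# tcp_probe 10.209.40.109 20000
```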
chris93111 commented 1 year ago

Hi, same issue with the Helm chart and Consul 1.14.2.

Any update?

KESNERO commented 1 year ago

I'm facing the same issue. Is there any solution yet?

christophermichaeljohnston commented 11 months ago

Been having the same issue for a while now and opened #2509 before seeing this one. On another read, it's a different issue but similar behavior... the dataplane just self-destructs. It seems like consul-dataplane never reaches a healthy state, so it exits.