linkerd / linkerd2

Ultralight, security-first service mesh for Kubernetes. Main repo for Linkerd 2.x.
https://linkerd.io
Apache License 2.0
10.59k stars 1.27k forks

kubectl command starts failing after a few minutes #4980

Closed rcjames closed 4 years ago

rcjames commented 4 years ago

Bug Report

What is the issue?

When using kubectl to provision a secret from within the cluster, the pod stops being able to access the Kubernetes API after a few minutes. The underlying host can still contact the master, and running the pod without the linkerd-proxy sidecar makes the issue disappear.

How can it be reproduced?

Create a new AWS EKS cluster, v1.16.

Install kube2iam and cert-manager. Follow the guide on automatically rotating control plane TLS credentials here - https://linkerd.io/2/tasks/automatically-rotating-control-plane-tls-credentials/

Install Linkerd Edge 20.9.2 via the Helm chart, with the following values:

enablePodAntiAffinity: true
# https://linkerd.io/2/tasks/automatically-rotating-control-plane-tls-credentials/#installing-with-helm
installNamespace: false

global:
  # kubectl get secret linkerd-identity-issuer -o yaml -n linkerd | grep ca.crt | cut -d ':' -f 2 | awk '{print $1}' | base64 -d
  identityTrustAnchorsPEM: |-
    -----BEGIN CERTIFICATE-----
    -----END CERTIFICATE-----

  # Indicate if the CNI is installed: https://linkerd.io/2/tasks/install-helm/#disabling-the-proxy-init-container
  cniEnabled: false

  # proxy configuration
  proxy:
    logLevel: warn,linkerd2_proxy=warn
    resources:
      cpu:
        limit: 100m
        request: 25m
      memory:
        limit: 100Mi
        request: 20Mi
    # https://linkerd.io/2/tasks/graceful-shutdown/
    waitBeforeExitSeconds: 15

# controller configuration
controllerReplicas: 3
controllerResources: &controller_resources
  cpu: &controller_resources_cpu
    limit: "0.5"
    request: "0.1"
  memory:
    limit: 250Mi
    request: 50Mi

destinationResources:
  cpu:
    limit: "0.5"
    request: "0.1"
  memory:
    limit: 250Mi
    request: 50Mi

publicAPIResources:
  cpu:
    limit: "0.5"
    request: "0.1"
  memory:
    limit: 250Mi
    request: 50Mi

# https://linkerd.io/2/tasks/automatically-rotating-control-plane-tls-credentials/#installing-with-helm
identity:
  issuer:
    scheme: kubernetes.io/tls

# identity configuration
identityResources:
  cpu: *controller_resources_cpu
  memory:
    limit: 250Mi
    request: 10Mi

# grafana configuration
grafana:
  resources:
    cpu: *controller_resources_cpu
    memory:
      limit: 1024Mi
      request: 50Mi

# heartbeat configuration
heartbeatResources: *controller_resources

prometheus:
  globalConfig:
    external_labels:
      federation: "linkerd2"

# prometheus configuration
prometheusResources:
  cpu:
    limit: "4"
    request: 300m
  memory:
    limit: 8192Mi
    request: 6000Mi

# proxy injector configuration
proxyInjectorResources: *controller_resources
webhookFailurePolicy: Fail

# service profile validator configuration
spValidatorResources: *controller_resources

# tap configuration
tapResources: *controller_resources

# web configuration
webResources: *controller_resources
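The identityTrustAnchorsPEM value above is produced by the pipeline shown in the inline comment of the values file. A self-contained sketch of that extraction, run against a sample line rather than a live cluster (the secret name and ca.crt field are taken from the comment; the sample value is just the base64-encoded PEM header for brevity):

```shell
# A sample of the line that `kubectl get secret linkerd-identity-issuer -o yaml`
# would emit; a real ca.crt value is a full base64-encoded PEM certificate.
sample='  ca.crt: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0t'

# Same pipeline as the values-file comment: isolate the value and decode it.
echo "$sample" | grep ca.crt | cut -d ':' -f 2 | awk '{print $1}' | base64 -d
# prints: -----BEGIN CERTIFICATE-----
```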

Create a Namespace, ServiceAccount, Role, RoleBinding and Deployment for provisioning a secret:

---
apiVersion: v1
kind: Namespace
metadata:
  name: super-secret
  annotations:
    "linkerd.io/inject": "enabled"

---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: super-secret
  namespace: super-secret

---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: super-secret
  namespace: super-secret
rules:
- apiGroups:
  - ""
  resources:
  - secrets
  verbs:
  - '*'

---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: super-secret
  namespace: super-secret
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: super-secret
subjects:
- kind: ServiceAccount
  name: super-secret

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: super-secret
  namespace: super-secret
spec:
  selector:
    matchLabels:
      app: super-secret
  template:
    metadata:
      labels:
        app: super-secret
    spec:
      serviceAccountName: super-secret
      containers:
      - name: kubectl
        image: bitnami/kubectl:1.17
        command:
        - /bin/bash
        args:
        - -c
        - |
          #!/bin/bash
          SECRET_DIR=$(mktemp -d)
          echo "password" > $SECRET_DIR/password

          while true; do
            echo "$(date +'%F %T') Applying secret"
            kubectl create secret generic \
                --namespace $NAMESPACE \
                --from-file=$SECRET_DIR \
                super-secret \
                --dry-run \
                -o yaml | \
            kubectl apply -f -
            echo "$(date +'%F %T') Waiting for 10 seconds"
            sleep 10
          done
        env:
        - name: NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
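The container's loop renders a Secret manifest with a client-side dry run and pipes it into `kubectl apply`. That pattern can be exercised locally by stubbing kubectl as a shell function (a sketch only; the real repro runs the bitnami/kubectl image against a live API server):

```shell
# Stub kubectl so the pipeline runs without a cluster: `create` emits a
# minimal Secret manifest, `apply` consumes stdin and reports success.
kubectl() {
  case "$1" in
    create) printf 'apiVersion: v1\nkind: Secret\nmetadata:\n  name: super-secret\n' ;;
    apply)  cat > /dev/null; echo 'secret/super-secret configured' ;;
  esac
}

SECRET_DIR=$(mktemp -d)
echo "password" > "$SECRET_DIR/password"

# Same shape as one iteration of the Deployment's loop.
kubectl create secret generic \
    --namespace super-secret \
    --from-file="$SECRET_DIR" \
    super-secret \
    --dry-run \
    -o yaml | \
kubectl apply -f -
# prints: secret/super-secret configured
```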

Tail the logs; after 6-10 minutes the pod should become unable to contact the Kubernetes API (excerpt below):

2020-09-17 08:15:07 Applying secret
secret/super-secret configured
2020-09-17 08:15:07 Waiting for 10 seconds
2020-09-17 08:15:17 Applying secret
secret/super-secret configured
2020-09-17 08:15:17 Waiting for 10 seconds
2020-09-17 08:15:27 Applying secret
secret/super-secret configured
2020-09-17 08:15:28 Waiting for 10 seconds
2020-09-17 08:15:38 Applying secret
secret/super-secret configured
2020-09-17 08:15:38 Waiting for 10 seconds
2020-09-17 08:15:48 Applying secret
Unable to connect to the server: EOF
2020-09-17 08:15:51 Waiting for 10 seconds
2020-09-17 08:16:01 Applying secret
Unable to connect to the server: EOF
2020-09-17 08:16:01 Waiting for 10 seconds
2020-09-17 08:16:11 Applying secret
Unable to connect to the server: EOF
2020-09-17 08:16:11 Waiting for 10 seconds
2020-09-17 08:16:21 Applying secret
Unable to connect to the server: EOF
2020-09-17 08:16:21 Waiting for 10 seconds

Logs, error output, etc

Some snippets of logs from linkerd-proxy where I think problems might be occurring, but I'm not too sure.

[   362.853092058s]  INFO ThreadId(01) outbound: linkerd2_proxy_api_resolve::resolve: No endpoints
[   362.853116199s] DEBUG ThreadId(01) outbound: linkerd2_proxy_api_resolve::resolve: Add endpoints=1
[   362.853137039s] DEBUG ThreadId(01) outbound: linkerd2_app_outbound::endpoint: Resolved endpoint dst=10.100.0.1:443 addr=10.100.0.1:443 metadata=Metadata { weight: 10000, labels: {}, protocol_hint: Unknown, identity: None, authority_override: None }
[   362.853149169s] DEBUG ThreadId(01) outbound: linkerd2_app_outbound::prevent_loop: addr=10.100.0.1:443 self.port=4140
[   362.853167969s]  INFO ThreadId(01) outbound: linkerd2_proxy_api_resolve::resolve: No endpoints
[   371.392221350s] DEBUG ThreadId(01) outbound: tower::balance::p2c::service: updating from discover
[   371.392267920s] DEBUG ThreadId(01) outbound: linkerd2_timeout::failfast: Failing
[   371.392364101s]  INFO ThreadId(01) outbound:accept{peer.addr=172.22.24.90:51560}: linkerd2_app_core::serve: Connection closed error=Service in fail-fast

A larger snippet with full debug logs, covering the period when requests were succeeding and then starting to fail, is available at https://gist.github.com/rcjames/d8d7d95501d05348828ae9e4131d36bf

linkerd check output

kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API

kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version

linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
√ controller pod is running
√ can initialize the client
√ can query the control plane API

linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist
√ control plane PodSecurityPolicies exist

linkerd-identity
----------------
√ certificate config is valid
√ trust anchors are using supported crypto algorithm
√ trust anchors are within their validity period
√ trust anchors are valid for at least 60 days
√ issuer cert is using supported crypto algorithm
√ issuer cert is within its validity period
‼ issuer cert is valid for at least 60 days
    issuer certificate will expire on 2020-09-17T18:07:30Z
    see https://linkerd.io/checks/#l5d-identity-issuer-cert-not-expiring-soon for hints
√ issuer cert is issued by the trust anchor

linkerd-webhooks-and-apisvc-tls
-------------------------------
√ tap API server has valid cert
√ proxy-injector webhook has valid cert
√ sp-validator webhook has valid cert

linkerd-api
-----------
√ control plane pods are ready
√ control plane self-check
√ [kubernetes] control plane can talk to Kubernetes
√ [prometheus] control plane can talk to Prometheus
√ tap api service is running

linkerd-version
---------------
√ can determine the latest version
√ cli is up-to-date

control-plane-version
---------------------
√ control plane is up-to-date
√ control plane and cli versions match

linkerd-ha-checks
-----------------
‼ pod injection disabled on kube-system
    kube-system namespace needs to have the label config.linkerd.io/admission-webhooks: disabled if injector webhook failure policy is Fail
    see https://linkerd.io/checks/#l5d-injection-disabled for hints

linkerd-addons
--------------
√ 'linkerd-config-addons' config map exists

linkerd-prometheus
------------------
√ prometheus add-on service account exists
√ prometheus add-on config map exists
√ prometheus pod is running

linkerd-grafana
---------------
√ grafana add-on service account exists
√ grafana add-on config map exists
√ grafana pod is running

Status check results are √

Environment

Pothulapati commented 4 years ago

@rcjames Thanks for the detailed report.

In edge 20.9.2 we added discovery to all outbound connections, and here the proxy is trying to discover the x:443 address of the k8s API and failing. This case is similar to that of the Linkerd control plane pods, as they need access to the K8s API. We skip port 443 from being discovered for the control-plane components, and you should probably do the same by adding the config.linkerd.io/skip-outbound-ports: 443 annotation to the pod spec, as per https://linkerd.io/2/reference/proxy-configuration/
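For reference, a sketch of where that annotation would sit in the repro's Deployment (note the value must be quoted, since Kubernetes annotation values are strings):

```yaml
spec:
  template:
    metadata:
      annotations:
        # Bypass proxy discovery for outbound traffic to the API server's port.
        config.linkerd.io/skip-outbound-ports: "443"
```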

Can you try doing this, and reply back with your findings?

rcjames commented 4 years ago

@Pothulapati - Thank you for your swift response. Adding config.linkerd.io/skip-outbound-ports: "443" (with quotes :facepalm:) has solved this problem. Thank you very much for your help!