linkerd / linkerd2

Ultralight, security-first service mesh for Kubernetes. Main repo for Linkerd 2.x.
https://linkerd.io
Apache License 2.0

linkerd-identity connection issues / retries #3388

Closed michalschott closed 4 years ago

michalschott commented 5 years ago

Bug Report

What is the issue?

Kubernetes 1.12.10 + Calico CNI 3.7.4 (kops 1.12.2), so obviously on AWS.

Currently running edge-19.8.7, but we had a similar problem with versions 2.3.0, 2.3.1, 2.3.2, 2.4.0 and 2.5.0. It happens with an --ha control plane install as well.

The cluster is basically non-HA: single AZ, a single master node, and a few workers. To save some money, we're using ASG scheduled scaling to scale the ASGs to 0 and bring them back up for working hours.
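
For reference, the schedule is implemented with plain ASG scheduled actions; roughly like this (the group name and times here are illustrative, not our exact values):

# Scale the worker ASG to 0 outside working hours (names/times are examples)
aws autoscaling put-scheduled-update-group-action \
  --auto-scaling-group-name workers.example.k8s.local \
  --scheduled-action-name scale-in-evening \
  --recurrence "0 19 * * 1-5" \
  --min-size 0 --max-size 0 --desired-capacity 0

# Scale it back up for working hours
aws autoscaling put-scheduled-update-group-action \
  --auto-scaling-group-name workers.example.k8s.local \
  --scheduled-action-name scale-out-morning \
  --recurrence "0 7 * * 1-5" \
  --min-size 3 --max-size 3 --desired-capacity 3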

Every morning we face the same problem:

➜  ~ k -n linkerd get pod
NAME                                     READY   STATUS    RESTARTS   AGE
linkerd-cni-l9nbn                        1/1     Running   0          76s
linkerd-cni-lqfb4                        1/1     Running   0          78s
linkerd-cni-mmlnc                        1/1     Running   0          81s
linkerd-controller-75f98dd669-zch6q      2/3     Running   0          86s
linkerd-grafana-78cbd8d65-pqjk2          1/2     Running   0          86s
linkerd-identity-646bb685d8-d9z4j        2/2     Running   0          86s
linkerd-prometheus-6f789b6c76-z54hc      1/2     Running   0          86s
linkerd-proxy-injector-88f88df7f-r5hcd   1/2     Running   0          86s
linkerd-sp-validator-5cf54b4f56-kgjkk    1/2     Running   0          86s
linkerd-tap-57564d5949-st5zq             1/2     Running   0          85s
linkerd-web-7c4c4c5b98-gbwjl             1/2     Running   0          85s

➜  ~ stern -n linkerd linkerd-identity
+ linkerd-identity-646bb685d8-d9z4j › linkerd-proxy
+ linkerd-identity-646bb685d8-d9z4j › identity
linkerd-identity-646bb685d8-d9z4j identity time="2019-09-05T10:21:02Z" level=info msg="running version edge-19.8.7"
linkerd-identity-646bb685d8-d9z4j identity time="2019-09-05T10:21:02Z" level=info msg="starting gRPC server on :8080"
linkerd-identity-646bb685d8-d9z4j identity time="2019-09-05T10:21:02Z" level=info msg="starting admin server on :9990"
linkerd-identity-646bb685d8-d9z4j identity time="2019-09-05T10:21:10Z" level=info msg="certifying linkerd-identity.linkerd.serviceaccount.identity.linkerd.cluster.local until 2019-09-06 10:21:30 +0000 UTC"
linkerd-identity-646bb685d8-d9z4j linkerd-proxy time="2019-09-05T10:21:10Z" level=info msg="running version edge-19.8.7"
linkerd-identity-646bb685d8-d9z4j linkerd-proxy INFO [     0.005676s] linkerd2_proxy::app::main using destination service at ControlAddr { addr: Name(NameAddr { name: "linkerd-destination.linkerd.svc.cluster.local", port: 8086 }), identity: Some("linkerd-controller.linkerd.serviceaccount.identity.linkerd.cluster.local") }
linkerd-identity-646bb685d8-d9z4j linkerd-proxy INFO [     0.005710s] linkerd2_proxy::app::main using identity service at Name(NameAddr { name: "localhost.", port: 8080 })
linkerd-identity-646bb685d8-d9z4j linkerd-proxy INFO [     0.005717s] linkerd2_proxy::app::main routing on V4(127.0.0.1:4140)
linkerd-identity-646bb685d8-d9z4j linkerd-proxy INFO [     0.005722s] linkerd2_proxy::app::main proxying on V4(0.0.0.0:4143) to None
linkerd-identity-646bb685d8-d9z4j linkerd-proxy INFO [     0.005726s] linkerd2_proxy::app::main serving admin endpoint metrics on V4(0.0.0.0:4191)
linkerd-identity-646bb685d8-d9z4j linkerd-proxy INFO [     0.005729s] linkerd2_proxy::app::main protocol detection disabled for inbound ports {25, 3306}
linkerd-identity-646bb685d8-d9z4j linkerd-proxy INFO [     0.005740s] linkerd2_proxy::app::main protocol detection disabled for outbound ports {25, 3306}
linkerd-identity-646bb685d8-d9z4j linkerd-proxy INFO [     0.013214s] linkerd2_proxy::app::main Certified identity: linkerd-identity.linkerd.serviceaccount.identity.linkerd.cluster.local

➜  ~ stern -n linkerd linkerd-controller
+ linkerd-controller-75f98dd669-zch6q › linkerd-proxy
+ linkerd-controller-75f98dd669-zch6q › public-api
+ linkerd-controller-75f98dd669-zch6q › destination
linkerd-controller-75f98dd669-zch6q destination time="2019-09-05T10:20:34Z" level=info msg="running version edge-19.8.7"
linkerd-controller-75f98dd669-zch6q destination time="2019-09-05T10:20:34Z" level=info msg="waiting for caches to sync"
linkerd-controller-75f98dd669-zch6q destination time="2019-09-05T10:20:34Z" level=info msg="caches synced"
linkerd-controller-75f98dd669-zch6q destination time="2019-09-05T10:20:34Z" level=info msg="starting admin server on :9996"
linkerd-controller-75f98dd669-zch6q destination time="2019-09-05T10:20:34Z" level=info msg="starting gRPC server on :8086"
linkerd-controller-75f98dd669-zch6q public-api time="2019-09-05T10:20:33Z" level=info msg="running version edge-19.8.7"
linkerd-controller-75f98dd669-zch6q public-api time="2019-09-05T10:20:33Z" level=info msg="Using cluster domain: cluster.local"
linkerd-controller-75f98dd669-zch6q public-api time="2019-09-05T10:20:33Z" level=info msg="waiting for caches to sync"
linkerd-controller-75f98dd669-zch6q public-api time="2019-09-05T10:20:33Z" level=info msg="caches synced"
linkerd-controller-75f98dd669-zch6q public-api time="2019-09-05T10:20:33Z" level=info msg="starting admin server on :9995"
linkerd-controller-75f98dd669-zch6q public-api time="2019-09-05T10:20:33Z" level=info msg="starting HTTP server on :8085"
linkerd-controller-75f98dd669-zch6q linkerd-proxy time="2019-09-05T10:20:35Z" level=info msg="running version edge-19.8.7"
linkerd-controller-75f98dd669-zch6q linkerd-proxy INFO [     0.003039s] linkerd2_proxy::app::main using destination service at ControlAddr { addr: Name(NameAddr { name: "localhost.", port: 8086 }), identity: None(NoPeerName(Loopback)) }
linkerd-controller-75f98dd669-zch6q linkerd-proxy INFO [     0.003068s] linkerd2_proxy::app::main using identity service at Name(NameAddr { name: "linkerd-identity.linkerd.svc.cluster.local", port: 8080 })
linkerd-controller-75f98dd669-zch6q linkerd-proxy INFO [     0.003073s] linkerd2_proxy::app::main routing on V4(127.0.0.1:4140)
linkerd-controller-75f98dd669-zch6q linkerd-proxy INFO [     0.003078s] linkerd2_proxy::app::main proxying on V4(0.0.0.0:4143) to None
linkerd-controller-75f98dd669-zch6q linkerd-proxy INFO [     0.003082s] linkerd2_proxy::app::main serving admin endpoint metrics on V4(0.0.0.0:4191)
linkerd-controller-75f98dd669-zch6q linkerd-proxy INFO [     0.003085s] linkerd2_proxy::app::main protocol detection disabled for inbound ports {25, 3306}
linkerd-controller-75f98dd669-zch6q linkerd-proxy INFO [     0.003094s] linkerd2_proxy::app::main protocol detection disabled for outbound ports {25, 3306}
linkerd-controller-75f98dd669-zch6q linkerd-proxy WARN [     0.510107s] linkerd-identity.linkerd.svc.cluster.local:8080 linkerd2_proxy::proxy::reconnect connect error to ControlAddr { addr: Name(NameAddr { name: "linkerd-identity.linkerd.svc.cluster.local", port: 8080 }), identity: Some("linkerd-identity.linkerd.serviceaccount.identity.linkerd.cluster.local") }: request timed out
linkerd-controller-75f98dd669-zch6q linkerd-proxy ERR! [     5.005627s] admin={bg=identity} linkerd2_proxy::app::identity Failed to certify identity: grpc-status: Unknown, grpc-message: "the request could not be dispatched in a timely fashion"
linkerd-controller-75f98dd669-zch6q linkerd-proxy ERR! [    20.009398s] admin={bg=identity} linkerd2_proxy::app::identity Failed to certify identity: grpc-status: Unknown, grpc-message: "the request could not be dispatched in a timely fashion"
linkerd-controller-75f98dd669-zch6q linkerd-proxy ERR! [    35.011593s] admin={bg=identity} linkerd2_proxy::app::identity Failed to certify identity: grpc-status: Unknown, grpc-message: "the request could not be dispatched in a timely fashion"
linkerd-controller-75f98dd669-zch6q linkerd-proxy ERR! [    50.015588s] admin={bg=identity} linkerd2_proxy::app::identity Failed to certify identity: grpc-status: Unknown, grpc-message: "the request could not be dispatched in a timely fashion"
linkerd-controller-75f98dd669-zch6q linkerd-proxy ERR! [    65.018544s] admin={bg=identity} linkerd2_proxy::app::identity Failed to certify identity: grpc-status: Unknown, grpc-message: "the request could not be dispatched in a timely fashion"
linkerd-controller-75f98dd669-zch6q linkerd-proxy ERR! [    80.020385s] admin={bg=identity} linkerd2_proxy::app::identity Failed to certify identity: grpc-status: Unknown, grpc-message: "the request could not be dispatched in a timely fashion"
linkerd-controller-75f98dd669-zch6q linkerd-proxy ERR! [    95.022192s] admin={bg=identity} linkerd2_proxy::app::identity Failed to certify identity: grpc-status: Unknown, grpc-message: "the request could not be dispatched in a timely fashion"
linkerd-controller-75f98dd669-zch6q linkerd-proxy ERR! [   110.024145s] admin={bg=identity} linkerd2_proxy::app::identity Failed to certify identity: grpc-status: Unknown, grpc-message: "the request could not be dispatched in a timely fashion"
linkerd-controller-75f98dd669-zch6q linkerd-proxy ERR! [   125.026140s] admin={bg=identity} linkerd2_proxy::app::identity Failed to certify identity: grpc-status: Unknown, grpc-message: "the request could not be dispatched in a timely fashion"

After restarting the linkerd-identity pod (k -n linkerd delete pod) and waiting a few seconds:

➜  ~ k -n linkerd get pod
NAME                                     READY   STATUS    RESTARTS   AGE
linkerd-cni-l9nbn                        1/1     Running   0          3m29s
linkerd-cni-lqfb4                        1/1     Running   0          3m31s
linkerd-cni-mmlnc                        1/1     Running   0          3m34s
linkerd-controller-75f98dd669-zch6q      3/3     Running   0          3m39s
linkerd-grafana-78cbd8d65-pqjk2          2/2     Running   0          3m39s
linkerd-identity-646bb685d8-wsmqv        2/2     Running   1          42s
linkerd-prometheus-6f789b6c76-z54hc      2/2     Running   0          3m39s
linkerd-proxy-injector-88f88df7f-r5hcd   2/2     Running   0          3m39s
linkerd-sp-validator-5cf54b4f56-kgjkk    2/2     Running   0          3m39s
linkerd-tap-57564d5949-st5zq             2/2     Running   0          3m38s
linkerd-web-7c4c4c5b98-gbwjl             2/2     Running   0          3m38s

And it starts to work. So my understanding is that retries are not being handled properly?
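
For reference, the restart is just deleting the identity pod, e.g. by its control-plane label (assuming the standard Linkerd control-plane labels):

# Delete the identity pod so the Deployment recreates it
kubectl -n linkerd delete pod -l linkerd.io/control-plane-component=identity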

Attaching linkerd configmaps:

➜  ~ k -n linkerd get cm linkerd-cni-config -o yaml
apiVersion: v1
data:
  cni_network_config: |-
    {
      "name": "linkerd-cni",
      "type": "linkerd-cni",
      "log_level": "__LOG_LEVEL__",
      "policy": {
          "type": "k8s",
          "k8s_api_root": "https://__KUBERNETES_SERVICE_HOST__:__KUBERNETES_SERVICE_PORT__",
          "k8s_auth_token": "__SERVICEACCOUNT_TOKEN__"
      },
      "kubernetes": {
          "kubeconfig": "__KUBECONFIG_FILEPATH__"
      },
      "linkerd": {
        "incoming-proxy-port": __INCOMING_PROXY_PORT__,
        "outgoing-proxy-port": __OUTGOING_PROXY_PORT__,
        "proxy-uid": __PROXY_UID__,
        "ports-to-redirect": [__PORTS_TO_REDIRECT__],
        "inbound-ports-to-ignore": [__INBOUND_PORTS_TO_IGNORE__],
        "outbound-ports-to-ignore": [__OUTBOUND_PORTS_TO_IGNORE__],
        "simulate": __SIMULATE__,
        "use-wait-flag": __USE_WAIT_FLAG__
      }
    }
  dest_cni_bin_dir: /opt/cni/bin
  dest_cni_net_dir: /etc/cni/net.d
  inbound_ports_to_ignore: 4190,4191
  incoming_proxy_port: "4143"
  log_level: info
  outbound_ports_to_ignore: ""
  outgoing_proxy_port: "4140"
  proxy_uid: "2102"
  simulate: "false"
  use_wait_flag: "false"
kind: ConfigMap
metadata:
  labels:
    linkerd.io/cni-resource: "true"
    linkerd.io/control-plane-ns: linkerd
  name: linkerd-cni-config
  namespace: linkerd

➜  ~ k -n linkerd get cm linkerd-config -o yaml
apiVersion: v1
data:
  global: |
    {"linkerdNamespace":"linkerd","cniEnabled":true,"version":"edge-19.8.7","identityContext":{"trustDomain":"cluster.local","trustAnchorsPem":"XXX","issuanceLifetime":"86400s","clockSkewAllowance":"20s"},"autoInjectContext":null,"omitWebhookSideEffects":false,"clusterDomain":"cluster.local"}
  install: |
    {"uuid":"ea36a5a6-9df8-4d0c-ad32-64d971a2021d","cliVersion":"edge-19.8.7","flags":[{"name":"image-pull-policy","value":"Always"},{"name":"linkerd-cni-enabled","value":"true"},{"name":"proxy-cpu-limit","value":"100m"},{"name":"proxy-cpu-request","value":"25m"},{"name":"proxy-memory-limit","value":"64Mi"},{"name":"proxy-memory-request","value":"32Mi"}]}
  proxy: |
    {"proxyImage":{"imageName":"gcr.io/linkerd-io/proxy","pullPolicy":"Always"},"proxyInitImage":{"imageName":"gcr.io/linkerd-io/proxy-init","pullPolicy":"Always"},"controlPort":{"port":4190},"ignoreInboundPorts":[],"ignoreOutboundPorts":[],"inboundPort":{"port":4143},"adminPort":{"port":4191},"outboundPort":{"port":4140},"resource":{"requestCpu":"25m","requestMemory":"32Mi","limitCpu":"100m","limitMemory":"64Mi"},"proxyUid":"2102","logLevel":{"level":"warn,linkerd2_proxy=info"},"disableExternalProfiles":true,"proxyVersion":"edge-19.8.7","proxyInitImageVersion":"v1.1.0"}
kind: ConfigMap
metadata:
  annotations:
    linkerd.io/created-by: linkerd/cli edge-19.8.7
  labels:
    linkerd.io/control-plane-component: controller
    linkerd.io/control-plane-ns: linkerd
  name: linkerd-config
  namespace: linkerd
grampelberg commented 5 years ago

This appears to be an ordering issue resulting from scaling the cluster to zero. I would assume that proxies would keep retrying to get their certificates until identity comes online, though.

michalschott commented 5 years ago

That's correct, we're scaling down all worker nodes to 0.

The proxies are still trying to reach the identity service, but keep failing until the identity pod is manually restarted.

grampelberg commented 5 years ago

Oh, that's an interesting tidbit. The identity pod comes up after you've scaled up ... but that needs to be restarted again to make everything work?

michalschott commented 5 years ago

That's correct.

cpretzer commented 5 years ago

@michalschott I'll work on reproducing this and let you know what I find.

michalschott commented 4 years ago

@cpretzer any progress on this?

cpretzer commented 4 years ago

@michalschott

Sorry for the delay in updating this issue. TL;DR: I ran a test on GKE and I think I've reproduced the issue. I found an error message in the calico pods which I have to explore further.

The Details:

Using the booksapp demo, the first interesting thing I found is that the app continues to serve requests; that is, the booksapp page loads in my browser despite the state of the pods described below.

In order to reproduce this, I created a three-node cluster in GKE and deployed linkerd and the booksapp demo. I then scaled the cluster to 0 nodes using the gcloud CLI, then scaled it back up to three nodes. The assumption here is that it's not the particular implementation (EKS vs. GKE) that causes the errors, but rather the restarting of the pods in the cluster.
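
For reference, the scale-down and scale-up were done with the plain gcloud resize command, roughly like this (cluster and node pool names are illustrative):

# Scale the GKE node pool to zero, then back to three nodes
gcloud container clusters resize cpretzer-dev --node-pool default-pool --num-nodes 0 --quiet
gcloud container clusters resize cpretzer-dev --node-pool default-pool --num-nodes 3 --quiet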

That being said, I see the following output from the proxy logs (I'm not using stern in my tests):

time="2019-09-20T19:55:08Z" level=info msg="running version stable-2.5.0"
INFO [     0.005047s] linkerd2_proxy::app::main using destination service at Some(ControlAddr { addr: Name(NameAddr { name: "linkerd-destination.linkerd.svc.cluster.local", port: 8086 }), identity: Some("linkerd-controller.linkerd.serviceaccount.identity.linkerd.cluster.local") })
INFO [     0.005182s] linkerd2_proxy::app::main using identity service at Name(NameAddr { name: "localhost.", port: 8080 })
INFO [     0.005213s] linkerd2_proxy::app::main routing on V4(127.0.0.1:4140)
INFO [     0.005232s] linkerd2_proxy::app::main proxying on V4(0.0.0.0:4143) to None
INFO [     0.005285s] linkerd2_proxy::app::main serving admin endpoint metrics on V4(0.0.0.0:4191)
INFO [     0.005309s] linkerd2_proxy::app::main protocol detection disabled for inbound ports {25, 3306}
INFO [     0.005333s] linkerd2_proxy::app::main protocol detection disabled for outbound ports {25, 3306}
INFO [     0.024629s] linkerd2_proxy::app::main Certified identity: linkerd-identity.linkerd.serviceaccount.identity.linkerd.cluster.local
WARN [    15.147864s] linkerd-destination.linkerd.svc.cluster.local:8086 linkerd2_proxy::proxy::reconnect connect error to ControlAddr { addr: Name(NameAddr { name: "linkerd-destination.linkerd.svc.cluster.local", port: 8086 }), identity: Some("linkerd-controller.linkerd.serviceaccount.identity.linkerd.cluster.local") }: request timed out
WARN [    19.626004s] linkerd2_proxy::app::profiles error fetching profile for linkerd-identity.linkerd.svc.cluster.local:8080: Status { code: Unknown, message: "the request could not be dispatched in a timely fashion" }
WARN [    27.629495s] linkerd2_proxy::app::profiles error fetching profile for linkerd-identity.linkerd.svc.cluster.local:8080: Status { code: Unknown, message: "the request could not be dispatched in a timely fashion" }
error: unexpected EOF
kubectl logs -n linkerd deploy/linkerd-controller -c linkerd-proxy
time="2019-09-20T19:55:08Z" level=info msg="running version stable-2.5.0"
INFO [     0.015205s] linkerd2_proxy::app::main using destination service at Some(ControlAddr { addr: Name(NameAddr { name: "localhost.", port: 8086 }), identity: None(NoPeerName(Loopback)) })
INFO [     0.015298s] linkerd2_proxy::app::main using identity service at Name(NameAddr { name: "linkerd-identity.linkerd.svc.cluster.local", port: 8080 })
INFO [     0.015304s] linkerd2_proxy::app::main routing on V4(127.0.0.1:4140)
INFO [     0.015311s] linkerd2_proxy::app::main proxying on V4(0.0.0.0:4143) to None
INFO [     0.015318s] linkerd2_proxy::app::main serving admin endpoint metrics on V4(0.0.0.0:4191)
INFO [     0.015321s] linkerd2_proxy::app::main protocol detection disabled for inbound ports {25, 3306}
INFO [     0.015329s] linkerd2_proxy::app::main protocol detection disabled for outbound ports {25, 3306}
ERR! [     5.022250s] admin={bg=identity} linkerd2_proxy::app::identity Failed to certify identity: grpc-status: Unknown, grpc-message: "the request could not be dispatched in a timely fashion"
INFO [    15.083243s] linkerd2_proxy::app::main Certified identity: linkerd-controller.linkerd.serviceaccount.identity.linkerd.cluster.local

The linkerd pods look healthy:

kubectl get po -n linkerd
NAME                                      READY   STATUS    RESTARTS   AGE
linkerd-controller-5bbdcc47c-k56v2        3/3     Running   0          75m
linkerd-grafana-57b9ccc985-bs5qg          2/2     Running   0          75m
linkerd-identity-67b46c77fb-v28fc         2/2     Running   0          75m
linkerd-prometheus-6c454f4976-rdbnd       2/2     Running   0          75m
linkerd-proxy-injector-7f75bcdfb7-t4f7l   2/2     Running   0          75m
linkerd-sp-validator-54d69c97db-h42zx     2/2     Running   0          75m
linkerd-tap-5d4598f4b4-dwmqb              2/2     Running   0          75m
linkerd-web-748bccc688-gfd7c              2/2     Running   0          75m

At this point, the relevant information that I see comes from the pods in the kube-system namespace:

kubectl get po -n kube-system
NAME                                                     READY   STATUS             RESTARTS   AGE
calico-node-rq26q                                        2/2     Running            0          75m
calico-node-vertical-autoscaler-65b4dc7b84-sjfgs         0/1     CrashLoopBackOff   19         76m
calico-typha-59cb487c49-rjvmp                            1/1     Running            0          76m
calico-typha-horizontal-autoscaler-848f6b79df-dfvgr      1/1     Running            0          76m
calico-typha-vertical-autoscaler-dd7d9bdff-srm9h         0/1     CrashLoopBackOff   19         76m
ip-masq-agent-gqs5t                                      1/1     Running            0          75m
kube-dns-5f886bf8d8-f424t                                4/4     Running            0          76m
kube-dns-autoscaler-57d56b4f56-z7xbd                     1/1     Running            0          76m
kube-proxy-gke-cpretzer-dev-default-pool-9c91f151-zjrc   1/1     Running            0          75m
kubernetes-dashboard-85bcf5dbf8-nlkk2                    1/1     Running            0          76m
l7-default-backend-8f479dd9-64f6z                        1/1     Running            0          76m
metrics-server-v0.3.1-8d4c5db46-tvdsp                    2/2     Running            0          76m
tiller-deploy-8458f6c667-xdl4t                           1/1     Running            0          76m

The logs from the calico pods show this:

cpretzer@kashyyyk: $ kubectl logs -f calico-node-vertical-autoscaler-65b4dc7b84-sjfgs -n kube-system
I0920 20:56:01.152074       1 autoscaler.go:46] Scaling namespace: kube-system, target: daemonset/calico-node
E0920 20:56:01.329593       1 autoscaler.go:49] unknown target kind: Tap
cpretzer@kashyyyk: $ kubectl logs -f calico-typha-vertical-autoscaler-dd7d9bdff-srm9h -n kube-system
I0920 20:56:40.210560       1 autoscaler.go:46] Scaling namespace: kube-system, target: deployment/calico-typha
E0920 20:56:40.340265       1 autoscaler.go:49] unknown target kind: Tap

My next steps will be to look into the autoscaler.go code to understand what the unknown target kind error means.

cpretzer commented 4 years ago

@michalschott I've done some digging through the code and I think that the unknown target kind error might not be related: https://github.com/kubernetes-incubator/cluster-proportional-vertical-autoscaler/blob/master/pkg/autoscaler/k8sclient/k8sclient.go#L117

I'll keep working to reproduce this and post updates here

cpretzer commented 4 years ago

@michalschott I see that you're using calico for CNI. Have you installed the linkerd-cni plugin with the linkerd install-cni command?

michalschott commented 4 years ago

@cpretzer Yes. It is needed due to a non-root PSP.
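
For reference, the plugin is installed the documented way before the control plane, roughly:

# Install the linkerd CNI plugin, then the control plane with CNI mode enabled
linkerd install-cni | kubectl apply -f -
linkerd install --linkerd-cni-enabled | kubectl apply -f -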

cpretzer commented 4 years ago

@michalschott I've taken the steps below to reproduce the behavior that you outlined in this issue.

The TL;DR is that the behavior is a symptom of how the instances are shut down as part of the AWS Auto Scaling Group lifecycle. In short, autoscaling the cluster does not include gracefully shutting down the Kubernetes cluster by draining the nodes.

You have a couple of options to address this (a sketch of the lifecycle-hook option follows below):

  • Use Auto Scaling Lifecycle Hooks
  • A more Kubernetes-centric way is to use the kubernetes/autoscaler project. The drawback to this is that the scaling is resource-based rather than time-based, so you won't be able to shut down the servers and start them at a specific time.
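
A minimal sketch of the lifecycle-hook option (the group name, hook name and timeout are illustrative; something still has to perform the drain and then complete the hook):

# Register a termination hook so the instance waits before shutting down
aws autoscaling put-lifecycle-hook \
  --lifecycle-hook-name drain-before-terminate \
  --auto-scaling-group-name workers.example.k8s.local \
  --lifecycle-transition autoscaling:EC2_INSTANCE_TERMINATING \
  --heartbeat-timeout 300 \
  --default-result CONTINUE

# After the node has been drained, the handler signals the ASG to proceed
aws autoscaling complete-lifecycle-action \
  --lifecycle-hook-name drain-before-terminate \
  --auto-scaling-group-name workers.example.k8s.local \
  --instance-id <instance-id> \
  --lifecycle-action-result CONTINUE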

My steps to reproduce

For the autoscaling piece, I tried a few different ways and got different results. First, I set up a Scheduled Action which scales the group to 0 at 5, 35, and 55 minutes after the hour. To complement this, I set up a Scheduled Action which scales the group out to 2 (I also tried 3 and 4) at minutes 0, 10, and 40.

When the ASG scaled up, I saw behavior identical or similar to yours, based on a few different knobs that I turned. Those knobs are:

Please have a read through the details here and let me know your thoughts. I suspect that I could go through the same steps without linkerd and get other problematic behavior. In short, this is a combination of a race condition and the Kubernetes cluster not having been shut down gracefully, so etcd is left in a state that does not allow the cluster to restart cleanly.

michalschott commented 4 years ago

@cpretzer So, to avoid any race conditions, we shut down the workers first, and after 15 minutes the master node goes down as well.

For spinning up, we take a similar approach: bring the master up first, then wait 15 minutes before spinning up any workers.

cpretzer commented 4 years ago

@michalschott I tested that scenario as well, and it's good that you're thinking about staggering the shutdown and startup of the master and worker nodes. From a Kubernetes perspective, I don't think this is sufficient to address potential race conditions, because the ASG is shutting down the operating system, but not gracefully shutting down the Kubernetes nodes by cordoning and draining them.
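
A minimal sketch of what a graceful per-node shutdown would look like before the instance is terminated (the node name is illustrative):

# Stop new pods from landing on the node, then evict the existing ones
kubectl cordon ip-10-0-1-23.eu-west-1.compute.internal
kubectl drain ip-10-0-1-23.eu-west-1.compute.internal --ignore-daemonsets --delete-local-data --grace-period=60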

I think that if we compared the etcd store of a cluster that was shut down by the ASG with one whose nodes were shut down gracefully, we would find state differences that ensure the latter starts properly.

cpretzer commented 4 years ago

Removing this from P0, because the AWS ASG shutdown lifecycle isn't terminating the cluster in the way workloads expect, possibly leaving the cluster in a bad state.

michalschott commented 4 years ago

@cpretzer I did some scaling activities with watch attached and can indeed see various races in there. I still think these retries could be handled better, though - maybe with some liveness probe tweaks?

cpretzer commented 4 years ago

@michalschott tweaking the liveness and readiness probes is worth a try.
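
As a sketch, a kustomize-style strategic merge patch that gives the identity container more slack on startup might look like the following (the /ping and /ready paths and port 9990 follow the identity admin server seen in the logs; the timing values are illustrative, not tested):

# probe-patch.yaml: relax the identity container's probes (values are examples)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: linkerd-identity
  namespace: linkerd
spec:
  template:
    spec:
      containers:
        - name: identity
          livenessProbe:
            httpGet:
              path: /ping
              port: 9990
            initialDelaySeconds: 30   # allow extra time right after node scale-up
            failureThreshold: 6
          readinessProbe:
            httpGet:
              path: /ready
              port: 9990
            periodSeconds: 10
            failureThreshold: 7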

I still have concerns about the cluster nodes not being properly drained during the shutdown process. For a truly reliable system, I suggest looking into the options I mentioned previously:

You have a couple of options to address this:

  • Use Auto Scaling Lifecycle Hooks
  • A more kubernetes-centric way is to use the kubernetes/autoscaler project. The drawback to this is that the scaling is resource and not time based, so you won't be able to shut down the servers and start them at a specific time.
cpretzer commented 4 years ago

@michalschott have you had any success with configuring your environment to properly drain the nodes before shutting them down?

Did you run any tests after adjusting the liveness and readiness probes?

michalschott commented 4 years ago

@cpretzer Hey, sorry for not providing an update.

We've switched to kube-downscaler, keeping the master nodes and the linkerd namespace up all the time - no problems so far.
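
For reference, the exclusion is just the kube-downscaler annotation on the namespace (assuming the standard downscaler/exclude annotation):

# Keep kube-downscaler away from the linkerd namespace
kubectl annotate namespace linkerd downscaler/exclude=true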

michalschott commented 4 years ago

@cpretzer Hi again.

We've upgraded the clusters to 1.14.8 / Calico 3.7.5 / Linkerd 2.6.0 and, as I've mentioned, downscaling is performed by kube-downscaler (the linkerd namespace is excluded) plus aws-cluster-autoscaler.

We're still observing random failures like the one below, even though linkerd-prometheus shows 2/2 Ready:

linkerd-prometheus-596555f645-c2smj prometheus level=info ts=2019-11-11T11:00:05.342Z caller=head.go:656 component=tsdb msg="WAL checkpoint complete" first=99 last=102 duration=2.753006292s
linkerd-prometheus-596555f645-c2smj prometheus level=info ts=2019-11-11T13:00:02.088Z caller=compact.go:495 component=tsdb msg="write block" mint=1573466400000 maxt=1573473600000 ulid=01DSD8SCAGPA7D5DV6JTG3ANES duration=1.879762869s
linkerd-prometheus-596555f645-c2smj prometheus level=info ts=2019-11-11T13:00:02.490Z caller=head.go:586 component=tsdb msg="head GC completed" duration=175.227074ms
linkerd-prometheus-596555f645-c2smj prometheus level=info ts=2019-11-11T13:00:05.140Z caller=head.go:656 component=tsdb msg="WAL checkpoint complete" first=103 last=106 duration=2.649028617s

public-api linkerd-controller-5f596c7b96-f8vhg linkerd-proxy INFO [     0.009034s] linkerd2_proxy::app::main protocol detection disabled for outbound ports {25, 587, 3306}
time="2019-11-11T13:13:35Z" level=error msg="Query(max(process_start_time_seconds{}) by (pod, namespace)) failed with: server_error: server error: 503"
linkerd-controller-5f596c7b96-f8vhg linkerd-proxy INFO [     0.115572s] linkerd2_proxy::app::main Certified identity: linkerd-controller.linkerd.serviceaccount.identity.linkerd.cluster.local
linkerd-controller-5f596c7b96-f8vhg linkerd-proxy WARN [    16.269807s] linkerd2_proxy::app::errors request aborted because it reached the configured dispatch deadline

linkerd check:
√ [kubernetes] control plane can talk to Kubernetes
× [prometheus] control plane can talk to Prometheus
    Error calling Prometheus from the control plane: server_error: server error: 503
    see https://linkerd.io/checks/#l5d-api-control-api for hints

Fixed after restarting Prometheus.

cpretzer commented 4 years ago

thanks for this report @michalschott

I have a system that appears to be in the same state, and I'll update when I have more info!

michalschott commented 4 years ago

@cpretzer I think there might be a race condition between linkerd-cni and other pods. For now I've patched the linkerd-cni DaemonSet with a much higher priorityClass and it somehow survived overnight.

cpretzer commented 4 years ago

@michalschott that's a really interesting find! Do you have logs or other info that demonstrate the race condition? A reproducible test case would be ideal.

Would you mind sharing your changes in a pull request?

michalschott commented 4 years ago

To try to reproduce the issue, I'd follow:

I think this is where the race condition happens - sometimes control plane pods might be scheduled before the linkerd-cni DaemonSet (because of equal priority), so routing/iptables is not set up properly to use the CNI overlay before other services have started.

In terms of a PR, I don't really have a proper solution for Helm, because we're using kustomize 1.x (the old one).

➜  apps git:(develop) ls -l linkerd-cni
total 20
-rw-r--r-- 1 ms 471 lis  9 16:24 affinity.yaml
-rw-r--r-- 1 ms 133 lis 12 16:34 kustomization.yaml
-rw-r--r-- 1 ms 209 lis  9 16:24 namespace.yaml
-rw-r--r-- 1 ms 135 lis 12 16:34 priorityclass-patch.yaml
-rw-r--r-- 1 ms 148 lis 12 16:34 priorityclass.yaml

➜  apps git:(develop) cat linkerd-cni/kustomization.yaml
---
resources:
  - daemonset.yaml
  - priorityclass.yaml

patches:
  - affinity.yaml
  - namespace.yaml
  - priorityclass-patch.yaml

➜  apps git:(develop) cat linkerd-cni/priorityclass.yaml
---
apiVersion: scheduling.k8s.io/v1
description: PriorityClass for linkerd-cni
kind: PriorityClass
metadata:
  name: linkerd-cni
value: 1000000000

➜  apps git:(develop) cat linkerd-cni/namespace.yaml
---
kind: Namespace
apiVersion: v1
metadata:
  name: linkerd
  annotations:
    linkerd.io/inject: disabled
  labels:
    linkerd.io/is-control-plane: "true"
    config.linkerd.io/admission-webhooks: disabled

➜  apps git:(develop) cat linkerd-cni/priorityclass-patch.yaml
---
kind: DaemonSet
apiVersion: apps/v1
metadata:
  name: linkerd-cni
spec:
  template:
    spec:
      priorityClassName: linkerd-cni

Not sharing affinity.yaml because it is cluster-specific.

So the install/upgrade process looks like:

linkerd install-cni > apps/linkerd-cni/daemonset.yaml
kustomize build . | k apply -f -

Since adding the priorityClass and priorityClassName, this is the third night with no linkerd control plane outage.

We had previously been looking at kube-node-ready-controller from Zalando (so that, for example, our apps cannot be scheduled on nodes where the log forwarder isn't in a ready state), but back then our k8s version was too old and we could not run it without code changes, which were rejected by the code owners.

cpretzer commented 4 years ago

thanks @michalschott I'm working on reproducing another issue, and will attempt to reproduce this soon.

In the meantime, it sounds like the priorityClass and priorityClassName are reasonable workarounds.

klingerf commented 4 years ago

I ran into the "unknown target kind: Tap" calico issue as well, on a GKE cluster with preemptible nodes enabled. When a node is preempted and a new one starts, the calico vertical autoscaler pods end up in CrashLoopBackOff, with:

I1121 19:53:20.424748       1 autoscaler.go:46] Scaling namespace: kube-system, target: daemonset/calico-node
E1121 19:53:20.574203       1 autoscaler.go:49] unknown target kind: Tap

My best guess is that there's an ordering issue: the v1alpha1.tap.linkerd.io APIService that the control plane installs is not yet visible to the calico pods when they start.
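
A quick diagnostic (not a fix) is to check whether that APIService is registered and reports Available around the time the autoscaler pod starts:

# Inspect the tap APIService and its Available condition
kubectl get apiservice v1alpha1.tap.linkerd.io
kubectl get apiservice v1alpha1.tap.linkerd.io -o jsonpath='{.status.conditions[?(@.type=="Available")].status}'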

Additional info about this cluster:

$ linkerd version
Client version: edge-19.11.2
Server version: edge-19.11.2

$ kubectl version --short
Client Version: v1.16.2
Server Version: v1.15.4-gke.18
stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

cpretzer commented 4 years ago

I think that the priorityClass and priorityClassName configurations should be sufficient to close this issue. In light of this calico issue and having looked through the cluster vertical autoscaler code, I think that this might be something that needs to be addressed upstream.

The other option is that we can rename the API to TapDaemonset and TapDeployment for each respective resource type in order to avoid the error.

HaithemSouala commented 4 years ago

> I think that the priorityClass and priorityClassName configurations should be sufficient to close this issue. In light of this calico issue and having looked through the cluster vertical autoscaler code, I think that this might be something that needs to be addressed upstream.
>
> The other option is that we can rename the API to TapDaemonset and TapDeployment for each respective resource type in order to avoid the error.

Any more details about that workaround? How would I implement it?

cpretzer commented 4 years ago

Hi @HaithemSouala

@michalschott gave a great description of the priorityClass and priorityClassName configurations that they created in this comment.

We can chat on Linkerd Slack, if that will help you with your configurations.

HaithemSouala commented 4 years ago

> Hi @HaithemSouala
>
> @michalschott gave a great description of the priorityClass and priorityClassName configurations that they created in this comment.
>
> We can chat on Linkerd Slack, if that will help you with your configurations.

Hi @cpretzer,

I posted a message on Slack.

cpretzer commented 4 years ago

@HaithemSouala @michalschott @daveio

@ialidzhikov wrote a fix for this, and it looks like the cpvpa team is planning to release it soon as v0.8.2.

I reproduced the error on GKE 1.15.9, then built a new image from master and updated the calico-node-vertical-autoscaler and calico-typha-vertical-autoscaler Deployment resources to use the new image, and the issue was fixed.
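
For anyone who wants to do the same before the release, something like the following should work (the container name and image reference are assumptions; verify them against the Deployment spec first):

# Point both calico vertical autoscalers at a patched cpvpa image
# (container name "autoscaler" and the image tag are assumptions)
kubectl -n kube-system set image deployment/calico-node-vertical-autoscaler autoscaler=<registry>/cpvpa:patched
kubectl -n kube-system set image deployment/calico-typha-vertical-autoscaler autoscaler=<registry>/cpvpa:patched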

Please keep an eye out for the official cpvpa v0.8.2 release so that you can update the images in your clusters.

cpretzer commented 4 years ago

Closing this because v0.8.2 has been released.