Closed michalschott closed 4 years ago
This appears to be an ordering issue resulting from scaling the cluster to zero. I would assume, though, that the proxies would keep retrying to get their certificates until identity comes back online.
That's correct, we're scaling down all worker nodes to 0.
Proxies are still trying to reach the identity service, but keep failing until the identity pods are manually restarted.
Oh, that's an interesting tidbit. The identity pod comes up after you've scaled up ... but that needs to be restarted again to make everything work?
That's correct.
@michalschott I'll work on reproducing this and let you know what I find.
@cpretzer any progress on this?
@michalschott
Sorry for the delay in the update to this issue. TL;DR: I ran a test on GKE and I think I've reproduced the issue. I found an error message in the calico pods which I have to explore further.
The Details:
Using the booksapp demo, the first interesting thing I found is that the app continues to serve requests: the booksapp page loads in my browser, despite the state of the pods described below.
In order to reproduce this, I created a three-node cluster in GKE and deployed linkerd and the booksapp demo. I then scaled the cluster to 0 nodes using the gcloud CLI, then scaled it back up to three nodes. This assumes that it's not the particular implementation (EKS vs GKE) that causes the errors, but rather the restarting of the pods in the cluster.
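For anyone who wants to repeat this, the scale-down/scale-up can be done with `gcloud container clusters resize`. This sketch only builds and prints the commands; the cluster, zone, and pool names are placeholders, not the ones from my test:

```shell
#!/bin/sh
# Sketch only: builds and echoes the resize commands rather than running them.
# CLUSTER, ZONE, and POOL are hypothetical placeholders.
CLUSTER=my-cluster
ZONE=us-west-1a
POOL=default-pool

SCALE_DOWN="gcloud container clusters resize $CLUSTER --node-pool $POOL --num-nodes 0 --zone $ZONE --quiet"
SCALE_UP="gcloud container clusters resize $CLUSTER --node-pool $POOL --num-nodes 3 --zone $ZONE --quiet"

echo "$SCALE_DOWN"
echo "$SCALE_UP"
```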
That being said, I see the following output from the proxy logs (I'm not using stern in my tests):
time="2019-09-20T19:55:08Z" level=info msg="running version stable-2.5.0"
INFO [ 0.005047s] linkerd2_proxy::app::main using destination service at Some(ControlAddr { addr: Name(NameAddr { name: "linkerd-destination.linkerd.svc.cluster.local", port: 8086 }), identity: Some("linkerd-controller.linkerd.serviceaccount.identity.linkerd.cluster.local") })
INFO [ 0.005182s] linkerd2_proxy::app::main using identity service at Name(NameAddr { name: "localhost.", port: 8080 })
INFO [ 0.005213s] linkerd2_proxy::app::main routing on V4(127.0.0.1:4140)
INFO [ 0.005232s] linkerd2_proxy::app::main proxying on V4(0.0.0.0:4143) to None
INFO [ 0.005285s] linkerd2_proxy::app::main serving admin endpoint metrics on V4(0.0.0.0:4191)
INFO [ 0.005309s] linkerd2_proxy::app::main protocol detection disabled for inbound ports {25, 3306}
INFO [ 0.005333s] linkerd2_proxy::app::main protocol detection disabled for outbound ports {25, 3306}
INFO [ 0.024629s] linkerd2_proxy::app::main Certified identity: linkerd-identity.linkerd.serviceaccount.identity.linkerd.cluster.local
WARN [ 15.147864s] linkerd-destination.linkerd.svc.cluster.local:8086 linkerd2_proxy::proxy::reconnect connect error to ControlAddr { addr: Name(NameAddr { name: "linkerd-destination.linkerd.svc.cluster.local", port: 8086 }), identity: Some("linkerd-controller.linkerd.serviceaccount.identity.linkerd.cluster.local") }: request timed out
WARN [ 19.626004s] linkerd2_proxy::app::profiles error fetching profile for linkerd-identity.linkerd.svc.cluster.local:8080: Status { code: Unknown, message: "the request could not be dispatched in a timely fashion" }
WARN [ 27.629495s] linkerd2_proxy::app::profiles error fetching profile for linkerd-identity.linkerd.svc.cluster.local:8080: Status { code: Unknown, message: "the request could not be dispatched in a timely fashion" }
error: unexpected EOF
kubectl logs -n linkerd deploy/linkerd-controller -c linkerd-proxy
time="2019-09-20T19:55:08Z" level=info msg="running version stable-2.5.0"
INFO [ 0.015205s] linkerd2_proxy::app::main using destination service at Some(ControlAddr { addr: Name(NameAddr { name: "localhost.", port: 8086 }), identity: None(NoPeerName(Loopback)) })
INFO [ 0.015298s] linkerd2_proxy::app::main using identity service at Name(NameAddr { name: "linkerd-identity.linkerd.svc.cluster.local", port: 8080 })
INFO [ 0.015304s] linkerd2_proxy::app::main routing on V4(127.0.0.1:4140)
INFO [ 0.015311s] linkerd2_proxy::app::main proxying on V4(0.0.0.0:4143) to None
INFO [ 0.015318s] linkerd2_proxy::app::main serving admin endpoint metrics on V4(0.0.0.0:4191)
INFO [ 0.015321s] linkerd2_proxy::app::main protocol detection disabled for inbound ports {25, 3306}
INFO [ 0.015329s] linkerd2_proxy::app::main protocol detection disabled for outbound ports {25, 3306}
ERR! [ 5.022250s] admin={bg=identity} linkerd2_proxy::app::identity Failed to certify identity: grpc-status: Unknown, grpc-message: "the request could not be dispatched in a timely fashion"
INFO [ 15.083243s] linkerd2_proxy::app::main Certified identity: linkerd-controller.linkerd.serviceaccount.identity.linkerd.cluster.local
The linkerd pods look healthy:
kubectl get po -n linkerd
NAME READY STATUS RESTARTS AGE
linkerd-controller-5bbdcc47c-k56v2 3/3 Running 0 75m
linkerd-grafana-57b9ccc985-bs5qg 2/2 Running 0 75m
linkerd-identity-67b46c77fb-v28fc 2/2 Running 0 75m
linkerd-prometheus-6c454f4976-rdbnd 2/2 Running 0 75m
linkerd-proxy-injector-7f75bcdfb7-t4f7l 2/2 Running 0 75m
linkerd-sp-validator-54d69c97db-h42zx 2/2 Running 0 75m
linkerd-tap-5d4598f4b4-dwmqb 2/2 Running 0 75m
linkerd-web-748bccc688-gfd7c 2/2 Running 0 75m
At this point, the relevant information that I see comes from the pods in the kube-system namespace:
kubectl get po -n kube-system
NAME READY STATUS RESTARTS AGE
calico-node-rq26q 2/2 Running 0 75m
calico-node-vertical-autoscaler-65b4dc7b84-sjfgs 0/1 CrashLoopBackOff 19 76m
calico-typha-59cb487c49-rjvmp 1/1 Running 0 76m
calico-typha-horizontal-autoscaler-848f6b79df-dfvgr 1/1 Running 0 76m
calico-typha-vertical-autoscaler-dd7d9bdff-srm9h 0/1 CrashLoopBackOff 19 76m
ip-masq-agent-gqs5t 1/1 Running 0 75m
kube-dns-5f886bf8d8-f424t 4/4 Running 0 76m
kube-dns-autoscaler-57d56b4f56-z7xbd 1/1 Running 0 76m
kube-proxy-gke-cpretzer-dev-default-pool-9c91f151-zjrc 1/1 Running 0 75m
kubernetes-dashboard-85bcf5dbf8-nlkk2 1/1 Running 0 76m
l7-default-backend-8f479dd9-64f6z 1/1 Running 0 76m
metrics-server-v0.3.1-8d4c5db46-tvdsp 2/2 Running 0 76m
tiller-deploy-8458f6c667-xdl4t 1/1 Running 0 76m
The logs from the calico pods show this:
cpretzer@kashyyyk: $ kubectl logs -f calico-node-vertical-autoscaler-65b4dc7b84-sjfgs -n kube-system
I0920 20:56:01.152074 1 autoscaler.go:46] Scaling namespace: kube-system, target: daemonset/calico-node
E0920 20:56:01.329593 1 autoscaler.go:49] unknown target kind: Tap
cpretzer@kashyyyk: $ kubectl logs -f calico-typha-vertical-autoscaler-dd7d9bdff-srm9h -n kube-system
I0920 20:56:40.210560 1 autoscaler.go:46] Scaling namespace: kube-system, target: deployment/calico-typha
E0920 20:56:40.340265 1 autoscaler.go:49] unknown target kind: Tap
My next steps will be to look into the autoscaler.go code to understand what the `unknown target kind` error means.
@michalschott I've done some digging through the code and I think that the `unknown target kind` error might not be related: https://github.com/kubernetes-incubator/cluster-proportional-vertical-autoscaler/blob/master/pkg/autoscaler/k8sclient/k8sclient.go#L117
I'll keep working to reproduce this and post updates here
@michalschott I see that you're using calico for CNI. Have you installed the linkerd-cni plugin with the `linkerd install-cni` command?
@cpretzer Yes. It is needed due to a non-root PSP.
@michalschott I've taken the steps below to reproduce the behavior that you outlined in this issue.
The TL;DR is that the behavior is a symptom of how the instances are shut down as part of the AWS Auto Scaling Group lifecycle. In short, autoscaling the cluster does not gracefully shut down the Kubernetes cluster by draining the nodes.
You have a couple of options to address this:
- Use Auto Scaling Lifecycle Hooks
- A more kubernetes-centric way is to use the kubernetes/autoscaler project. The drawback to this is that the scaling is resource-based and not time-based, so you won't be able to shut down the servers and start them at a specific time.
--zones us-west-2a \
--networking calico ${NAME} \
--node-count 3 \
--kubernetes-version=1.12.0
linkerd check --pre
linkerd install-cni
linkerd install --linkerd-cni-enabled
kubectl annotate ns default linkerd.io/inject=enabled
kubectl apply -f https://raw.githubusercontent.com/BuoyantIO/booksapp/master/booksapp.yml
For the autoscaling piece, I tried a few different ways and got different results. First, I set up a Scheduled Action which scales the group to 0 at 5, 35, and 55 minutes after the hour. To complement this, I set up a Scheduled Action which scales the group out to 2 (I also tried 3 and 4) at minutes 0, 10, and 40.
When the ASG scaled up, I saw behavior identical or similar to yours, depending on a few different knobs that I turned, such as the --ha flag.
Please have a read through the details here and let me know your thoughts. I suspect that I could go through the same steps without linkerd and see other problematic behavior. In short, this is a race condition combined with not having shut down the kubernetes cluster gracefully, so etcd is left in a state that does not allow the cluster to restart cleanly.
@cpretzer so to avoid any race conditions, we're shutting down the workers first, and after 15 minutes the master node goes down as well.
For spinning up, we take a similar approach - bring the master up first, then wait 15 minutes before spinning up any workers.
@michalschott I tested that scenario as well, and it's good that you're thinking about staggering the shutdown and startup of the master and worker nodes. From a Kubernetes perspective, I don't think this is sufficient to address potential race conditions, because the ASG shuts down the operating system but does not gracefully shut down the Kubernetes cluster by draining the nodes.
I think that if we compared the etcd store of a cluster that was shut down by the ASG with one whose nodes were shut down gracefully, we would find state differences that ensure the latter starts properly.
Removing this from P0 because the AWS ASG shutdown lifecycle isn't terminating the cluster in the way workloads expect, possibly leaving the cluster in a bad state.
@cpretzer I did some scaling activities with a watch attached, and I can indeed see various races in there. I still think these retries could be handled better - maybe some liveness probe tweaks?
@michalschott tweaking the liveness and readiness probes is worth a try.
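If you experiment with that, a strategic-merge patch along these lines is probably the place to start. This is only a sketch: the probe paths and port below are assumptions on my part, so check them against the probes in your deployed linkerd-identity manifest before applying:

```yaml
# Illustrative patch: relax the identity container's probes so a slow-starting
# control plane isn't restarted prematurely. Paths and port are assumptions --
# verify them against the actual linkerd-identity Deployment.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: linkerd-identity
  namespace: linkerd
spec:
  template:
    spec:
      containers:
      - name: identity
        livenessProbe:
          httpGet:
            path: /ping
            port: 9990
          initialDelaySeconds: 60
          failureThreshold: 6
        readinessProbe:
          httpGet:
            path: /ready
            port: 9990
          initialDelaySeconds: 10
          failureThreshold: 6
```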
I still have concerns about the cluster nodes not being properly drained during the shutdown process. For a truly reliable system, I suggest looking into the options I mentioned previously:
You have a couple of options to address this:
- Use Auto Scaling Lifecycle Hooks
- A more kubernetes-centric way is to use the kubernetes/autoscaler project. The drawback to this is that the scaling is resource-based and not time-based, so you won't be able to shut down the servers and start them at a specific time.
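To make option 1 concrete, the rough shape of a termination lifecycle hook is below. This is a sketch that only prints the commands; the ASG and hook names are placeholders, and something in your environment (for example a small daemon on the node) still has to run the drain and complete the lifecycle action:

```shell
#!/bin/sh
# Sketch only: builds and echoes the AWS CLI calls instead of executing them.
# ASG and HOOK are hypothetical placeholders.
ASG=my-worker-asg
HOOK=drain-before-terminate

# 1) Register a hook that pauses instance termination for up to 5 minutes.
PUT_HOOK="aws autoscaling put-lifecycle-hook \
  --lifecycle-hook-name $HOOK \
  --auto-scaling-group-name $ASG \
  --lifecycle-transition autoscaling:EC2_INSTANCE_TERMINATING \
  --heartbeat-timeout 300 \
  --default-result CONTINUE"

# 2) Whatever handles the hook drains the node, then completes the action.
DRAIN="kubectl drain \$NODE_NAME --ignore-daemonsets --delete-local-data"
COMPLETE="aws autoscaling complete-lifecycle-action \
  --lifecycle-hook-name $HOOK \
  --auto-scaling-group-name $ASG \
  --lifecycle-action-result CONTINUE \
  --instance-id \$INSTANCE_ID"

echo "$PUT_HOOK"
echo "$DRAIN"
echo "$COMPLETE"
```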
@michalschott have you had any success with configuring your environment to properly drain the nodes before shutting them down?
Did you run any tests after adjusting the liveness and readiness probes?
@cpretzer Hey, sorry for not providing an update.
We've switched to kube-downscaler, keeping the master nodes and the linkerd namespace up all the time - no problems so far.
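In case it helps anyone else: the exclusion is just an annotation on the namespace. The annotation name below is what I remember from the kube-downscaler docs, so double-check it against the version you run:

```yaml
# Illustrative: keep the linkerd namespace out of kube-downscaler's reach.
# Annotation name per the kube-downscaler README -- verify for your version.
apiVersion: v1
kind: Namespace
metadata:
  name: linkerd
  annotations:
    downscaler/exclude: "true"
```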
@cpretzer Hi again.
We've upgraded the clusters to 1.14.8 / calico 3.7.5 / linkerd 2.6.0 and, like I've mentioned, downscaling is performed by kube-downscaler (the linkerd namespace is excluded) plus aws-cluster-autoscaler.
We're still observing random failures like this one:
linkerd-prometheus 2/2 Ready
linkerd-prometheus-596555f645-c2smj prometheus level=info ts=2019-11-11T11:00:05.342Z caller=head.go:656 component=tsdb msg="WAL checkpoint complete" first=99 last=102 duration=2.753006292s
linkerd-prometheus-596555f645-c2smj prometheus level=info ts=2019-11-11T13:00:02.088Z caller=compact.go:495 component=tsdb msg="write block" mint=1573466400000 maxt=1573473600000 ulid=01DSD8SCAGPA7D5DV6JTG3ANES duration=1.879762869s
linkerd-prometheus-596555f645-c2smj prometheus level=info ts=2019-11-11T13:00:02.490Z caller=head.go:586 component=tsdb msg="head GC completed" duration=175.227074ms
linkerd-prometheus-596555f645-c2smj prometheus level=info ts=2019-11-11T13:00:05.140Z caller=head.go:656 component=tsdb msg="WAL checkpoint complete" first=103 last=106 duration=2.649028617s
public-api linkerd-controller-5f596c7b96-f8vhg linkerd-proxy INFO [ 0.009034s] linkerd2_proxy::app::main protocol detection disabled for outbound ports {25, 587, 3306}
time="2019-11-11T13:13:35Z" level=error msg="Query(max(process_start_time_seconds{}) by (pod, namespace)) failed with: server_error: server error: 503"
linkerd-controller-5f596c7b96-f8vhg linkerd-proxy INFO [ 0.115572s] linkerd2_proxy::app::main Certified identity: linkerd-controller.linkerd.serviceaccount.identity.linkerd.cluster.local
linkerd-controller-5f596c7b96-f8vhg linkerd-proxy WARN [ 16.269807s] linkerd2_proxy::app::errors request aborted because it reached the configured dispatch deadline
linkerd check:
√ [kubernetes] control plane can talk to Kubernetes
× [prometheus] control plane can talk to Prometheus
Error calling Prometheus from the control plane: server_error: server error: 503
see https://linkerd.io/checks/#l5d-api-control-api for hints
Fixed after restarting prometheus.
thanks for this report @michalschott
I have a system that appears to be in the same state, and I'll update when I have more info!
@cpretzer I think there might be race conditions between linkerd-cni and other pods. For now I've patched the linkerd-cni DaemonSet with a much higher priorityClass and it has survived overnight.
@michalschott that's a really interesting find! Do you have logs or other info that demonstrate the race condition? A reproducible test case would be ideal.
Would you mind sharing your changes in a pull request?
To try to reproduce the issue I'd follow:
I think this is where the race condition happens - sometimes control-plane pods might be scheduled before the linkerd-cni DaemonSet (because of equal priority), so routing/iptables is not set up to use the CNI overlay before other services have started.
In terms of a PR, I don't really have a proper solution for helm, because we're using kustomize 1.x (the old one).
➜ apps git:(develop) ls -l linkerd-cni
total 20
-rw-r--r-- 1 ms 471 Nov  9 16:24 affinity.yaml
-rw-r--r-- 1 ms 133 Nov 12 16:34 kustomization.yaml
-rw-r--r-- 1 ms 209 Nov  9 16:24 namespace.yaml
-rw-r--r-- 1 ms 135 Nov 12 16:34 priorityclass-patch.yaml
-rw-r--r-- 1 ms 148 Nov 12 16:34 priorityclass.yaml
➜ apps git:(develop) cat linkerd-cni/kustomization.yaml
---
resources:
- daemonset.yaml
- priorityclass.yaml
patches:
- affinity.yaml
- namespace.yaml
- priorityclass-patch.yaml
➜ apps git:(develop) cat linkerd-cni/priorityclass.yaml
---
apiVersion: scheduling.k8s.io/v1
description: PriorityClass for linkerd-cni
kind: PriorityClass
metadata:
  name: linkerd-cni
value: 1000000000
➜ apps git:(develop) cat linkerd-cni/namespace.yaml
---
kind: Namespace
apiVersion: v1
metadata:
  name: linkerd
  annotations:
    linkerd.io/inject: disabled
  labels:
    linkerd.io/is-control-plane: "true"
    config.linkerd.io/admission-webhooks: disabled
➜ apps git:(develop) cat linkerd-cni/priorityclass-patch.yaml
---
kind: DaemonSet
apiVersion: apps/v1
metadata:
  name: linkerd-cni
spec:
  template:
    spec:
      priorityClassName: linkerd-cni
Not sharing affinity.yaml because it is cluster-specific.
So install/upgrade process looks like:
linkerd install-cni > apps/linkerd-cni/daemonset.yaml
kustomize build . | k apply -f -
Since adding the priorityClass and priorityClassName, this is the 3rd night we've had no linkerd control-plane outage.
We were previously looking to use kube-node-ready-controller from Zalando (so that, e.g., our apps cannot be scheduled on nodes where the log forwarder isn't in a ready state), but back then our k8s version was too old and we could not run it without code changes, which were rejected by the code owners.
thanks @michalschott, I'm working on reproducing another issue, and will attempt to reproduce this soon.
In the meantime, it sounds like the priorityClass and priorityClassName are reasonable workarounds.
I ran into the "unknown target kind: Tap" calico issue as well, on a GKE cluster with preemptible nodes enabled. When the node is preempted and a new one starts, the calico vertical autoscaler pods end up in CrashLoopBackOff, with:
I1121 19:53:20.424748 1 autoscaler.go:46] Scaling namespace: kube-system, target: daemonset/calico-node
E1121 19:53:20.574203 1 autoscaler.go:49] unknown target kind: Tap
Best guess is that there's an ordering issue: the v1alpha1.tap.linkerd.io APIService that the control plane installs is not visible to the calico pods when they are starting.
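One way to test that guess would be to watch whether the Tap APIService reports Available while the autoscaler is crashlooping. A sketch of the check, printed rather than executed here since it needs a live cluster:

```shell
#!/bin/sh
# Sketch only: echoes the command to inspect the Tap APIService condition.
CHECK="kubectl get apiservice v1alpha1.tap.linkerd.io -o jsonpath='{.status.conditions[?(@.type==\"Available\")].status}'"
echo "$CHECK"
```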
Additional info about this cluster:
$ linkerd version
Client version: edge-19.11.2
Server version: edge-19.11.2
$ kubectl version --short
Client Version: v1.16.2
Server Version: v1.15.4-gke.18
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
I think that the priorityClass and priorityClassName configurations should be sufficient to close this issue. In light of this calico issue, and having looked through the cluster vertical autoscaler code, I think this might be something that needs to be addressed upstream.
The other option is that we rename the API to TapDaemonset and TapDeployment for each respective resource type in order to avoid the error.
Any more details about that workaround? how to implement that?
Hi @HaithemSouala,
@michalschott gave a great description of the priorityClass and priorityClassName configurations that they created in this comment.
We can chat on Linkerd Slack, if that will help you with your configurations.
Hi @cpretzer,
I posted a message on Slack.
@HaithemSouala @michalschott @daveio
@ialidzhikov wrote a fix for this, and it looks like the cpvpa team is planning to release it soon as v0.8.2.
I reproduced the error on GKE 1.15.9, then built a new image from master and updated the calico-node-vertical-autoscaler and calico-typha-vertical-autoscaler Deployment resources to use the new image, and the issue was fixed.
Please keep an eye out for the official cpvpa v0.8.2 release so that you can update the images in your clusters.
Closing this because v0.8.2 has been released.
Bug Report
What is the issue?
Kubernetes 1.12.10 + Calico CNI 3.7.4 (KOPS 1.12.2), obviously on AWS.
Currently running edge-19.8.7, but we had similar problem with versions 2.3.0, 2.3.1, 2.3.2, 2.4.0 and 2.5.0. Happens with --ha control plane installed as well.
Cluster is basically non-HA - single AZ, single master node and a few workers. To save some money, we're using ASG scheduled scaling to scale the ASGs to 0 and bring them back up for working hours.
Every morning we face the same problem:
After restarting the linkerd-identity pod (k -n linkerd delete pod) you need to wait a few secs, and it starts to work. So my understanding is that retries are not being handled properly?
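As an aside, instead of looking up the generated pod name each morning, the identity pod can be targeted by label. The label below is the control-plane convention as I understand it, so verify it with kubectl get po -n linkerd --show-labels first:

```shell
#!/bin/sh
# Sketch only: echoes the restart command instead of running it.
# Selects the identity pod by its control-plane label, not by pod name.
RESTART="kubectl -n linkerd delete pod -l linkerd.io/control-plane-component=identity"
echo "$RESTART"
```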
Attaching linkerd configmaps: