@falcoriss I've been trying to reproduce this on a cluster and am not seeing the issues that you describe.
My environment is slightly different, though, so I think it's worth digging in to the Kubernetes cluster configuration and logs to see if we can figure out the cause.
When you simulate the error by powering down the node, can you collect the kubelet logs from the other nodes?
Which network plugin are you using? Have you checked the log output from the network plugin pods?
In the original post, there is an example of a pod from a deleted replica that isn't starting:
the-deleted-replica-74558f4547-rdzbl 0/1 Running 0 37s 10.233.113.238 sucv-k8s-worker04.predev.bee-labs.net
Is there any log output from that pod? What is the output from kubectl get events in the linkerd namespace and the namespace where the application is running?
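In case it helps, here is a minimal sketch of commands to collect that information (the namespace placeholder and the Weave label selector are assumptions; adjust them for your cluster and CNI):

```sh
# Kubelet logs from a surviving node, around the time of the failure
journalctl -u kubelet --since "15 minutes ago"

# Network plugin logs (example assumes Weave; adjust the selector for your CNI)
kubectl logs -n kube-system -l name=weave-net -c weave --tail=200

# Events in the linkerd namespace and in the application namespace
kubectl get events -n linkerd --sort-by=.lastTimestamp
kubectl get events -n <app-namespace> --sort-by=.lastTimestamp

# Logs from the pod that came up unmeshed (pod name taken from the original post)
kubectl logs -n <app-namespace> the-deleted-replica-74558f4547-rdzbl
```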
> On the proxy injector side, it sounds like you might be running into this kubernetes issue. What's in the logs for api-server? Is it contacting the correct pods or still routing to the wrong place?
This is what we get from the api-server when the node goes down
[kube-apiserver-sucv-k8s-master03.predev.bee-labs.net] E1022 07:46:43.408998 1 available_controller.go:311] v1alpha1.tap.linkerd.io failed with: Get https://10.233.43.63:443: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
[kube-apiserver-sucv-k8s-master01.predev.bee-labs.net] E1022 07:46:43.466882 1 available_controller.go:311] v1alpha1.tap.linkerd.io failed with: Operation cannot be fulfilled on apiservices.apiregistration.k8s.io "v1alpha1.tap.linkerd.io": the object has been modified; please apply your changes to the latest version and try again
[kube-apiserver-sucv-k8s-master03.predev.bee-labs.net] E1022 07:46:43.998088 1 memcache.go:134] couldn't get resource list for tap.linkerd.io/v1alpha1: the server is currently unable to handle the request
[kube-apiserver-sucv-k8s-master03.predev.bee-labs.net] I1022 07:46:47.541008 1 controller.go:105] OpenAPI AggregationController: Processing item v1alpha1.tap.linkerd.io
[kube-apiserver-sucv-k8s-master02.predev.bee-labs.net] I1022 07:46:47.641448 1 controller.go:105] OpenAPI AggregationController: Processing item v1alpha1.tap.linkerd.io
[kube-apiserver-sucv-k8s-master01.predev.bee-labs.net] I1022 07:46:47.657622 1 controller.go:105] OpenAPI AggregationController: Processing item v1alpha1.tap.linkerd.io
[kube-apiserver-sucv-k8s-master03.predev.bee-labs.net] E1022 07:46:48.415306 1 available_controller.go:311] v1alpha1.tap.linkerd.io failed with: Operation cannot be fulfilled on apiservices.apiregistration.k8s.io "v1alpha1.tap.linkerd.io": the object has been modified; please apply your changes to the latest version and try again
[kube-apiserver-sucv-k8s-master03.predev.bee-labs.net] E1022 07:46:51.277151 1 controller.go:111] loading OpenAPI spec for "v1alpha1.tap.linkerd.io" failed with: ERROR $root.paths./apis/tap.linkerd.io.get is missing required property: responses
[kube-apiserver-sucv-k8s-master02.predev.bee-labs.net] E1022 07:46:51.280774 1 controller.go:111] loading OpenAPI spec for "v1alpha1.tap.linkerd.io" failed with: ERROR $root.paths./apis/tap.linkerd.io.get is missing required property: responses
[kube-apiserver-sucv-k8s-master03.predev.bee-labs.net] ERROR $root.paths./apis/tap.linkerd.io/v1alpha1.get is missing required property: responses
[kube-apiserver-sucv-k8s-master02.predev.bee-labs.net] ERROR $root.paths./apis/tap.linkerd.io/v1alpha1.get is missing required property: responses
[kube-apiserver-sucv-k8s-master03.predev.bee-labs.net] I1022 07:46:51.277192 1 controller.go:119] OpenAPI AggregationController: action for item v1alpha1.tap.linkerd.io: Rate Limited Requeue.
[kube-apiserver-sucv-k8s-master02.predev.bee-labs.net] I1022 07:46:51.280808 1 controller.go:119] OpenAPI AggregationController: action for item v1alpha1.tap.linkerd.io: Rate Limited Requeue.
[kube-apiserver-sucv-k8s-master01.predev.bee-labs.net] E1022 07:46:51.381420 1 controller.go:111] loading OpenAPI spec for "v1alpha1.tap.linkerd.io" failed with: ERROR $root.paths./apis/tap.linkerd.io.get is missing required property: responses
[kube-apiserver-sucv-k8s-master01.predev.bee-labs.net] ERROR $root.paths./apis/tap.linkerd.io/v1alpha1.get is missing required property: responses
[kube-apiserver-sucv-k8s-master01.predev.bee-labs.net] I1022 07:46:51.381452 1 controller.go:119] OpenAPI AggregationController: action for item v1alpha1.tap.linkerd.io: Rate Limited Requeue.
[kube-apiserver-sucv-k8s-master03.predev.bee-labs.net] E1022 07:46:53.421816 1 available_controller.go:311] v1alpha1.tap.linkerd.io failed with: Get https://10.233.43.63:443: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
I could see packets going to down endpoints at some point, but it hasn't happened since, so I'm not sure what it was.
@cpretzer For the few minutes following the shutdown of the node, the only log I get from the kubelet on another node is this:
oct. 22 17:23:49 myhost.mydomain.lan kubelet[1971]: I1022 17:23:49.031477 1971 setters.go:72] Using node IP: "10.0.4.141"
oct. 22 17:23:59 myhost.mydomain.lan kubelet[1971]: I1022 17:23:59.044851 1971 setters.go:72] Using node IP: "10.0.4.141"
oct. 22 17:24:09 myhost.mydomain.lan kubelet[1971]: I1022 17:24:09.057659 1971 setters.go:72] Using node IP: "10.0.4.141"
Not quite useful, I believe :s
We're using the Weave plugin, but I tried with Calico as well and the problem remains.
> In the original post, there is an example of a pod from a deleted replica that isn't starting:
It's because it had just been created and hadn't had enough time to start.
> Is there any log output from that pod? What is the output from kubectl get events in the linkerd namespace and the namespace where the application is running?
The output of that pod is irrelevant; it's running after about 1 min. I just included it to show that it was unmeshed. Here are the relevant events from linkerd:
$ kubectl -n linkerd get events | grep -v 'Container image' | grep -v 'Created container' | grep -v 'Started container' | grep -v 'Successfully pulled image' | grep -v 'Created pod:' | grep -v 'Successfully assigned' | grep -v 'pulling image'
LAST SEEN TYPE REASON KIND MESSAGE
114m Normal SandboxChanged Pod Pod sandbox changed, it will be killed and re-created.
113m Warning Unhealthy Pod Readiness probe failed: HTTP probe failed with statuscode: 502
113m Warning Unhealthy Pod Readiness probe failed: HTTP probe failed with statuscode: 503
113m Warning Unhealthy Pod Liveness probe failed: HTTP probe failed with statuscode: 502
113m Warning BackOff Pod Back-off restarting failed container
114m Normal SandboxChanged Pod Pod sandbox changed, it will be killed and re-created.
113m Warning Unhealthy Pod Readiness probe failed: HTTP probe failed with statuscode: 502
113m Warning Unhealthy Pod Readiness probe failed: HTTP probe failed with statuscode: 503
113m Warning Unhealthy Pod Liveness probe failed: HTTP probe failed with statuscode: 502
113m Warning BackOff Pod Back-off restarting failed container
114m Normal SandboxChanged Pod Pod sandbox changed, it will be killed and re-created.
113m Warning Unhealthy Pod Readiness probe failed: HTTP probe failed with statuscode: 502
113m Warning Unhealthy Pod Readiness probe failed: HTTP probe failed with statuscode: 503
113m Warning Unhealthy Pod Liveness probe failed: HTTP probe failed with statuscode: 502
113m Warning BackOff Pod Back-off restarting failed container
114m Normal SandboxChanged Pod Pod sandbox changed, it will be killed and re-created.
113m Warning Unhealthy Pod Readiness probe failed: HTTP probe failed with statuscode: 502
113m Warning Unhealthy Pod Readiness probe failed: HTTP probe failed with statuscode: 503
113m Warning Unhealthy Pod Liveness probe failed: HTTP probe failed with statuscode: 502
113m Normal Killing Pod Killing container with id docker://proxy-injector:Container failed liveness probe.. Container will be killed and recreated.
114m Normal SandboxChanged Pod Pod sandbox changed, it will be killed and re-created.
113m Warning Unhealthy Pod Readiness probe failed: HTTP probe failed with statuscode: 503
106m Warning FailedMount Pod MountVolume.SetUp failed for volume "linkerd-tap-token-jkg4h" : couldn't propagate object cache: timed out waiting for the condition
114m Normal SandboxChanged Pod Pod sandbox changed, it will be killed and re-created.
113m Warning Unhealthy Pod Readiness probe failed: HTTP probe failed with statuscode: 503
106m Warning FailedMount Pod MountVolume.SetUp failed for volume "linkerd-web-token-bgd57" : couldn't propagate object cache: timed out waiting for the condition
As for our app's namespace, there are no relevant logs, only events linked to our own actions (pod creates, deletes, scale-ups, etc.).
I'm really puzzled here; there seems to be something weird with our cluster, but I can't really pinpoint where the problem could be. I'd gladly test on something other than Kubespray. I need several nodes and the ability to shut them down, so if you know of any other tooling that would let us do that, I'm keen to give it a try.
@falcoriss I would expect to see failures in the api-server logs when contacting the proxy-injector. It seems strange that they're not there.
Maybe I missed something this morning. Here are the errors you're talking about; I ran another test just now:
[kube-apiserver-sucv-k8s-master03.predev.bee-labs.net] E1023 15:54:39.142416 1 controller.go:111] loading OpenAPI spec for "v1alpha1.tap.linkerd.io" failed with: ERROR $root.paths./apis/tap.linkerd.io.get is missing required property: responses
[kube-apiserver-sucv-k8s-master03.predev.bee-labs.net] ERROR $root.paths./apis/tap.linkerd.io/v1alpha1.get is missing required property: responses
[kube-apiserver-sucv-k8s-master03.predev.bee-labs.net] I1023 15:54:39.142448 1 controller.go:119] OpenAPI AggregationController: action for item v1alpha1.tap.linkerd.io: Rate Limited Requeue.
[kube-apiserver-sucv-k8s-master01.predev.bee-labs.net] E1023 15:55:56.244922 1 watch.go:233] unable to encode watch object *v1.WatchEvent: http2: stream closed (&streaming.encoder{writer:(*metrics.fancyResponseWriterDelegator)(0xc42afb7c78), encoder:(*versioning.codec)(0xc434cb0c60), buf:(*bytes.Buffer)(0xc433a9da40)})
[kube-apiserver-sucv-k8s-master01.predev.bee-labs.net] E1023 15:55:56.244924 1 watch.go:233] unable to encode watch object *v1.WatchEvent: client disconnected (&streaming.encoder{writer:(*metrics.fancyResponseWriterDelegator)(0xc426b48c58), encoder:(*versioning.codec)(0xc43a6ff9e0), buf:(*bytes.Buffer)(0xc44013dc00)})
[kube-apiserver-sucv-k8s-master01.predev.bee-labs.net] E1023 15:55:56.245021 1 watch.go:233] unable to encode watch object *v1.WatchEvent: client disconnected (&streaming.encoder{writer:(*metrics.fancyResponseWriterDelegator)(0xc437cd8770), encoder:(*versioning.codec)(0xc435169b00), buf:(*bytes.Buffer)(0xc42fa93ea0)})
[kube-apiserver-sucv-k8s-master01.predev.bee-labs.net] E1023 15:55:56.245254 1 watch.go:233] unable to encode watch object *v1.WatchEvent: client disconnected (&streaming.encoder{writer:(*metrics.fancyResponseWriterDelegator)(0xc439994170), encoder:(*versioning.codec)(0xc427709e60), buf:(*bytes.Buffer)(0xc4317ad880)})
[kube-apiserver-sucv-k8s-master01.predev.bee-labs.net] E1023 15:55:56.245409 1 watch.go:233] unable to encode watch object *v1.WatchEvent: client disconnected (&streaming.encoder{writer:(*metrics.fancyResponseWriterDelegator)(0xc432cc1768), encoder:(*versioning.codec)(0xc432d8a2d0), buf:(*bytes.Buffer)(0xc42f0291f0)})
[kube-apiserver-sucv-k8s-master01.predev.bee-labs.net] W1023 15:56:11.272419 1 dispatcher.go:68] Failed calling webhook, failing open linkerd-proxy-injector.linkerd.io: failed calling admission webhook "linkerd-proxy-injector.linkerd.io": Post https://linkerd-proxy-injector.linkerd.svc:443/?timeout=30s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)
[kube-apiserver-sucv-k8s-master01.predev.bee-labs.net] E1023 15:56:11.272456 1 dispatcher.go:69] failed calling admission webhook "linkerd-proxy-injector.linkerd.io": Post https://linkerd-proxy-injector.linkerd.svc:443/?timeout=30s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)
[kube-apiserver-sucv-k8s-master01.predev.bee-labs.net] I1023 15:56:11.282151 1 trace.go:76] Trace[473279215]: "Create /api/v1/namespaces/kube-system/pods" (started: 2019-10-23 15:55:41.254785534 +0000 UTC m=+23666.636581746) (total time: 30.027338212s):
[kube-apiserver-sucv-k8s-master01.predev.bee-labs.net] Trace[473279215]: [30.017729647s] [30.017507699s] About to store object in database
[kube-apiserver-sucv-k8s-master02.predev.bee-labs.net] E1023 15:56:11.309372 1 memcache.go:134] couldn't get resource list for tap.linkerd.io/v1alpha1: Get https://[::1]:6443/apis/tap.linkerd.io/v1alpha1?timeout=32s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
[kube-apiserver-sucv-k8s-master02.predev.bee-labs.net] I1023 15:56:29.771157 1 controller.go:105] OpenAPI AggregationController: Processing item v1alpha1.tap.linkerd.io
[kube-apiserver-sucv-k8s-master01.predev.bee-labs.net] I1023 15:56:29.821401 1 controller.go:105] OpenAPI AggregationController: Processing item v1alpha1.tap.linkerd.io
[kube-apiserver-sucv-k8s-master01.predev.bee-labs.net] E1023 15:56:31.831045 1 controller.go:111] loading OpenAPI spec for "v1alpha1.tap.linkerd.io" failed with: ERROR $root.paths./apis/tap.linkerd.io.get is missing required property: responses
[kube-apiserver-sucv-k8s-master01.predev.bee-labs.net] ERROR $root.paths./apis/tap.linkerd.io/v1alpha1.get is missing required property: responses
[kube-apiserver-sucv-k8s-master01.predev.bee-labs.net] I1023 15:56:31.831075 1 controller.go:119] OpenAPI AggregationController: action for item v1alpha1.tap.linkerd.io: Rate Limited Requeue.
[kube-apiserver-sucv-k8s-master03.predev.bee-labs.net] I1023 15:56:33.327561 1 controller.go:105] OpenAPI AggregationController: Processing item v1alpha1.tap.linkerd.io
[kube-apiserver-sucv-k8s-master03.predev.bee-labs.net] E1023 15:56:35.226613 1 controller.go:111] loading OpenAPI spec for "v1alpha1.tap.linkerd.io" failed with: ERROR $root.paths./apis/tap.linkerd.io.get is missing required property: responses
[kube-apiserver-sucv-k8s-master03.predev.bee-labs.net] ERROR $root.paths./apis/tap.linkerd.io/v1alpha1.get is missing required property: responses
[kube-apiserver-sucv-k8s-master03.predev.bee-labs.net] I1023 15:56:35.226675 1 controller.go:119] OpenAPI AggregationController: action for item v1alpha1.tap.linkerd.io: Rate Limited Requeue.
[kube-apiserver-sucv-k8s-master03.predev.bee-labs.net] W1023 15:56:40.402525 1 dispatcher.go:68] Failed calling webhook, failing open linkerd-proxy-injector.linkerd.io: failed calling admission webhook "linkerd-proxy-injector.linkerd.io": Post https://linkerd-proxy-injector.linkerd.svc:443/?timeout=30s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
[kube-apiserver-sucv-k8s-master03.predev.bee-labs.net] E1023 15:56:40.402559 1 dispatcher.go:69] failed calling admission webhook "linkerd-proxy-injector.linkerd.io": Post https://linkerd-proxy-injector.linkerd.svc:443/?timeout=30s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
[kube-apiserver-sucv-k8s-master03.predev.bee-labs.net] I1023 15:56:40.408668 1 trace.go:76] Trace[838174339]: "Create /api/v1/namespaces/predev/pods" (started: 2019-10-23 15:56:10.401737216 +0000 UTC m=+23691.884467674) (total time: 30.006892828s):
[kube-apiserver-sucv-k8s-master03.predev.bee-labs.net] Trace[838174339]: [30.000857718s] [30.000750115s] About to store object in database
[kube-apiserver-sucv-k8s-master02.predev.bee-labs.net] E1023 15:57:13.324040 1 memcache.go:134] couldn't get resource list for tap.linkerd.io/v1alpha1: Get https://[::1]:6443/apis/tap.linkerd.io/v1alpha1?timeout=32s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)
I also tried the suggestion to disable HTTP/2 on the api-server referenced in this issue, but besides creating lots of slow queries everywhere, it doesn't improve the problem =)
This really looks like something is going funky with your cluster. The connection to linkerd-proxy-injector.linkerd.svc is going through a cluster IP and timing out. Have you tried replicating this without Linkerd in the mix at all?
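One quick way to see what that cluster IP is backed by (a sketch; the Service and namespace names are the ones from the default Linkerd install):

```sh
# The Service (cluster IP) the api-server dials for the webhook
kubectl get svc -n linkerd linkerd-proxy-injector -o wide

# The endpoints behind it; a pod on the dead node should show up as
# not-ready here and eventually be removed
kubectl get endpoints -n linkerd linkerd-proxy-injector -o yaml
```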
@falcoriss I set up a new 1.12.3 cluster with weave using kubeadm on EC2, and took the following steps to reproduce the behavior:
- Create the booksapp namespace and annotate it with kubectl annotate ns booksapp linkerd.io/inject=enabled
- Install [booksapp](https://github.com/BuoyantIO/booksapp/) so that the app is running before linkerd is installed
- Run linkerd check --pre to confirm that everything looks good
- Run linkerd install --ha | kubectl apply -f - to install linkerd
- Run linkerd check and wait until the control plane is running
- Run kubectl rollout restart deploy -n booksapp
- Scale up with kubectl -n booksapp scale deploy webapp books --replicas=12
- Run kubectl get po -A -owide to confirm that linkerd-proxy-injector is running, find the node hosting a linkerd-proxy-injector pod, and execute sudo shutdown -h now on that node
In this case, I have one master and four worker nodes. After stopping the node, the system sat idle for the pod-eviction-timeout value, then the workloads were migrated. All the pods came up quickly and the booksapp pods were all injected with the proxy.
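For reference, that delay comes from the controller manager's pod eviction timeout; a rough way to inspect it on a kubeadm-style control plane (the label and flag names below are the standard kubeadm/controller-manager ones, shown as an assumption):

```sh
# The flag only appears if it was set explicitly; the default is 5m
kubectl -n kube-system get pod -l component=kube-controller-manager -o yaml \
  | grep pod-eviction-timeout

# For experiments only: shorten it by adding the flag to the
# kube-controller-manager static pod manifest on each master, e.g.
#   --pod-eviction-timeout=30s
```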
Another test that I ran while waiting for the pods to be evicted was to manually delete one of the books pods in the booksapp namespace. This pod was immediately migrated without any issues.
The apiserver logs had no mention of timeouts from any linkerd component, including the linkerd-proxy-injector.
Do you have any other validatingwebhookconfigurations or mutatingwebhookconfigurations that might be causing an issue?
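A quick way to check is something like this (plain kubectl; nothing here is Linkerd-specific):

```sh
# List every admission webhook configuration in the cluster
kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations

# Show each mutating webhook's failure policy
kubectl get mutatingwebhookconfigurations \
  -o custom-columns=NAME:.metadata.name,POLICY:.webhooks[*].failurePolicy
```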
Do you have any other controllers or operators installed?
I meant to ask this earlier: how did you install linkerd?
@falcoriss please update this issue when you have more information to add
@falcoriss fwiw, per #3721, can you check if your kube-system namespace is annotated with config.linkerd.io/admission-webhooks: disabled. See here for more info on the annotation.
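If it isn't, a sketch of how to add and verify the annotation (double-check the key against the docs for your Linkerd version):

```sh
# Exclude kube-system from the proxy injector's admission webhook
kubectl annotate namespace kube-system config.linkerd.io/admission-webhooks=disabled

# Confirm the annotation is present
kubectl get namespace kube-system -o jsonpath='{.metadata.annotations}'
```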
Alright, sorry it took me so long to come back to you guys, I've been really busy on other matters.
I did some more digging here testing out different scenarios and still can reproduce this issue.
I've tried on a single master + 3 workers with kubeadm on vSphere VMs, just like you @cpretzer, and ran into the issue.
I've tried again on local VirtualBox VMs with kubeadm; no better.
I've tried with kind too; same problem.
With kubeadm, I pause the VMs to simulate a crash.
With kind, I pause the container.
The setup is always the same: 1 master, 3 nodes, HA installed with 3 replicas of the proxy injector, 1 on each node. Linkerd version: 2.6
linkerd install --ha --controller-replicas=3 --skip-outbound-ports=25,587,3306,4222,6222,8086,8222,27017 |kubectl apply -f -
Sometimes it happens on the first or second node, but it usually happens on the last one.
And @ihcsim, I've tried with the annotation and it doesn't change anything.
If you guys need any more info to try this out, please tell me.
I can record a video of my actions if needed; it'd be easy and short.
@falcoriss thanks for taking the time to reproduce this.
When you say that you "freeze" the nodes, are you marking them as unschedulable? I will try to reproduce this following the steps that you outlined.
Did you check the logs of the pods that are not rescheduled? Or did you get the output from the kube-scheduler or kubelet on the node where the pods were not rescheduled?
It would also help to get the output from linkerd check when the pods are not rescheduled.
Thanks again for all your effort on this. It will help to get to root cause.
> When you say that you "freeze" the nodes, are you marking them as unschedulable? I will try to reproduce this following the steps that you outlined.
No, I pause them (docker pause, VM interruption) without cordoning them. The goal is to simulate an outage (which has happened to us and led us to discover this bug).
> Did you check the logs of the pods that are not rescheduled? Or did you get the output from the kube-scheduler or kubelet on the node where the pods were not rescheduled?
Pods are scheduled fine but come up unmeshed if "failurePolicy" is set to "Ignore"; there's nothing wrong about that in the scheduler logs.
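For reference, this is roughly how that failure policy can be inspected and flipped for testing (the webhook configuration name is the one from the original report; treat this as a sketch, not a recommendation):

```sh
# Inspect the current failure policy of the proxy injector webhook
kubectl get mutatingwebhookconfiguration linkerd-proxy-injector-webhook-config \
  -o jsonpath='{.webhooks[0].failurePolicy}{"\n"}'

# Flip it to Ignore so pod creation fails open (pods start, but unmeshed)
kubectl patch mutatingwebhookconfiguration linkerd-proxy-injector-webhook-config \
  --type=json \
  -p='[{"op":"replace","path":"/webhooks/0/failurePolicy","value":"Ignore"}]'
```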
> It would also help to get the output from linkerd check when the pods are not rescheduled.
Good question, I never thought of doing that while it's in the buggy state. It hangs as follows:
kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API
kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version
linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist
√ control plane PodSecurityPolicies exist
linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
√ controller pod is running
...
I'm not sure what we can do here from the linkerd side. It feels like this needs to be a k8s issue specifically. They've provided us tools that either fail transparently (e.g. no sidecar) or block pods from being created. We can provide some auditing tooling and alert when workloads should be injected but aren't. The only reliable solution is to inject --manual and not use the injector at all.
@grampelberg but the problem doesn't only happen with the proxy-injector; it also occurs with destination and web as far as I've tried. I have the impression that the whole HA setup doesn't work properly (the root cause could be k8s and not linkerd, that I don't know). I really invite you guys to test it with kind and emojivoto, as it's easy to set up.
I must be missing something then. Can you explain exactly what you're seeing again?
What I'm seeing is that linkerd HA misbehaves when a node goes nuts.
Simulating a high-load node going "not ready", I get weird linkerd control-plane behaviours (pods not being able to be meshed, destination unreachable, etc.).
To simulate that load, I use VMs or kind and pause the VM or container.
Here's how to reproduce it with "kind" to be fast (the problem is also present on VMs installed with kubeadm, but testing with kind is faster); a condensed script is sketched after the steps:
Get latest release for kind : https://kind.sigs.k8s.io/docs/user/quick-start/
Create a kind cluster file:
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker
- role: worker
- role: worker
Install the kind cluster:
kind create cluster --config cluster.yml --image kindest/node:v1.15.6 -v5
Install linkerd:
linkerd install --ha --controller-replicas=3 | kubectl apply -f -
Install emojivoto : https://linkerd.io/2/getting-started/
Mesh emojivoto
Browse the app, making sure everything works fine
Get the output of kubectl get pods -n emojivoto -o wide to know which pods are running on which nodes
Freeze the first node: docker pause kind-worker
Delete an emojivoto pod running on one of the other 2 nodes
Check if the pod gets rescheduled. If it is not, that's the problem I'm describing right there; if it is, move to the next step
Unpause the node: docker unpause kind-worker
Freeze the second node: docker pause kind-worker2
Delete an emojivoto pod running on one of the other 2 nodes
Check if the pod gets rescheduled. If it is not, that's the problem I'm describing right there; if it is, move to the next step
Unpause the node: docker unpause kind-worker2
Freeze the third node: docker pause kind-worker3
Delete an emojivoto pod running on one of the other 2 nodes
Check if the pod gets rescheduled. If it is not, that's the problem I'm describing right there; if it is, you managed to get something I could never get.
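To make that easier to rerun end to end, here's a condensed sketch of the same steps as a script (the emojivoto manifest URL is the one from the getting-started guide, and the pod name placeholder is illustrative):

```sh
#!/usr/bin/env bash
set -euo pipefail

# Build the test cluster (cluster.yml is the file shown above)
kind create cluster --config cluster.yml --image kindest/node:v1.15.6

# Install Linkerd in HA mode with 3 proxy-injector replicas and wait for it
linkerd install --ha --controller-replicas=3 | kubectl apply -f -
linkerd check

# Install and mesh emojivoto (manifest URL from the getting-started guide)
curl -sL https://run.linkerd.io/emojivoto.yml | linkerd inject - | kubectl apply -f -

# See which pods run on which nodes
kubectl get pods -n emojivoto -o wide

# Simulate the outage, then delete an emojivoto pod on a *different* node
docker pause kind-worker
kubectl delete pod -n emojivoto <pod-on-another-node>   # placeholder

# The bug: the pod is not recreated (failurePolicy: Fail) or comes
# back unmeshed (failurePolicy: Ignore)
kubectl get pods -n emojivoto -o wide
docker unpause kind-worker
```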
I've tried to be as clear as possible but if it still isn't please tell me.
Okay, so the root cause of all these problems is that the proxy injector is not running. Because it isn't running, k8s can't create pods. This is a guarantee to ensure that, when requested, pods always join the mesh. The behavior here is a k8s feature that we're using. You'll need to decide what you'd like to optimize for in your cluster. You've got three options:
- inject --manual
- install --ha
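For the first option, manual injection looks roughly like this (the manifest file name is illustrative):

```sh
# Inject the proxy at deploy time, bypassing the admission webhook entirely
linkerd inject --manual my-app.yml | kubectl apply -f -
```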
As this is all outside of Linkerd, there's not much that we can do to alter those options. Kubernetes has some bugs that make all this worse, which they're working on. I'll be closing this bug out, but if you happen to have any suggestions for ways to improve how it works, we'd love some feature requests!
The proxy injector issue here seems to be due to the k8s bug #80313 I mentioned. I've been able to try on 1.17 and couldn't reproduce the problem. I still have an issue however, and will open another issue for it specifically.
That's great to hear, thank you @falcoriss.
Bug Report
What is the issue?
It seems that HA doesn't work properly when a k8s node is lost (out of memory, high load, kernel panic, etc.). The issue seems to apply to the "destination" and "proxy injector" components, but could apply to others.
How can it be reproduced?
1 - Check which node a replica of the "proxy injector" component is running on and shut it down (a simple poweroff is enough). The pod will wait as long as the "pod-eviction-timeout" states before it gets tagged as unknown and gets rescheduled (default: 5m).
2 - During this time, the pod is tagged as "Ready: False" by k8s and the endpoints of the service get updated, but if you delete a meshed pod, it will not be rescheduled (because the default of linkerd-proxy-injector-webhook-config is "failurePolicy: Fail"). If you switch this value to "Ignore", the pod will be scheduled but unmeshed.
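A sketch of how to watch those two stages as they happen (resource names are the ones mentioned in this report; the placeholders are illustrative):

```sh
# Watch the node go NotReady and the eviction timer run down
kubectl get nodes -w

# Watch the webhook's endpoints: the address on the dead node moves to
# notReadyAddresses well before the pod itself is evicted
kubectl get endpoints -n linkerd linkerd-proxy-injector -o yaml -w

# While the node is down, delete a meshed pod and watch what happens
kubectl delete pod -n <app-namespace> <meshed-pod>
kubectl get pods -n <app-namespace> -o wide -w
```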
Logs, error output, etc
[linkerd-identity-6758697995-m2w2q linkerd-proxy] WARN [ 2477.529728s] linkerd-dst.linkerd.svc.k8s-predev.bee-labs.net:8086 linkerd2_reconnect::service Service failed error=channel closed
the-deleted-replica-74558f4547-rdzbl 0/1 Running 0 37s 10.233.113.238 sucv-k8s-worker04.predev.bee-labs.net
the-not-deleted-replica-74558f4547-sljr7 2/2 Running 0 13m 10.233.94.198 sucv-k8s-worker05.predev.bee-labs.net
linkerd check output
It usually is all green, but this happens because the node it's trying to reach is the one I shut down. After I boot it back up:
But all pods are up :
Environment
Possible solution
Additional context