k3s-io / k3s

Lightweight Kubernetes
https://k3s.io
Apache License 2.0

Node stuck in NotReady status after temporary interface disconnect #4168

Closed: mjrist closed this issue 2 years ago

mjrist commented 3 years ago

Environmental Info:
K3s Version: v1.22.2+k3s1 (10bca343), go version go1.16.8

Node(s) CPU architecture, OS, and Version:
Linux somehostname1 3.10.0-1160.el7.x86_64 #1 SMP Tue Aug 18 14:50:17 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux
Linux somehostname2 3.10.0-1160.el7.x86_64 #1 SMP Tue Aug 18 14:50:17 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux
Linux somehostname3 3.10.0-1160.el7.x86_64 #1 SMP Tue Aug 18 14:50:17 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration: 3-node HA cluster with embedded etcd.

NAME               STATUS   ROLES                       AGE     VERSION
somehostname1   Ready    control-plane,etcd,master   5m      v1.22.2+k3s1
somehostname2   Ready    control-plane,etcd,master   4m      v1.22.2+k3s1
somehostname3   Ready    control-plane,etcd,master   3m47s   v1.22.2+k3s1

Nothing is deployed on the cluster except coredns and local-path-provisioner.

Describe the bug: The node is stuck in NotReady status after temporarily disconnecting its interface and reconnecting it.

Steps To Reproduce: Set up a three-node HA cluster with embedded etcd using k3s version v1.22.2+k3s1.

  1. On one node, bring the interface down: ifdown $interface
  2. Wait until the node goes into NotReady status, as determined by watching kubectl get nodes from one of the remaining two nodes.
  3. Bring the interface back up: ifup $interface

The node does not return to Ready, although it appears to be reconnected.
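Condensed into shell form (run the ifdown/ifup on the node under test and the watch from one of the other two nodes; $interface stands in for whichever NIC carries cluster traffic):

ifdown $interface          # 1. take the cluster-facing interface down
watch kubectl get nodes    # 2. from another node, wait for the node to show NotReady
ifup $interface            # 3. bring the interface back up - the node stays NotReady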

Notes:

kubectl describe node on the "failed" node shows:

Conditions:
  Type             Status    LastHeartbeatTime                 LastTransitionTime                Reason              Message
  ----             ------    -----------------                 ------------------                ------              -------
  MemoryPressure   Unknown   Wed, 06 Oct 2021 14:23:52 -0700   Wed, 06 Oct 2021 14:24:52 -0700   NodeStatusUnknown   Kubelet stopped posting node status.
  DiskPressure     Unknown   Wed, 06 Oct 2021 14:23:52 -0700   Wed, 06 Oct 2021 14:24:52 -0700   NodeStatusUnknown   Kubelet stopped posting node status.
  PIDPressure      Unknown   Wed, 06 Oct 2021 14:23:52 -0700   Wed, 06 Oct 2021 14:24:52 -0700   NodeStatusUnknown   Kubelet stopped posting node status.
  Ready            Unknown   Wed, 06 Oct 2021 14:23:52 -0700   Wed, 06 Oct 2021 14:24:52 -0700   NodeStatusUnknown   Kubelet stopped posting node status.

Expected behavior: When the interface is reconnected the node should return to Ready status.

brandond commented 3 years ago

I see the status, but what do the logs say?

mjrist commented 3 years ago

Here are the logs. Node 1 (somehostname1, 10.200.50.101) was disconnected at Oct 06 14:23:55 and brought back up at Oct 06 14:25:55: node1.txt node2.txt node3.txt

mjrist commented 3 years ago

Here are the etcd IDs as well (to make sense of the logs).

+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|          ENDPOINT           |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://10.200.50.102:2379  | 451d87459933f2d3 |   3.5.0 |  5.5 MB |      true |      false |         3 |      14640 |              14640 |        |
| https://10.200.50.101:2379  | a1efacc681729fed |   3.5.0 |  5.5 MB |     false |      false |         3 |      14640 |              14640 |        |
| https://10.200.50.103:2379  | e419f44c92cc533a |   3.5.0 |  5.4 MB |     false |      false |         3 |      14640 |              14640 |        |
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

While the interface was disabled, etcd leadership passed from 10.200.50.101 to 10.200.50.102.

brandond commented 3 years ago

I see that you're using the ipsec flannel backend. Do you see the same issue with a different backend?

brandond commented 3 years ago

Hmm, it should have exited after the Fatal error here, but for some reason it didn't. I'll have to see if I can figure out why.

Oct 06 14:24:06 somehostname1 k3s[2548]: E1006 14:24:06.907559    2548 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}: context canceled
Oct 06 14:24:06 somehostname1 k3s[2548]: I1006 14:24:06.907564    2548 leaderelection.go:283] failed to renew lease kube-system/kube-controller-manager: timed out waiting for the condition
Oct 06 14:24:06 somehostname1 k3s[2548]: E1006 14:24:06.907597    2548 writers.go:117] apiserver was unable to write a JSON response: http: Handler timeout
Oct 06 14:24:06 somehostname1 k3s[2548]: I1006 14:24:06.907609    2548 deployment_controller.go:165] "Shutting down controller" controller="deployment"
Oct 06 14:24:06 somehostname1 k3s[2548]: F1006 14:24:06.907616    2548 controllermanager.go:287] leaderelection lost

Edit: I think this is because we were not unlocking the log mutex before panicking. This should be fixed by https://github.com/k3s-io/klog/commit/92394dadeec2079ad3102572fc8abf0a885cdae6

mjrist commented 3 years ago

I see that you're using the ipsec flannel backend. Do you see the same issue with a different backend?

I tested with the default vxlan backend as well - same issue. (I noticed the ipsec-related log 'CHILD_SA rekeying failed, trying again...' It seems to be unrelated, since the issue of concern occurs with vxlan as well. Also, I don't always see that ipsec log.)

Hmm, it should have exited after the Fatal error here, but for some reason it didn't. I'll have to see if I can figure out why.

Is the exit by design? I noticed it does this in v1.21.4.

Here is a little more context which might be helpful here. We are trying to achieve higher availability and greater stability during failover scenarios. I won't go into all the details why, but suffice it to say we are constrained to use only 3 nodes in an HA setup and to run the workload on the control-plane nodes. So if I lose the etcd leader node, I need it to re-elect a new leader, report the node as NotReady as soon as possible, and reschedule my pods. To achieve this I've configured the grace periods, eviction timeouts, etc. Furthermore, when a node comes back online, ideally we don't want it to cause disruptions.
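For concreteness, the kind of tuning this involves looks roughly like the following when passed through k3s server (the flag values here are purely illustrative, not our actual settings):

k3s server \
  --kubelet-arg=node-status-update-frequency=4s \
  --kube-controller-manager-arg=node-monitor-period=4s \
  --kube-controller-manager-arg=node-monitor-grace-period=30s \
  --kube-apiserver-arg=default-not-ready-toleration-seconds=30 \
  --kube-apiserver-arg=default-unreachable-toleration-seconds=30

The controller-manager grace period governs how quickly a silent kubelet is marked NotReady, and the default toleration seconds govern how long pods stay bound to a NotReady/unreachable node before taint-based eviction reschedules them.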

I found that in etcd 3.4 (using k3s v1.21.4) the rejoining node forces an election, which in turn causes a leader election loss and a k3s restart on one or more nodes. In addition to degraded availability, this sometimes causes other nodes to temporarily switch into NotReady status. I also saw this happen often simply during node loss, which causes the failover to exceed the expected duration.

We upgraded to k3s v1.22.2 to get etcd 3.5. The pre-vote feature prevents the disruption on rejoin. Overall, I have found etcd 3.5 to be much more stable - elections happen quickly, flaky nodes don't cause disruption, and overall failover works much better.

I wondered if the restarts on leader election loss in v1.21.4 were somehow "masking over" the issue - on node rejoin it always deposed the current leader and caused a k3s restart on at least one node. I assumed that this was no longer supposed to happen.

That said, because the etcd 3.5 pre-vote feature ensures that leader election loss no longer happens on rejoin, I don't think it will hurt our case if k3s does restart. In that case, it would just be the failed node (the previous leader) that restarts.

brandond commented 3 years ago

K3s's general approach to failure handling is to exit so that we can be restarted by the init system (systemd) and ensure that everything comes up cleanly. This is partially intentional, as it reduces how much error handling and correction we need to worry about, and partially unintentional, as the upstream Kubernetes code calls klog.Fatal all over the place when something unexpected happens, and that in turn calls os.Exit, which can't be trapped.

We recently made an effort to improve error handling by forking klog to call panic instead of printing a stack trace and exiting, which lets us shut down more cleanly. Unfortunately it looks like for some reason the Fatal log message is being printed, but the panic isn't occurring, which means that we don't shut down. I'll need to dig into that to figure out why.

We don't manage etcd's leader election stuff; we mostly just manage cluster membership and let etcd do what it will. If the cluster experiences latency problems on disk or between nodes, that can in turn cause the Kubernetes controller lease renewal to time out, which constitutes a fatal error as far as Kubernetes is concerned.

mjrist commented 3 years ago

Okay, the exit makes sense to me - thanks for explaining.

I ran some more tests and confirmed some things - which you may already know - I'm catching up :)

If I disconnect non-etcd-leader nodes (in which case there is no leader change and no "leaderelection lost"/fatal error on any node during failover or recovery), everything recovers fine.

The problem arises only when the fatal error occurs (leaderelection lost) on the node that went down. Furthermore, if I manually restart k3s before the node comes back online it recovers fine when it rejoins.

So all that just confirms what you are saying - fix the panic and restart and it should work.

By the way, thanks for your fast responses on this.

brandond commented 3 years ago

You might note that the etcd cluster leader and Kubernetes controller leaders are separate things. Since both etcd and Kubernetes come up together, they usually end up on the first node initially - but it's possible for one node to become the etcd leader, and another node to become leader for one or more of the Kubernetes controllers when that first node goes down and leadership is picked up by another node. Unlike etcd leadership which is voted on between members, Kubernetes controller leadership is taken over by whichever node tries first to take the lease when the previous node's hold on it expires. If k3s restarts quickly, it is possible for the old node to retain its hold on the controller leases even though the etcd leadership transferred elsewhere.

mjrist commented 3 years ago

That makes sense. This raises a few questions in my mind.

I've been using etcdctl to tell which node is the etcd leader. How do I tell which node has controller leadership?

Correct me if I'm wrong here: to acquire the controller lease, I'm assuming etcd must be available. For etcd to be available there must be an etcd leader. So if the etcd leader and Kubernetes controller leader happen to be the same node, and I lose that node, two things must happen to recover: first a new etcd leader must be elected, and then another node must acquire the controller lease.

On the other hand, if the etcd leader and controller leader are not the same node and the etcd leader fails, only a new etcd leader must be elected. Presumably, the failover is quicker in this scenario. Is the difference in failover time between the two scenarios appreciable?

If there is a significant difference, is there a way to ensure those responsibilities fall on different nodes? I'm guessing there isn't, and normally you wouldn't care (I think our use case is a little unique - if we weren't running workload on the control-plane we wouldn't care as much about these failover times).

brandond commented 3 years ago

The controllers aren't necessary for the apiserver to be available; the leases are essentially just mutexes to make sure that there is only a single copy of each core Kubernetes controller active in the cluster. The apiserver is mostly passive - at its core it is just a web server that translates the various API resource types to/from JSON stored in etcd. There can be many of them accepting writes at the same time.

All the real work of ensuring cluster state is handled by the controllers, which are clients of the apiserver, and for the most part there should only be one of each controller active at a time. The copies running on other nodes just sit there periodically polling the lease to see if they should take over. You can see where they are via kubectl get lease -A.
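For example, on a server node (the etcd certificate paths below assume the default k3s data directory), something like:

kubectl get lease -n kube-system        # the HOLDER column shows which node owns each controller lease

ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/server-client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/server-client.key \
  endpoint status --cluster -w table    # the IS LEADER column shows the etcd leader

The two are tracked independently, so they won't necessarily end up on the same node.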

What failover are you talking about? What problem are you trying to solve? Keep in mind that Kubernetes is built around eventual consistency - if you're trying to get pods rescheduled within seconds when a node goes down, that's probably going to be quite difficult to achieve. For that sort of HA, the "Kubernetes native" way to handle it would be to distribute the work between multiple nodes, with an external load balancer ensuring that traffic is only sent to healthy pods.

mjrist commented 3 years ago

Good explanation. Thanks.

What failover are you talking about? What problem are you trying to solve? Keep in mind that Kubernetes is built around eventual consistency - if you're trying to get pods rescheduled within seconds when a node goes down, that's probably going to be quite difficult to achieve. For that sort of HA, the "Kubernetes native" way to handle it would be to distribute the work between multiple nodes, with an external load balancer ensuring that traffic is only sent to healthy pods.

By failover I'm talking about the time to detect node failure and spin up the "lost" pods on a new node. You're absolutely right though - Kubernetes wasn't designed for rapid failover, and it isn't going to happen in seconds. Even if it were designed for this, it still wouldn't give you 100% availability. Ultimately, we need to build redundancy into the app like you are saying - we're just not quite there yet, but we will get there eventually. For now I can tolerate around 60 seconds of downtime.

In any case, with v1.22.2+k3s1 (and etcd 3.5) it is good enough to meet our requirements for now. I'll just need the "failed" node to re-enter Ready status when it comes back online, which sounds like it will work by fixing the panic and restart.

mjrist commented 3 years ago

Quick question along the lines of my earlier question, but now with better understanding due to your explanations.

I have our node grace period and other settings configured such that during disconnect of a "non-controller-manager" node, it goes into NotReady status within approximately 30 seconds - this is great.

On the other hand, if I disconnect the controller-manager node, I wait a little over 15 seconds for the kube-controller-manager lease to be taken by another node, and then the 30 seconds for my node to go into NotReady status. Is there any way to avoid the 15-second delay? (I know I can decrease the lease duration.) In my case, I would prefer that the grace period not be reset by the controller move (if in fact my understanding of this is correct and that is what is happening).

I thought perhaps this could be achieved via the node Lease objects - however, all I could gather from the Kubernetes documentation (https://kubernetes.io/docs/concepts/architecture/nodes/#heartbeats) is that the lease updates are independent of status and therefore would have no effect on entering the NotReady state and subsequent pod rescheduling. I don't really understand the relationship between the two heartbeat methods and why they both exist.

I know this is not directly related to the "restart" bug under discussion. But I was hoping someone could point me in the right direction here. (Also, I know I'm trying to optimize something you wouldn't normally care about.)

brandond commented 3 years ago

if I disconnect the controller-manager node, I wait a little over 15 seconds for the kube-controller-manager lease to be taken by another node, and then the 30 seconds for my node to go into NotReady status. Is there any way to avoid the 15-second delay?

I don't believe the controller failover resets the timer, as it's not actually a timer as far as I know. The kubelet updates the status on both its Node object and the Lease object matching its Node name. That is what the controller looks at to determine whether or not the kubelet is healthy. This is documented at https://kubernetes.io/docs/concepts/architecture/nodes/#heartbeats
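Both kinds of lease can be inspected directly if you want to watch them; the kubelet heartbeat leases live in their own namespace, separate from the controller leader-election leases:

kubectl get lease -n kube-node-lease                                # per-node kubelet heartbeat leases
kubectl get lease -n kube-system kube-controller-manager -o yaml    # controller leader-election lease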

And no, I'm not aware of any way to shorten the takeover delay. I'm not sure if the controller lease duration is configurable, but lease updates are already one of the more chatty things that occur in a cluster at steady state, so I wouldn't want to reduce it much lower.
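(For completeness: the upstream knobs here are kube-controller-manager's leader-election flags, which k3s can pass through. The values below are illustrative only; the upstream defaults are a 15s lease duration, 10s renew deadline, and 2s retry period.)

k3s server \
  --kube-controller-manager-arg=leader-elect-lease-duration=10s \
  --kube-controller-manager-arg=leader-elect-renew-deadline=8s \
  --kube-controller-manager-arg=leader-elect-retry-period=2s

Shortening these trades a faster takeover for more frequent lease renewals against the apiserver and etcd.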

I think there's also some stuff in the works upstream for graceful controller handoff in the case of a planned shutdown, but that won't help you any for the failure case you're working on.

mjrist commented 3 years ago

Good information thanks.

Found this in my logs when the controller manager is picked up by a new node during failover.

node_lifecycle_controller.go:1013] Missing timestamp for Node somehostname1. Assuming now as a timestamp.
node_lifecycle_controller.go:1013] Missing timestamp for Node somehostname2. Assuming now as a timestamp.
node_lifecycle_controller.go:1013] Missing timestamp for Node somehostname3. Assuming now as a timestamp.

and I gave node_lifecycle_controller.go a look

if nodeHealth == nil {
    klog.Warningf("Missing timestamp for Node %s. Assuming now as a timestamp.", node.Name)
    nodeHealth = &nodeHealthData{
        status:                   &node.Status,
        probeTimestamp:           nc.now(),
        readyTransitionTimestamp: nc.now(),
    }
}

My Go is not the greatest, but here is what I gathered from the rest of the code. The node controller just stores the probe times in memory (in the node health objects in a map). On spin-up it creates a new map and subsequently sets the last probe times to the current time, which of course has the effect of resetting t0 for the grace period.

I suppose it's cheaper this way than constantly persisting and sharing the last probe time to account for a controller handoff.

In any case, the failover times should be good enough for now.

brandond commented 3 years ago

Right, nodeHealth is nil on startup because the nodeHealthMap hasn't been populated yet. The timestamps will get updated almost immediately as the Node watch sees the initial nodeLister.List results come in via the code path in monitorNodeHealth. The Kubernetes code has a ton of parallel goroutines and, as I said earlier, an eventual consistency model, so not everything is in its final state at startup.

rancher-max commented 2 years ago

Performed the same steps mentioned in https://github.com/k3s-io/k3s/issues/4239#issuecomment-956933405 using master branch commit 702fe24afe3f08f96ebc167313c20c0339a5510f