metallb / metallb

A network load-balancer implementation for Kubernetes using standard routing protocols
https://metallb.universe.tf
Apache License 2.0

Failover time very high in layer2 mode #298

Closed ghaering closed 4 years ago

ghaering commented 6 years ago

Is this a bug report or a feature request?:

Both, probably.

What happened:

I tested with layer 2 mode and simulated node failure by shutting down the node that the load balancer IP was on. What then happened is that it took approx. 5 minutes for the IP to be switched to the other node in the cluster (1 master, 2 nodes). After much experimentation I came to the conclusion that the node being down and "NotReady" did not initiate the switch of the IP address. The 5 minute delay seems to be caused by the default pod eviction timeout of Kubernetes, which is 5 minutes; that is how long it takes for a pod on an unavailable node to be deleted. The default node monitor grace period is 40 seconds, btw. So with the default configuration it currently takes almost 6 minutes for an IP address to be switched.

I made things a lot better by decreasing both settings like this:
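(Roughly, based on the kube-controller-manager values mentioned later in this thread; the grace-period value shown here is an assumption:)

--node-monitor-grace-period=20s    (default: 40s)
--pod-eviction-timeout=20s         (default: 5m0s)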

This makes MetalLB switch the IP in case of node failure in the sub-minute range.

What you expected to happen:

To be honest, I would expect the whole process to take maybe 5 seconds at most.

How to reproduce it (as minimally and precisely as possible):

Create a Kubernetes 1.11.1 cluster with kubeadm (single master, two nodes). Calico networking.

kubectl apply -f https://raw.githubusercontent.com/google/metallb/v0.7.2/manifests/metallb.yaml
kubectl apply -f metallb-cfg.yml
kubectl apply -f tutorial-2.yaml

➜ metallb-test cat metallb-cfg.yml
apiVersion: v1
kind: ConfigMap
metadata:
  namespace: metallb-system
  name: config
data:
  config: |
    address-pools:

Then

watch curl --connect-timeout 1 http://10.115.195.206

to see if the nginx app is reachable.

Then

kubectl logs -f --namespace metallb-system speaker-xxxxxxxxx

to see which node has the IP address assigned at the moment. Then ssh into that machine and run "poweroff".

Wait for how long it takes until the "watch curl" is successful again.

Anything else we need to know?:

Environment:

danderson commented 6 years ago

This is definitely a bug. In MetalLB 0.6, failover time was at max 10 seconds, because we had explicit leadership elections and so we could detect node failure much faster. In 0.7, for scaling, I switched to trusting k8s to tell us the state of the cluster. Unfortunately, I assumed that it was way better than it actually is :(

So, failover time for layer2 is definitely unacceptable in 0.7. Node failures should recover in seconds, not minutes. Now, how can we fix that?...

The obvious choice is to reintroduce leader election. But, to make 0.7's features work, the leader election now has to be per-service. That means an amount of k8s control plane traffic that grows linearly with the number of LB services, more CPU consumption to manage the leadership state, and probably a less even distribution of services across nodes, because open leadership races tend to be won by whichever speaker responds first.

For reference, if we set the leader election timeout to 10s, that means we should ping the object every ~5s to renew our lease, so that's 0.2 qps of k8s control plane traffic for every LoadBalancer service in the cluster (e.g. 500 services would add ~100 qps). That can get huge pretty quickly :/

Another option would be to maintain a separate "healthy speakers" object of some kind, where each speaker pings its liveness periodically. Then each speaker could still make stateless leadership decisions, and just filter the list of potential pods based on which speakers are alive. This is still leader election, but now the qps cost is O(nodes) instead of O(services). I don't think this works either, though: many clusters have way more nodes than LB services.

Either way, we need to resolve this, because 0.7 has made the failover time of layer2 mode unacceptable.

danderson commented 6 years ago

Okay, one more possibility: make the speakers explicitly probe each other for health. We bypass the control plane completely, the speakers just probe each other in a mesh so that they know exactly who is healthy and who is not. Then they can use that information to filter the list of "feasible" pods when they're calculating who is the leader.

O(n^2) network traffic, but it's very light network traffic compared to the cost of talking to the k8s control plane. There's still the risk of split brain in this setup, because there could be a network partition that triggers different health visibility in different sections of the cluster.

danderson commented 6 years ago

Okay, one more idea: if the node's transition from Ready to NotReady happens pretty quickly, we can just make all speakers monitor all nodes, and only allow pods on Ready nodes to be considered for leadership.

That's the quickest fix in terms of the current codebase, but from your description that will only lower the failover time to ~40s, which is still high (although better than 6min). @ghaering , do I understand correctly that it takes 40s before the node status becomes NotReady?
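For illustration, the readiness check itself is simple if the speakers watch Node objects; a sketch (not actual MetalLB code):

package speaker

import v1 "k8s.io/api/core/v1"

// nodeReady reports whether a Node's Ready condition is currently True.
// A speaker could exclude endpoints whose node fails this check when picking the announcer.
func nodeReady(node *v1.Node) bool {
	for _, cond := range node.Status.Conditions {
		if cond.Type == v1.NodeReady {
			return cond.Status == v1.ConditionTrue
		}
	}
	return false
}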

danderson commented 6 years ago

Oh, and one more idea: change the architecture of MetalLB, and make the controller responsible for selecting an announcing node for each service, instead of making the speakers do it. The controller has a more global view of the world, and it could healthcheck the speakers explicitly to help it make decisions. As a bonus, it would make it much easier to have a status page on the controller that shows the full state of the cluster... Possibly. Need to think about that, it's a huge change to MetalLB.

mxey commented 6 years ago

MetalLB 0.7 uses endpoint readiness, which is based on pod readiness. The Kubernetes node lifecycle controller marks Pods as no longer ready if their Node becomes NotReady. Waiting for the pod eviction timeout (default 5 minutes) should not be necessary. I am not sure why that did not work. @ghaering maybe you can try increasing the logging of the kube-controller-manager to see how it reacts to node readiness.

I think the MetalLB behavior makes sense. It matches how kube-proxy works. MetalLB doing faster IP failover does not help much if the service endpoints are partly dead but still in the load-balancing pool.

If 40 seconds is too high, it's possible to tweak the node monitor periods. That increases the load on the API server and etcd, though it's feasible. There is ongoing work to make the node heartbeats cheaper.
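One way to observe that mechanism while reproducing the failure is to watch the Endpoints object of the exposed service (the service name here is assumed from the tutorial deployment):

kubectl get endpoints nginx -o yaml -w

When the node lifecycle controller marks the pods unready, their addresses move from subsets[].addresses to subsets[].notReadyAddresses, which is the signal MetalLB 0.7 acts on, well before pod eviction kicks in.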

mxey commented 6 years ago

Oh, and one more idea: change the architecture of MetalLB, and make the controller responsible for selecting an announcing node for each service, instead of making the speakers do it. The controller has a more global view of the world, and it could healthcheck the speakers explicitly to help it make decisions.

This would mean you have to run multiple controllers with leader election, because otherwise if you lose the node the controller runs on, you have to wait 40s + 5m for the controller pod to be evicted.

As a bonus, it would make it much easier to have a status page on the controller that shows the full state of the cluster... Possibly.

I think that's better done by collecting the Prometheus metrics from all the speakers. I have set up a simple Grafana dashboard that shows me which node is advertising which service IP.

ghaering commented 6 years ago

@danderson Yes, by default it should take around 40 secs to detect unhealthy nodes. The relevant setting is https://kubernetes.io/docs/reference/command-line-tools-reference/kube-controller-manager/#options

--node-monitor-grace-period duration     Default: 40s
  Amount of time which we allow running Node to be unresponsive before marking it unhealthy. Must be N times more than kubelet's nodeStatusUpdateFrequency, where N means number of retries allowed for kubelet to post node status.

That's also about the time I experienced.

ghaering commented 6 years ago

The MetalLB controller only runs once, and from what I saw yesterday it takes 5 minutes (!) per default until a pod in "unreachable node" state is deleted and scheduled on another node. That was the other setting I mentioned above that I changed from the default 5 minutes to 20 seconds (--pod-eviction-timeout=20s). I don't think depending on Kubernetes itself is a promising approach. OTOH, implementing leader election yourself is a PITA. Could we require that there are initially at least three nodes and borrow/steal existing leader election code, like what software such as etcd must have implemented?

mxey commented 6 years ago

The MetalLB controller only runs once, and from what I saw yesterday it takes 5 minutes (!) per default until a pod in "unreachable node" state is deleted and scheduled on another node.

It's true that pod eviction by default starts after 5 minutes. But that does not affect the MetalLB speaker failover.

danderson commented 6 years ago

So, this is what I don't understand: in your original report, you said it took 5min for the IP failover to happen. But, given the design of MetalLB 0.7, the failover should happen after ~40s, when the failed node becomes unhealthy. Pod eviction takes longer, but MetalLB manages failover based on service endpoint health, not pod health. When the node becomes NotReady, the endpoint should move into notReadyAddresses, and that's when failover should happen.

Can you confirm that it took >5min for the failover to happen? If so, that could indicate a bug in the leader selection code, where it's not ignoring unready pods. That would be bad :).

If the failover does happen in 40 seconds... Then it's mostly "working as intended", because we're trusting k8s to tell us if the service is healthy. But, I still think 40s is quite long, and I'd like to think about ways to make the failover faster.

It would be much easier if I also controlled the kube-proxy dataplane layer, because then I could confidently implement my own health probing and fully reprogram the entire dataplane as needed. Unfortunately, by design I'm trying to cooperate with Kubernetes... And so we end up with this problem :/

ghaering commented 6 years ago

Ok, I reran the test. Here are the results.

#!/usr/bin/python -u
import requests
import time

def test():
    # Returns True if the service VIP answers within 500 ms.
    try:
        requests.get("http://10.115.195.206", timeout=0.5)
    except requests.exceptions.RequestException:
        return False
    return True

if __name__ == "__main__":
    while 1:
        result = test()
        print time.time(), result
        time.sleep(1)

I started the script, prepared fetching the logs, then shut down the node where the load balancer IP pointed to.

Results of the HTTP test script

1534092121.95 True
...
1534092293.57 True
1534092294.68 True
1534092295.8 True
1534092296.92 True
1534092297.97 False
1534092299.48 False
...
1534092603.64 False
1534092605.14 False
1534092606.65 False
1534092607.76 True
1534092608.87 True
1534092610.08 True

➜ metallb-test python -c "print 1534092607.76 - 1534092297.97"
309.789999962

ca. 300 seconds => 5 minutes

I also have the speaker logs, but the timestamps are off compared to the test script (UTC vs. local time).

The node was unhealthy in the expected time, but failover took 5 minutes.

danderson commented 6 years ago

Thanks for the repro. That's definitely way too long.

I've looked at the code that does leader selection, and it seems to be correct: as soon as an endpoint transitions into notReadyAddresses, it gets removed from the eligible set and failover should take place.

I'm going to have to reproduce this locally on my cluster and dig into Kubernetes's behavior some more. Something is definitely wrong somewhere.

mxey commented 6 years ago

The speaker logs would be useful to see whether the speaker did not fail over or whether there was an issue with the ARP cache on the other side.

ghaering commented 6 years ago

Ok, I'll try to rerun the experiment and attach both speaker logs completely. I destroyed the cluster in the meantime.

ghaering commented 6 years ago

I'm attaching the original speaker log of the two pods. Please tell me if I can provide any further useful information. I can rerun the experiment anytime with new clusters that I can create.

speaker-l65vf.log speaker-kk248.log

burnechr commented 6 years ago

The 5 minutes seemed way longer than what I had seen in the past, so I did some testing myself as well. I was able to reproduce both the 5m failover result and the 40s result. Env is 2 nodes/minions and a 2-replica deployment.

Test 1: K8s has deployed 1 pod on each node (as expected). Node 1 is serving the MetalLB IP for the service. When I power off node 1 in this test, it takes ~40s (40.1s) for the MetalLB IP to fail over from node 1 to node 2, at which point traffic starts to succeed again.

Test 2: <about 10 minutes later> As an artifact of the previous test case, both pods of my deployment are on node 2, as well as the MetalLB IP. I bring up node 1; traffic is still hitting node 2. I fail node 2, and it takes ~5m (306s) for the cluster to move the pods from node 2 to node 1 and for traffic to start succeeding.

So I am in agreement with @danderson that the 40s is too long, and it would be great if we can come up with a faster way to initiate the failover. To the 5m report, in my testing the 5m seems to be when there is either a single replica or all replicas are on the same node when the failure test is introduced. Can @ghaering confirm, or tell me I am way off base and missed something in my testing (my feelings wouldn't be hurt lol).

mxey commented 6 years ago

To the 5m report, in my testing the 5m seems to be when there is either a single replica or all replicas are on the same node when the failure test is introduced.

That makes sense. Since there are no endpoints on node 1, MetalLB will not move the IP over to node 1. MetalLB < 0.7 would have moved the IP to another node, but there still would not have been any available endpoints behind it for another 5 minutes.

danderson commented 6 years ago

Aah. Okay, that does make sense. If we lose the node with all pods on it, the result is that all endpoints become unready, and there's nowhere to fail over to until k8s reschedules the pods - which happens only after 5min once it finally completely times out the node.

So, the lesson I'm learning from that is: the documentation should emphasize node anti-affinities more, to make sure that replicas are spread across nodes to avoid this class of failures.

Separately, yes, 40s is still too long, and I don't want to require reconfiguration of core k8s control plane settings to make MetalLB viable. So, we still need something better.

I'm not happy with any of the options I suggested so far, all of them are either very expensive, or complex (== bugs), or both. If anyone has more ideas, I'd love to hear them.

ghaering commented 6 years ago

@burnechr In my test case I used a single-replica nginx deployment that was exposed through MetalLB. Does that answer your question?

danderson commented 6 years ago

Yeah, if you have a single replica, what you're seeing is mostly the fact that Kubernetes takes a long time to reschedule workloads on dead nodes. If you switch to 2 replicas, and use node anti-affinities to put the pods on different nodes, failover should happen in ~40s. Still too long, but 10x better :)
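For reference, spreading two replicas across nodes can be done with a pod anti-affinity along these lines (deployment name, image, and the app label are illustrative, not taken from this thread):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      affinity:
        podAntiAffinity:
          # never co-locate two replicas on the same node
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: nginx
            topologyKey: kubernetes.io/hostname
      containers:
      - name: nginx
        image: nginx:1.15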

johnarok commented 5 years ago

I observed the 5 minute failover time even though the pods were distributed across 2 nodes. When the node goes NotReady (induced with init 6), it takes 5 minutes for the IP to move over. Is that expected?

juliohm1978 commented 5 years ago

@johnarok

That is the experience I had with metallb. In my case, even a failover of 5 seconds would not be acceptable. Our current keepalived setup provides that in less than a second.

The failover delay is the main reason we haven't switched to metallb.

omenv commented 5 years ago

@johnarok @juliohm1978 I have the same situation: when a node goes down it takes around 20 seconds for failover. So we have to use keepalived, because it provides much faster failover.

juliohm1978 commented 5 years ago

Dropping by to keep this thread alive. Any progress on this?

I have great interest in using metallb in our bare-metal k8s, but the failover time is the only requirement keeping us on keepalived.

Would it make sense to implement a highly available controller with leader election independent from the k8s api? In my head, making metallb depend on k8s to check for service readiness does not amount to a robust load-balancing solution. The assumption there is that there is no need to expose a public IP if the service does not have any ready endpoints.

In my experience, that is mostly an edge case, where your entire deployment of pods suddenly becomes unavailable. For the everyday load-balancing requirement, the service will have a number of pods always available while some nodes/endpoints come and go randomly.

As I understand it, metallb should work independently and rely mostly on node availability; service endpoints should be a secondary concern. The idea of a virtual IP floating around cluster nodes comes from the need to keep this IP highly available and resilient to node failure.

andrelop-zz commented 5 years ago

@johnarok

That is the experience I had with metallb. In my case, even a failover of 5 seconds would not be acceptable. Our current keepalived setup provides that in less than a second.

The failover delay is the main reason we haven't switched to metallb.

Do you happen to still use keepalived?

We are considering using it as we are having the same issue (very high failover time) in layer2 mode with MetalLB.

We found the keepalived-vip project, but it seems to be retired. Are you using some private solution or something like keepalived-vip?

Could you please share some more details? @omenv, do you also have more details to share regarding your usage of keepalived?

juliohm1978 commented 5 years ago

@andrelop

I developed easy-keepalived as a prototype a few months back.

My team adopted it as a baseline. Internally, we evolved it into a Python implementation and a more recent keepalived version. The only drawback to consider in terms of security is that, to provide public external IPs to users outside a Kubernetes cluster, you need to run the deployment on the host network.

The idea behind easy-keepalived is that you can use a number of nodes in your cluster to keep a group of virtual IPs highly available to the outside world. It uses a simplified YAML file to configure a keepalived cluster for full failover and load balancing. We use that with nginx-ingress-controller, also bound to the host network.

Feel free to fork it and use it as a starting point.

Raboo commented 5 years ago

So, as a current workaround, what k8s control plane settings does one need to tune in order to get the 40s down to something like 5?

andrelop-zz commented 5 years ago

Hello,

So, as a current workaround, what k8s control plane settings does one need to tune in order to get the 40s down to something like 5?

I don't think 5s is currently possible. At https://metallb.universe.tf/concepts/layer2/ we can read:

If the leader node fails for some reason, failover is automatic: the old leader’s lease times out after 10 seconds, at which point another node becomes the leader and takes over ownership of the service IP.

So I think 10s is currently the minimum failover time that can be expected. I was able to achieve 10~11s by also setting the controller-manager's --node-monitor-grace-period.

As failover time was critical for me I ended up using easy-keepalived from @juliohm1978. I was already familiar with keepalived as we use it internally for a bunch of other services (non-k8s managed/related services).

teralype commented 5 years ago

Based on some experiments in my Kubernetes cluster (CoreOS, Kubespray), fail-over time when the elected leader is drained and shut down is never less than 320 seconds.

My experiment consists of:

  1. Identify the Kubernetes node where the elected speaker pod is running. At the moment, the only way I know of doing this is to ping a VIP managed by MetalLB, look up the MAC address that corresponds to that VIP, and then determine which Kubernetes node owns that MAC address (a sketch of these commands follows the list). Let's call this Kubernetes node N.

  2. From a machine external to the Kubernetes cluster, run ping -a VIP.

  3. Using kubectl drain N, I drain the node, then shut it down.

  4. From that point on, it usually takes around 324 to 326 seconds for the VIP to fail over to a different Kubernetes node. To rule out ARP cache problems, I ran arp -d VIP on the external machine. In this case, fail-over time is again never lower than 320 seconds.
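For step 1, the MAC lookup can be done with commands along these lines (the VIP address and interface name are illustrative):

ping -c 1 10.115.195.206      # populate the local ARP cache with the VIP's current MAC
arp -n 10.115.195.206         # note which MAC is currently answering for the VIP
ip -o link show eth0          # run on each node to see which one owns that MAC
arp -d 10.115.195.206         # after failover, delete the cached entry to rule out stale ARP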

Is this really a normal behavior? Is there anything I can do to improve this ~5m to something closer to ~1m perhaps?

jenciso commented 4 years ago

Hi guys

I confirmed this behavior and it still appears in the latest version (v0.8.3). Instead of changing the controller-manager configuration to reduce the pod eviction time (--pod-eviction-timeout=2s), I used the taint-based evictions [1] option in my specific deployment. E.g.:

    spec:
      tolerations:
      - key: "node.kubernetes.io/unreachable"
        operator: "Exists"
        effect: "NoExecute"
        tolerationSeconds: 2
      - key: "node.kubernetes.io/not-ready"
        operator: "Exists"
        effect: "NoExecute"
        tolerationSeconds: 2

With this approach you don't need to change anything in the controller-manager configuration, and the failover time was reduced to 40s, which is still very high. I would like to suggest using another failover method (like in v0.6) that doesn't rely on Kubernetes endpoints to monitor the service. Failover should be fast, like with keepalived.

[1] https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/#taint-based-evictions

teralype commented 4 years ago

@jenciso but having to add tolerations to all of your deployments can be very tedious. Unless someone has Istio deployed, which seems to have support for manipulating tolerations since https://github.com/istio/istio/pull/13044. In any case, a little impractical for small customers.

@danderson any plans on getting this fixed for ARP/L2? Having to switch to BGP is not an option for us right now, mostly because our Calico CNI already peers with the upstream router. I tried an experiment which consists of peering MetalLB with Calico and Calico with the upstream routers, but this is not fully supported yet.

champtar commented 4 years ago

@danderson since k8s 1.14 there is the Lease API, which is basically cheap leader election; maybe with that we could go back to the 10s max failover that we had in 0.6?

danderson commented 4 years ago

@champtar Interesting! Leases seem very undocumented, the only thing I can find about them is that kubelet uses them. Do you know if there's more documentation on what makes them more lightweight? In particular, I'm mostly worried about the write load on etcd, so these leases would have to be effectively in-memory objects, and that raises a bunch of questions around consistency. So I'm curious to read more.

champtar commented 4 years ago

Sorry, I assumed they are cheaper, but maybe not by that much. I tried to find some docs and performance numbers but failed, so maybe it's just a clean generic API that also allows the leader election pattern but still makes etcd write to disk. To be continued...

champtar commented 4 years ago

https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/0009-node-heartbeat.md

champtar commented 4 years ago

The leader election example has been updated to use it: https://github.com/kubernetes/client-go/blob/master/examples/leader-election/main.go
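For reference, a trimmed-down sketch of that pattern applied per service (lease name, namespace, and timings are illustrative, not MetalLB code):

package main

import (
	"context"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// One Lease per LoadBalancer service; the holder is the node allowed to announce its IP.
	lock := &resourcelock.LeaseLock{
		LeaseMeta: metav1.ObjectMeta{Namespace: "metallb-system", Name: "l2-default-nginx"},
		Client:    client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{
			Identity: os.Getenv("NODE_NAME"), // this speaker's node name
		},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 10 * time.Second, // a dead leader is replaced after ~10s
		RenewDeadline: 8 * time.Second,
		RetryPeriod:   5 * time.Second, // ~0.2 qps of control-plane writes per service
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) { /* start answering ARP/NDP for the service IP */ },
			OnStoppedLeading: func() { /* withdraw the announcement */ },
		},
	})
}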

danderson commented 4 years ago

Based on the design doc, afaict it's cheaper mainly because the object is very small, so it minimizes the impact on etcd in terms of disk I/O. That's still relatively expensive in absolute terms (because talking to k8s in general is pretty expensive), but probably not too bad. It'll suck if you have 1000 services, just like it'll suck if you have 1000 nodes... But that's probably okay.

Longer term I would like to move all node selection into the controller and use something like Serf that can detect node failure in <1s, but unless that happens soon, I think using Lease is probably a good incremental step.

champtar commented 4 years ago

Serf seems nice

danderson commented 4 years ago

Yes, but that's a lot more upfront work. I think we still want to do it someday, but using Leases could be a quick way to drop failover time back to "a few seconds" (vs. "minutes"). We can always make it even better later.

champtar commented 4 years ago

Or instead of Serf, just https://github.com/hashicorp/memberlist for the failure detection, and keep everything else the same (if I understand correctly, the number of queries to the API is currently minimal).

danderson commented 4 years ago

AFAIK that's the underlying library that powers Serf. I definitely mean something in-process in the MetalLB Go binaries, not running a separate Serf cluster :).

Is there any "serf as a library" that's different from memberlist?

champtar commented 4 years ago

When you talked about Serf, I thought you also wanted to exchange information via the gossip protocol; if we only use it for failure detection it should be less complicated (but I haven't looked at the code ;) )

danderson commented 4 years ago

I haven't thought about exchanging information, I can't think what to exchange :) Yes, you're right, I think we only care about membership (specifically: join/leave of nodes).
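A sketch of what memberlist-based failure detection could look like inside a speaker (node names and peer addresses are illustrative; this is not current MetalLB code):

package main

import (
	"log"
	"os"

	"github.com/hashicorp/memberlist"
)

// events turns membership changes into "recompute who announces what" triggers.
type events struct{}

func (events) NotifyJoin(n *memberlist.Node)   { log.Printf("speaker up: %s", n.Name) }
func (events) NotifyLeave(n *memberlist.Node)  { log.Printf("speaker down: %s", n.Name) /* re-run leader selection */ }
func (events) NotifyUpdate(n *memberlist.Node) {}

func main() {
	cfg := memberlist.DefaultLANConfig()
	cfg.Name = os.Getenv("NODE_NAME") // this speaker's node name
	cfg.Events = events{}

	ml, err := memberlist.Create(cfg)
	if err != nil {
		log.Fatal(err)
	}
	// Join through any other speakers we already know about
	// (e.g. pod IPs from the speaker DaemonSet's endpoints).
	if _, err := ml.Join([]string{"10.32.0.11", "10.32.0.12"}); err != nil {
		log.Printf("join failed: %v", err)
	}
	select {} // keep probing; a dead peer is detected within a few seconds
}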

riking commented 4 years ago

@teralype you need to run kubectl delete node after the shutdown to properly tell k8s the node is intentionally gone