kubernetes-retired / kube-aws

[EOL] A command-line tool to declaratively manage Kubernetes clusters on AWS
Apache License 2.0

ELB IP changes can bring the cluster down #598

Closed danielfm closed 5 years ago

danielfm commented 7 years ago

I ran into https://github.com/kubernetes/kubernetes/issues/41916 twice in the last 3 days in my production cluster, with almost 50% of worker nodes transitioning to the NotReady state almost simultaneously on both days, causing a brief downtime in critical services due to Kubernetes' default (and aggressive) eviction policy for failing nodes.

I just contacted AWS support to validate the hypothesis of the ELB changing IPs at the time of both incidents, and the answer was yes.

My configuration (multi-node control plane with ELB) matches exactly the one in that issue, and probably most kube-aws users are subject to this.

Has anyone else run into this at some point?

mumoshu commented 7 years ago

@danielfm Thanks for the report.

I don't think I've ever encountered the issue in my own production cluster, but anyway, I believe this is a BIG issue.

For me, the issue seems to be composed of two parts: one is a long TTL other than the ELB's (for example, the TTL of the CNAME record in kube-aws that points to the ELB's DNS name), and the other is an issue in ELB and/or kubelet which prevents kubelet from detecting broken connections to the apiservers.

Is my assumption correct?

Anyway, once recordSetTTL in cluster.yaml is set to a value considerably lower than 60s (ELB's default TTL), kubelet should detect any dead long-polling connection to one of the API servers (reached via the CNAME -> ELB instances), re-discover the ELB instances via DNS, and reconnect to one of them.
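For example, in cluster.yaml (the value here is only an illustration):

# TTL, in seconds, of the record set kube-aws creates for the API endpoint
recordSetTTL: 10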

However, I'm happy to work around the issue in kube-aws.

Possible work-arounds

Perhaps we could run a watcher on each node that detects changes in the ELB's resolved IPs and restarts kubelet when they change, so that kubelet can reconnect to one of the active ELB IPs before the k8s node is marked NotReady?

If we go that way, the monitor should be executed periodically with an interval of roughly (<node-monitor-grace-period> - <time required for kubelet to start> - <time required for each monitor run>) / <a heuristic value: N> - <time required to report node status after kubelet starts>, so that kubelet gets about N chances to successfully report its status to a possibly available ELB before being marked NotReady.
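For example (illustrative numbers only): with a 40s node-monitor-grace-period, ~10s for kubelet to start, ~1s per monitor run, ~2s to report node status, and N=3, the interval would be roughly (40 - 10 - 1) / 3 - 2 ≈ 7.7 seconds.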

Or we can implement DNS round-robin with a health-checking mechanism for serving the k8s API endpoints, as suggested and described in #281 #373

mumoshu commented 7 years ago

This hash changes only when the backend IPs of an ELB have changed.

$ dig my.k8s.cluster.example.com +noall +answer +short | sort | sha256sum | awk '{print $1}'
96ff696e9a0845dd56b1764dd3cfaa1426cdfbd7510bab64983e97268f8e0bc4

$ dig my.k8s.cluster.example.com +noall +answer +short | sort
52.192.165.96
52.199.129.34
<elb id>.ap-northeast-1.elb.amazonaws.com.
mumoshu commented 7 years ago

@danielfm I guess setting recordSetTTL in cluster.yaml (which sets the TTL of the CNAME record associated with the controller ELB) to a value considerably lower than both the TTL of the ELB's DNS A records (=60s) and kubelet's --node-monitor-grace-period, and running the following service, would alleviate the issue.

/etc/systemd/system/elb-ip-change-watcher.service:

[Unit]
Description=Restart kubelet when the API endpoint ELB IPs change

[Service]
Type=notify
# Accept sd_notify messages from the script's children (systemd-notify runs as a separate process)
NotifyAccess=all
Restart=on-failure
Environment=API_ENDPOINT_DNS_NAME={{.APIEndpoint.DNSName}}
ExecStart=/opt/bin/elb-ip-change-watcher

/opt/bin/elb-ip-change-watcher:

#!/usr/bin/env bash

set -vxe

current_elb_backends_version() {
  dig ${API_ENDPOINT_DNS_NAME:?Missing required env var API_ENDPOINT_DNS_NAME} +noall +answer +short | \
    # take into account only IPs, even if dig returned a CNAME answer (i.e. when API_ENDPOINT_DNS_NAME is a CNAME rather than an A (or Route 53 "Alias") record)
    grep -o '[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}' | \
    # sort IPs so that DNS round-robin doesn't unexpectedly trigger a kubelet restart
    sort | \
    sha256sum | \
    # sha256sum returns outputs like "<sha256 hash value> -". We only need the hash value excluding the trailing hyphen
    awk '{print $1}'
}

run_once() {
  local file=$ELB_BACKENDS_VERSION_FILE
  prev_ver=$(cat $file || echo)
  current_ver=$(current_elb_backends_version)
  echo comparing the previous version "$prev_ver" and the current version "$current_ver"
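  # An empty previous version means this is the first run since boot; treat it as unchanged so kubelet isn't restarted right after startup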
  if [ "$prev_ver" == "" -o "$prev_ver" == "$current_ver" ]; then
    echo the version has not changed. no need to restart kubelet.
    if [ "$KUBELET_RESTART_STRATEGY" == "watchdog" ]; then
      echo "notifying kubelet's watchdog not to trigger a restart of kubelet..."
      local kubelet_pid
      kubelet_pid=$(systemctl show $KUBELET_SYSTEMD_UNIT_NAME -p MainPID | cut -d '=' -f 2)
      systemd-notify --pid=$kubelet_pid WATCHDOG=1
    fi
  else
    echo the version has been changed. need to restart kubelet.
    if [ "$KUBELET_RESTART_STRATEGY" == "systemctl" ]; then
      systemctl restart $KUBELET_SYSTEMD_UNIT_NAME
    fi
  fi
  echo writing $current_ver to $file
  echo "$current_ver" > $file
}

ELB_BACKENDS_VERSION_FILE=${ELB_BACKENDS_VERSION_FILE:-/var/run/coreos/elb-backends-version}
KUBELET_SYSTEMD_UNIT_NAME=${KUBELET_SYSTEMD_UNIT_NAME:-kubelet.service}
KUBELET_RESTART_STRATEGY=${KUBELET_RESTART_STRATEGY:-systemctl}
WATCH_INTERVAL_SEC=${WATCH_INTERVAL_SEC:-3}

systemd-notify --ready
while true; do
  systemd-notify --status "determining if there're changes in elb ips"
  run_once
  systemd-notify --status "sleeping for $WATCH_INTERVAL_SEC seconds"
  sleep $WATCH_INTERVAL_SEC
done
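Note that for KUBELET_RESTART_STRATEGY=watchdog to have any effect, kubelet.service itself has to be watchdog-supervised. A minimal sketch of a drop-in (the interval is just an example):

/etc/systemd/system/kubelet.service.d/10-watchdog.conf:

[Service]
# Kill and restart kubelet if no WATCHDOG=1 notification arrives within this interval
WatchdogSec=90s
# Accept sd_notify messages sent on kubelet's behalf by another process
NotifyAccess=all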
mumoshu commented 7 years ago

Also, I had never realized this, but we're creating CNAME records for controller ELBs, and those have their own TTL (300 seconds by default in kube-aws) in addition to the TTL of the ELB's A records: https://github.com/kubernetes-incubator/kube-aws/blob/master/core/controlplane/config/templates/stack-template.json#L778 We'd better make them Route 53 Alias records so that there is only one TTL.
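For reference, an Alias record in the stack template would look roughly like the following (the resource and parameter names here are made up, not the actual ones in stack-template.json):

"APIEndpointRecordSet": {
  "Type": "AWS::Route53::RecordSet",
  "Properties": {
    "HostedZoneId": { "Ref": "HostedZoneId" },
    "Name": "my.k8s.cluster.example.com.",
    "Type": "A",
    "AliasTarget": {
      "DNSName": { "Fn::GetAtt": ["ControllerElb", "DNSName"] },
      "HostedZoneId": { "Fn::GetAtt": ["ControllerElb", "CanonicalHostedZoneNameID"] }
    }
  }
}

An Alias record has no TTL of its own; Route 53 answers with the ELB's records directly, so there would be only one TTL to reason about.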

mumoshu commented 7 years ago

Updated my assumption on the issue in my first comment https://github.com/kubernetes-incubator/kube-aws/issues/598#issuecomment-297586503

mumoshu commented 7 years ago

Hmm, the TTL of a CNAME record associated with an ELB's DNS name seems to be capped at 60s, even though kube-aws' default recordSetTTL is 300s. Then we won't need to care that much about the balance between the two TTLs (kube-aws' CNAME and the ELB's A records).
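(The effective TTLs can be checked with dig; using the same example endpoint as above, the answer section looks something like this, with the remaining TTL in the second column:)

$ dig my.k8s.cluster.example.com +noall +answer
my.k8s.cluster.example.com. 60 IN CNAME <elb id>.ap-northeast-1.elb.amazonaws.com.
<elb id>.ap-northeast-1.elb.amazonaws.com. 60 IN A 52.192.165.96
<elb id>.ap-northeast-1.elb.amazonaws.com. 60 IN A 52.199.129.34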

mumoshu commented 7 years ago

Anyway, forcibly restarting kubelet so that it reconnects to the apiserver(s) when necessary (i.e. when the ELB IPs have changed) would still be important. The script and the systemd unit I suggested in https://github.com/kubernetes-incubator/kube-aws/issues/598#issuecomment-297615293 would be useful for achieving that.

redbaron commented 7 years ago

@mumoshu does it mean that going ELB-less, with just a CNAME record in Route 53 pointing to the controller nodes, is considered dangerous now?

mumoshu commented 7 years ago

@redbaron Do you mean a single CNAME record for your k8s API endpoint which points to just one controller node's DNS name?

I've been assuming that, if you go without an ELB, you would have a DNS name associated with one or more A records (rather than a CNAME), each pointing to one controller node's public/private IP.

redbaron commented 7 years ago

That doesn't change the fact that the final set of A records returned from a DNS request changes when you trigger a controllers ASG update, right? From what I read here, that matches the case where ELBs change their IPs.

mumoshu commented 7 years ago

@redbaron Ah, so you have a CNAME record which points to a DNS name with multiple A records, each associated with one of the controller nodes, right?

If so, no, as long as you have low TTLs for your DNS records.

AFAIK, someone in the upstream issue said that it becomes an issue only when an ELB is involved. Perhaps the ELB doesn't immediately shut down an unneeded instance and doesn't send a FIN? If that's the case, kubelet would be unable to detect broken connections immediately.

When your controllers ASG is updated, old, unneeded nodes are terminated outright rather than lingering in a non-functional state the way ELB instances apparently do.

So your ELB-less mode with CNAME+A records should be safe, as long as you have health checks that update the Route 53 record set to eventually return A records only for healthy controller nodes, and a TTL lower than --node-monitor-grace-period.

redbaron commented 7 years ago

The ELB-less mode I refer to is a recent feature in kube-aws. I didn't check which records it creates exactly; I just wanted to verify that it is still a safe option considering this bug report.

mumoshu commented 7 years ago

@redbaron If you're referring to the DNS round-robin for API endpoints, it isn't implemented yet.

redbaron commented 7 years ago

I wonder if using an ALB can help here. An L7 load balancer knows precisely about all incoming/outgoing requests and can forcibly yet safely close HTTP connections when it scales/changes IP addresses.

mumoshu commented 7 years ago

@redbaron Thanks, I never realized that ALBs might help. I hope they can, too. Although I'd like contributors in the upstream issue to kindly clarify whether ALBs help or not, shall we take this opportunity to add support for ALBs to kube-aws anyway? 😃

camilb commented 7 years ago

Today it happened to me for the first time, with a 128-day-old cluster.

danielfm commented 7 years ago

Seems like Amazon is doing some serious maintenance work in the ELB infrastructure these days...

mumoshu commented 7 years ago

@danielfm @camilb @redbaron For me it seems like there are still no obvious fixes other than:

  • restarting kubelet whenever the ELB's IPs change (the watcher script above), or
  • replacing the ELB with a DNS round-robin whose record sets are kept up to date with health checks, so that they keep working across rolling-updates of controller nodes, sudden termination of controller nodes, unresponsive apiserver containers, etc.

Would you like to proceed with any of them, or do you have any other idea(s)?

tarvip commented 7 years ago

Rolling-updates of controller nodes, sudden termination of controller nodes, unresponsive apiserver containers, etc

I think sudden termination, an unresponsive apiserver, etc. shouldn't cause problems as long as these events are not happening at the same time, but in that case even the ELB setup fails.

But keeping the Route 53 record up to date is a bit more complicated without using Lambda and SNS. When adding, restarting, or replacing (terminating and recreating) controller nodes, we could have a unit in cloud-config that is executed when the host comes up; this script would get all controller nodes that are part of the ASG and modify the Route 53 record accordingly (roughly as sketched below). But I don't know how to handle node removal (without a new node being added back), although I think that doesn't happen very often.

Also, I think it shouldn't cause problems if the Route 53 record is not fully up to date when restarting or replacing controller nodes; as long as they are not all replaced at the same time, there should be at least one working controller.
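A rough sketch of what that unit's script could look like, using the AWS CLI (the variable names, record set, and TTL are made up for illustration; error handling is omitted):

#!/usr/bin/env bash
# Hypothetical sketch: point the API endpoint's A records at the controller ASG's
# in-service instances. ASG_NAME, HOSTED_ZONE_ID and RECORD_NAME are assumed env vars.
set -euo pipefail

# IDs of controller instances currently InService in the ASG
instance_ids=$(aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names "$ASG_NAME" \
  --query 'AutoScalingGroups[0].Instances[?LifecycleState==`InService`].InstanceId' \
  --output text)

# Their private IPs (instance_ids is intentionally unquoted so each ID becomes a separate argument)
ips=$(aws ec2 describe-instances --instance-ids $instance_ids \
  --query 'Reservations[].Instances[].PrivateIpAddress' --output text)

# Build one A-record value per controller node
records=$(for ip in $ips; do printf '{"Value":"%s"},' "$ip"; done)
records=${records%,}

aws route53 change-resource-record-sets \
  --hosted-zone-id "$HOSTED_ZONE_ID" \
  --change-batch "{
    \"Changes\": [{
      \"Action\": \"UPSERT\",
      \"ResourceRecordSet\": {
        \"Name\": \"$RECORD_NAME\",
        \"Type\": \"A\",
        \"TTL\": 10,
        \"ResourceRecords\": [$records]
      }
    }]
  }"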

mumoshu commented 7 years ago

Thanks @tarvip!

I think sudden termination, an unresponsive apiserver, etc. shouldn't cause problems as long as these events are not happening at the same time, but in that case even the ELB setup fails.

Sorry if I wasn't clear enough, but the above concerns are not specific to this problem; they're more general. I just didn't want to introduce a new issue due to missing health checks / constant updates to route 53 record sets.

mumoshu commented 7 years ago

Also, I think it shouldn't cause problems if the Route 53 record is not fully up to date when restarting or replacing controller nodes; as long as they are not all replaced at the same time, there should be at least one working controller.

Probably that's true. Then, my question is whether everyone is OK with, e.g., a 50% k8s API error rate persisting for several minutes while one of two controller nodes is being replaced.

mumoshu commented 7 years ago

But I don't know how to handle node removal (without a new node being added back), although I think that doesn't happen very often.

If we go without CloudWatch Events + Lambda, we probably need a systemd timer which periodically triggers a script to update the Route 53 record set, so that it eventually reflects controller nodes terminated either expectedly or unexpectedly, right?
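For example, a timer/service pair along these lines on each controller (unit names are made up; the script would be along the lines of the sketch above):

/etc/systemd/system/route53-controller-sync.service:

[Unit]
Description=Sync the API endpoint record set with the current controller nodes

[Service]
Type=oneshot
ExecStart=/opt/bin/route53-controller-sync

/etc/systemd/system/route53-controller-sync.timer:

[Unit]
Description=Periodically run route53-controller-sync

[Timer]
OnBootSec=1min
OnUnitActiveSec=30s

[Install]
WantedBy=timers.target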

redbaron commented 7 years ago

@mumoshu, is ALB known not to help here?

tarvip commented 7 years ago

I just didn't want to introduce a new issue due to missing health checks / constant updates to route 53 record sets.

I think a missing health check is OK; that health check is just a TCP check anyway, and kubelet and kube-proxy can also detect connection failures and recreate the connection to another host. Regarding constant updates, I guess we can perform an update only if the current state differs from what's in Route 53.

mumoshu commented 7 years ago

Thanks @redbaron. No, but it isn't known to help either.

tarvip commented 7 years ago

If we go without CloudWatch Events + Lambda, we probably need a systemd timer which periodically triggers a script to update the Route 53 record set, so that it eventually reflects controller nodes terminated either expectedly or unexpectedly, right?

Yes, that is one way to solve it.

mumoshu commented 7 years ago

@tarvip Thanks for the reply!

I think a missing health check is OK; that health check is just a TCP check anyway, and kubelet and kube-proxy can also detect connection failures and recreate the connection to another host.

Ah, that makes sense to me! At least it seems worth trying now. Thanks for the clarification.

mumoshu commented 7 years ago

@tarvip

Regarding constant updates, I guess we can perform an update only if the current state differs from what's in Route 53.

Agreed

redbaron commented 7 years ago

As a band-aid, can't we query kubelet itself to cause it to go to the apiserver for a response? If such a query exists, then running it on a timer would let kubelet detect problems sooner.

mumoshu commented 7 years ago

As a band-aid, can't we query kubelet itself to cause it to go to the apiserver for a response? If such a query exists, then running it on a timer would let kubelet detect problems sooner.

Maybe my example is not very good, but you mean something like etcdctl endpoint health for etcd? If so, that sounds good to me. Unfortunately, I'm not aware of any concrete kubelet feature available for that.

tarvip commented 7 years ago

a 50% k8s API error rate persisting for several minutes while one of two controller nodes is being replaced

I think this affects only other clients like kubectl, not kube-proxy and kubelet. Kubelet and kube-proxy detect that one API server is down, recreate the connection to the other controller, and use that connection (and controller) until something happens to it again. I tested this some time ago and it worked well; AFAIK they don't even refresh their connections.

danielfm commented 7 years ago

If we go without CloudWatch Events + Lambda, we probably need a systemd timer which periodically triggers a script to update the Route 53 record set, so that it eventually reflects controller nodes terminated either expectedly or unexpectedly, right?

I think using Lambda is not a solution since it's not available in every AWS region.

danielfm commented 7 years ago

@mumoshu I'm using your periodic ELB-monitoring and kubelet-restarting script for the time being.

mumoshu commented 7 years ago

@tarvip

I think this affects only other clients like kubectl, not kube-proxy and kubelet. Kubelet and kube-proxy detect that one API server is down, recreate the connection to the other controller, and use that connection (and controller) until something happens to it again.

Thanks again for the clarification! It does make sense to me. Then, assuming we go ahead with the systemd timer method, there won't be huge problems as long as we set its interval short enough (it could be several seconds at minimum).

mumoshu commented 7 years ago

Thanks @danielfm,

I think using Lambda is not a solution since it's not available in every AWS region.

Good point 👍

tarvip commented 7 years ago

there won't be huge problems as long as we set its interval short enough

I think in a normal situation it can be even longer, let's say once a minute, but during a CloudFormation update it must be shorter, as replacing controller nodes can be quite fast.

Then there is the question of where to run this timer: should all controller nodes execute it? With 2 or 3 nodes and a short timer interval, hopefully it won't create additional problems if they all try to update the Route 53 record at the same time.

mumoshu commented 7 years ago

@tarvip

Then there is the question of where to run this timer: should all controller nodes execute it?

Yes, in the simplest case, I believe so.

IMHO we can even introduce some kind of leader election so that only one controller node is responsible for updating the record sets. However, that tempts me to make the record-set updater a dockerized golang app which implements leader election using the k8s client, like cluster-autoscaler and other k8s products do 😃

mumoshu commented 7 years ago

Hi @tarvip @danielfm,

Regarding the missing health-check thing, just a quick thought, but can we deploy an nginx-ingress controller by default to load-balance against healthy apiserver pods only? nginx-ingress (or even the vanilla open-source edition of Nginx) doesn't support active monitoring, but it does seem to support passive monitoring, i.e. removing a failed backend from the list of backends for a while. It isn't a complete solution, but better than nothing?
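(For reference, the passive monitoring in vanilla Nginx is just the max_fails/fail_timeout parameters on upstream servers; a rough sketch with placeholder controller IPs:)

upstream apiservers {
    # Take a backend out of rotation for 30s after 3 consecutive failed attempts
    server 10.0.0.10:443 max_fails=3 fail_timeout=30s;
    server 10.0.0.11:443 max_fails=3 fail_timeout=30s;
}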

tarvip commented 7 years ago

If it works then why not, but then there is the chicken-and-egg situation when bootstrapping the cluster: kubelet and kube-proxy cannot connect to the API as there is no ingress controller yet, but you cannot deploy the ingress controller because there are no worker nodes available.

mumoshu commented 7 years ago

@tarvip Thanks! I was blindly thinking about deploying an ingress controller as a static pod per controller node, but obviously it doesn't work, as we have no way to provide an ingress resource for it 😢

roydq commented 7 years ago

I ran into this issue a few weeks ago myself. I changed the CNAME TTL at the time, but it appears that isn't really a fix. If I'm not mistaken, the check/restart script is probably the most reliable solution for the time being. Based on the upstream issue, kubelet keeps the connection open for up to 15 minutes because it never receives any data telling it to close. When the ELB instances are terminated and the old IPs are no longer reachable, kubelet never receives the data it normally would if your controller were shut down, the controller-manager process were killed, etc.

DNS round-robin would probably fix most cases, since it gets rid of the ELB, which wasn't sending any data when the connection was closed... But I imagine there is still some possibility that other unexpected connection issues could leave kubelet just sitting there thinking nothing happened. It seems like a lot of work to implement in kube-aws, and it doesn't change the fact that kubelet needs some improvement :).

mumoshu commented 7 years ago

FYI ALB doesn't seem to fit as a k8s API load balancer. Please see #608 and the corresponding PR.

So, we would need to settle on some solution other than ALB anyway.

mumoshu commented 7 years ago

Hi @roybotnik, thanks for sharing the hard-won experience 👍 Good to know the CNAME TTL fix misses the point.

Seems like a lot of work to implement in kube-aws

Could be 😄

mumoshu commented 7 years ago

@danielfm @redbaron @camilb @tarvip @roybotnik

In the meantime, would you agree that we should do the following, in order of priority?

  1. Enable the check/restart script for the K8S API ELB
    • As the script is considered the most effective (though dirty) solution, it could be introduced without changing things too much
  2. Implement the DNS round-robin based K8S API endpoint to see how it works in the wild - including the longer-term reliability when integrated with kubelets
  3. Revert 1 once it turns out the issue is completely fixable outside of kube-aws - maybe in kubelet
tarvip commented 7 years ago

I agree, although I'm not using this script at the moment. For now we have only increased --node-monitor-grace-period to 90s in kube-controller-manager. I don't like it, as it is not fast enough at detecting whether a node is really down, but I figured that if it happens again I will set up DNS round-robin. We haven't seen this issue since February.

Also, maybe that script should restart kube-proxy as well.

roydq commented 7 years ago

I was thinking about this yesterday, and a simple solution that works around the kubelet connection timeout issue (without caring about ELB specifically) might be to query the API server itself for the node's status.

Since kube-controller-manager won't evict pods until a node has been offline for a while (5 minutes by default, I believe), we can query every 60s or so to see whether the node status is not ready and then restart kubelet. I haven't tested this fully, and the script was kind of thrown together, so it could probably be simplified (I'm not great with bash scripting):

cloud-config-worker

    - name: node-status-check.service
      enable: true
      command: start
      runtime: true
      content: |
        [Unit]
        Description=Restart kubelet if node is not ready

        [Service]
        Environment=API_ENDPOINT_DNS_NAME={{.APIEndpoint.DNSName}}
        ExecStart=/opt/bin/node-status-check
        Restart=always
        RestartSec=60

/opt/bin/node-status-check

#!/usr/bin/env bash

# Get node info from API server
NODE_INFO=$(curl https://$API_ENDPOINT_DNS_NAME/api/v1/nodes/$HOSTNAME --key /etc/kubernetes/ssl/worker-key.pem --cert /etc/kubernetes/ssl/worker.pem --cacert /etc/kubernetes/ssl/ca.pem --cert-type PEM -s)
if [ $? -ne 0 ]; then
  echo "Unable to query API endpoint. Verify API endpoint is online and accessible to nodes."
  exit 1;
fi

# Select node conditions from json response
NODE_CONDITIONS=$(jq '.status.conditions[] | {(.type): .status}' -M -c <<< "$NODE_INFO")
if [ $? -ne 0 ]; then
  echo "API returned an invalid JSON response, unable to check status."
  exit 1;
fi

echo "Current node conditions:"
echo $NODE_CONDITIONS

if grep -q "\\\"Ready\\\":\\\"True\\\"" <<< $NODE_CONDITIONS
then
  echo "API server reports node is ready, no actions taken"
else
  echo "API server reports node status not ready, restarting kubelet"
  systemctl restart kubelet.service
fi
mumoshu commented 7 years ago

Hi @danielfm @redbaron @camilb @tarvip @roybotnik, have you encountered this issue since then? I still have had no luck(??) encountering this issue in my clusters, and therefore I'm unable to decide whether this is resolved or not.

redbaron commented 7 years ago

It didn't happen to us yet, but the upstream issue has a clean way to reproduce the problem without an ELB or any cloud provider whatsoever.

tarvip commented 7 years ago

We haven't seen this since the initial bug report, but we also increased node-monitor-grace-period; maybe that helps.

camilb commented 7 years ago

Hi @mumoshu, it hasn't happened to me since then, and it only happened once.