Closed danielfm closed 5 years ago
@danielfm Thanks for the report.
I don't think I've ever encountered this issue in my production cluster, but I believe it's a BIG issue anyway.
To me, the issue seems to be composed of two parts: one is a TTL longer than the ELB's (for example the TTL of kube-aws' CNAME record, which points to the ELB's DNS name), and the other is an issue in ELB and/or kubelet that prevents kubelet from detecting broken connections to apiservers.
Is my assumption correct?
Anyway, once recordSetTTL in cluster.yaml is set considerably lower than 60s (=ELB's default TTL), kubelet should detect any dead long-polling connection to one of the API servers (via an ELB CNAME -> ELB instances), then re-discover the ELB instances via DNS and reconnect to one of them.
However, I'm happy to work around the issue in kube-aws. Perhaps we could add something so that kubelet can reconnect to one of the active ELB IPs before the k8s node is marked NotReady?
If we go that way, the monitor should run periodically with an interval of at most:
(<node-monitor-grace-period> - <time required for kubelet to start> - <time required for each monitor run>) / <a heuristic value N> - <time required to report node status after kubelet starts>
so that kubelet gets at least N chances to successfully report its status to a possibly available ELB.
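For illustration, here is the interval formula above evaluated with hypothetical numbers (every value below is made up for the example):

```bash
# Illustration only: plug hypothetical values (in seconds) into the
# interval formula above. All numbers are made up.
node_monitor_grace_period=40
kubelet_start_time=5   # time required for kubelet to start
monitor_run_time=1     # time required for each monitor run
report_delay=2         # time to report node status after kubelet starts
n=3                    # heuristic N: report chances to grant before NotReady

# Integer arithmetic: (40 - 5 - 1) / 3 - 2 = 9
interval=$(( (node_monitor_grace_period - kubelet_start_time - monitor_run_time) / n - report_delay ))
echo "monitor interval: ${interval}s"
```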
Or we can implement DNS round-robin with a health-checking mechanism for serving k8s API endpoints, as suggested and described in #281 and #373.
This hash changes only when the backend IPs of an ELB have changed.

```
$ dig my.k8s.cluster.example.com +noall +answer +short | sort | sha256sum | awk '{print $1}'
96ff696e9a0845dd56b1764dd3cfaa1426cdfbd7510bab64983e97268f8e0bc4
$ dig my.k8s.cluster.example.com +noall +answer +short | sort
52.192.165.96
52.199.129.34
<elb id>.ap-northeast-1.elb.amazonaws.com.
```
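To see why the `sort` matters, here is a small self-contained check (the IPs and the ELB hostname are made up) showing that two round-robin orderings of the same DNS answer hash identically:

```bash
# Run the pipeline above on canned dig output: thanks to `sort`, two
# round-robin orderings of the same answer produce the same hash.
# The IPs and the ELB hostname below are made up.
hash_of() {
  printf '%s\n' "$@" |
    grep -o '[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}' |
    sort | sha256sum | awk '{print $1}'
}

a=$(hash_of 52.192.165.96 52.199.129.34 some-elb-id.ap-northeast-1.elb.amazonaws.com.)
b=$(hash_of 52.199.129.34 52.192.165.96 some-elb-id.ap-northeast-1.elb.amazonaws.com.)
[ "$a" = "$b" ] && echo "hashes match"
```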
@danielfm I guess setting recordSetTTL in cluster.yaml (which sets the TTL of the CNAME record associated with a controller ELB) to a value considerably lower than both the TTL of the ELB's DNS A records (=60s) and kubelet's --node-monitor-grace-period, and running the following service, would alleviate the issue.
/etc/systemd/system/elb-ip-change-watcher.service:

```ini
[Service]
Type=notify
Restart=on-failure
Environment=API_ENDPOINT_DNS_NAME={{.APIEndpoint.DNSName}}
ExecStart=/opt/bin/elb-ip-change-watcher
```
/opt/bin/elb-ip-change-watcher:

```bash
#!/usr/bin/env bash

set -vxe

current_elb_backends_version() {
  dig "${API_ENDPOINT_DNS_NAME:?Missing required env var API_ENDPOINT_DNS_NAME}" +noall +answer +short |
    # Take into account only IPs, even if dig returned a CNAME answer
    # (i.e. when API_ENDPOINT_DNS_NAME is a CNAME rather than an A or Route 53 "Alias" record).
    grep -o '[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}' |
    # Sort IPs so that DNS round-robin doesn't unexpectedly trigger a kubelet restart.
    sort |
    sha256sum |
    # sha256sum outputs "<sha256 hash> -". We only need the hash, without the trailing hyphen.
    awk '{print $1}'
}

run_once() {
  local file=$ELB_BACKENDS_VERSION_FILE
  prev_ver=$(cat "$file" 2>/dev/null || echo)
  current_ver=$(current_elb_backends_version)
  echo comparing the previous version "$prev_ver" and the current version "$current_ver"
  if [ "$prev_ver" == "" -o "$prev_ver" == "$current_ver" ]; then
    echo the version has not changed. no need to restart kubelet.
    if [ "$KUBELET_RESTART_STRATEGY" == "watchdog" ]; then
      echo "notifying kubelet's watchdog not to trigger a restart of kubelet..."
      local kubelet_pid
      kubelet_pid=$(systemctl show "$KUBELET_SYSTEMD_UNIT_NAME" -p MainPID | cut -d '=' -f 2)
      systemd-notify --pid="$kubelet_pid" WATCHDOG=1
    fi
  else
    echo the version has changed. need to restart kubelet.
    if [ "$KUBELET_RESTART_STRATEGY" == "systemctl" ]; then
      systemctl restart "$KUBELET_SYSTEMD_UNIT_NAME"
    fi
  fi
  echo writing "$current_ver" to "$file"
  echo "$current_ver" > "$file"
}

ELB_BACKENDS_VERSION_FILE=${ELB_BACKENDS_VERSION_FILE:-/var/run/coreos/elb-backends-version}
KUBELET_SYSTEMD_UNIT_NAME=${KUBELET_SYSTEMD_UNIT_NAME:-kubelet.service}
KUBELET_RESTART_STRATEGY=${KUBELET_RESTART_STRATEGY:-systemctl}
WATCH_INTERVAL_SEC=${WATCH_INTERVAL_SEC:-3}

systemd-notify --ready
while true; do
  systemd-notify --status "determining if there're changes in elb ips"
  run_once
  systemd-notify --status "sleeping for $WATCH_INTERVAL_SEC seconds"
  sleep "$WATCH_INTERVAL_SEC"
done
```
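For the `watchdog` restart strategy to take effect, kubelet's own unit would also need systemd's watchdog enabled. A minimal drop-in sketch (the path, `WatchdogSec` value, and the use of `NotifyAccess=all` are my assumptions, not something the watcher script above configures):

```ini
# /etc/systemd/system/kubelet.service.d/10-watchdog.conf (illustrative path)
[Service]
# Treat the service as failed (and restart it, given Restart=) when no
# WATCHDOG=1 ping arrives within this window.
WatchdogSec=30s
# Accept sd_notify messages beyond the unit's main process, so pings sent
# on kubelet's behalf are not discarded.
NotifyAccess=all
Restart=on-failure
```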
Also, I had never realized it, but we're creating CNAME records for controller ELBs, which have their own TTL (=300 seconds by default in kube-aws) in addition to the ELB's A records' TTL. https://github.com/kubernetes-incubator/kube-aws/blob/master/core/controlplane/config/templates/stack-template.json#L778 We'd better make them Route 53 Alias records so that there is only one TTL.
Updated my assumption on the issue in my first comment https://github.com/kubernetes-incubator/kube-aws/issues/598#issuecomment-297586503
Hmm, the TTL of a CNAME record associated with an ELB's DNS name seems to be capped at 60s, even though kube-aws' default recordSetTTL is set to 300s.
Then we won't need to care that much about the balance between the two TTLs (kube-aws' CNAME and the ELB's A records).
Anyway, forcibly restarting kubelet so that it reconnects to the apiserver(s) when necessary (=when ELB IPs change) would still be important. The script and the systemd unit I suggested in https://github.com/kubernetes-incubator/kube-aws/issues/598#issuecomment-297615293 would be useful to achieve that.
@mumoshu does it mean that going with ELB-less mode, with just a CNAME record in Route53 pointing to controller nodes, is considered dangerous now?
@redbaron Do you mean a single CNAME record for your k8s API endpoint which points to just one controller node's DNS name?
I've been assuming that, if you go without an ELB, you'd have a DNS name associated with one or more A records (rather than a CNAME), each pointing to one controller node's public/private IP.
That doesn't change the fact that the final set of A records returned from a DNS request changes when you trigger a controllers ASG update, right? From what I read here, that matches the case when ELBs change their IPs.
@redbaron Ah, so you have a CNAME record which points to a DNS record containing multiple A records, each associated with one controller node, right?
If so, no, as long as you have low TTLs for your DNS records.
AFAIK, someone in the upstream issue said that it becomes an issue only when an ELB is involved. Perhaps ELB doesn't immediately shut down an unnecessary instance and doesn't send FIN? If that's the case, kubelet would be unable to detect broken connections immediately.
When your controllers ASG is updated, old, unnecessary nodes would be terminated before they become non-functional, unlike ELB's instances.
So your ELB-less mode with CNAME+A records would be safe as long as you have health checks that update the Route 53 record set to eventually return A records only for healthy controller nodes, and a TTL lower than --node-monitor-grace-period.
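For reference, such a health-checked record set could look roughly like the following Route 53 change batch, with one weighted A record per controller. This is only a sketch; the record name, IP, and health check ID are placeholders:

```json
{
  "Comment": "Sketch: one weighted A record per controller, guarded by a health check",
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "my.k8s.cluster.example.com",
        "Type": "A",
        "SetIdentifier": "controller-0",
        "Weight": 1,
        "TTL": 10,
        "HealthCheckId": "<health check id>",
        "ResourceRecords": [{ "Value": "10.0.0.1" }]
      }
    }
  ]
}
```

Route 53 then stops returning the record while its associated health check is failing, which is what makes the low TTL effective.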
The ELB-less mode I refer to is a recent feature in kube-aws; I didn't check which records it creates exactly, I just wanted to verify that it is still a safe option considering this bug report.
@redbaron If you're referring to the DNS round-robin for API endpoints, it isn't implemented yet.
I wonder if using an ALB can help here. An L7 load balancer knows precisely about all incoming/outgoing requests and can forcibly yet safely close HTTP connections when the ELB scales/changes IP addresses.
@redbaron Thanks, I had never realized that ALBs may help. I hope they do, too. Although I'd like the contributors in the upstream issue to kindly clarify whether ALBs help or not, shall we take this opportunity to add support for ALBs to kube-aws anyway? 😃
Today happened to me for the first time with a 128 days old cluster.
Seems like Amazon is doing some serious maintenance work in the ELB infrastructure these days...
@danielfm @camilb @redbaron For me it seems like there are still no obvious fixes other than:
Would you like to proceed with any of them or any other idea(s)?
Rolling updates of controller nodes, sudden termination of controller nodes, unresponsive apiserver containers, etc.
I think sudden termination, an unresponsive apiserver, etc. shouldn't cause problems as long as these events don't happen at the same time, but in this case even the ELB setup fails.
But keeping the route53 record up-to-date is a bit more complicated without using Lambda and SNS.
When adding, restarting, or replacing (terminating and recreating) controller nodes, we could have a unit in cloud-config that is executed when the host comes up; this script would get all controller nodes that are part of the ASG and modify the route53 record accordingly. But I don't know how to handle node removal (without adding a new node back, although I think this doesn't happen very often).
Also, I think it shouldn't cause problems if the route53 record is not fully up-to-date when restarting or replacing controller nodes; as long as they are not replaced at the same time, there should be at least one working controller.
Thanks @tarvip!
I think sudden termination, an unresponsive apiserver, etc. shouldn't cause problems as long as these events don't happen at the same time, but in this case even the ELB setup fails.
Sorry if I wasn't clear enough, but the above concerns are not specific to this problem; they are more general. I just didn't want to introduce a new issue due to missing health checks / constant updates to route 53 record sets.
Also, I think it shouldn't cause problems if the route53 record is not fully up-to-date when restarting or replacing controller nodes; as long as they are not replaced at the same time, there should be at least one working controller.
Probably that's true. Then my question is whether everyone is ok with, e.g., a 50% k8s API error rate persisting for several minutes while one of two controller nodes is being replaced?
But I don't know how to handle node removal (without adding a new node back, although I think this doesn't happen very often).
If we go without CloudWatch Events + Lambda, we probably need a systemd timer which periodically triggers a script to update the route53 record set, so that it eventually reflects controller nodes terminated either expectedly or unexpectedly, right?
@mumoshu, is ALB known not to help here?
I just didn't want to introduce a new issue due to missing health checks / constant updates to route 53 record sets.
I think a missing health check is ok; that health check is just a TCP check anyway, and kubelet and kube-proxy can also detect connection failures and recreate the connection to another host. Regarding constant updates, I guess we can perform an update only if the current state differs from route53.
Thanks @redbaron - No but it isn't known to help either.
If we go without CloudWatch Events + Lambda, we probably need a systemd timer which periodically triggers a script to update the route53 record set, so that it eventually reflects controller nodes terminated either expectedly or unexpectedly, right?
Yes, that is one way to solve it.
@tarvip Thanks for the reply!
I think a missing health check is ok; that health check is just a TCP check anyway, and kubelet and kube-proxy can also detect connection failures and recreate the connection to another host.
Ah, makes sense to me! At least it seems to worth trying for me now - thanks for the clarification.
@tarvip
Regarding constant updates, I guess we can perform an update only if the current state differs from route53.
Agreed
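The "update only when the state differs" idea could be sketched like this. The helper and its inputs are illustrative; actually fetching the desired and current IP lists (e.g. via the aws CLI) is left out, and the comparison is order-insensitive so round-robin ordering never triggers a spurious update:

```bash
# Sketch of "update route53 only when it differs". Compares two
# whitespace-separated IP lists order-insensitively; fetching the lists
# (e.g. via the aws CLI) is out of scope here.
needs_update() {
  local desired=$1 current=$2
  [ "$(tr ' ' '\n' <<< "$desired" | sort)" != "$(tr ' ' '\n' <<< "$current" | sort)" ]
}

if needs_update "10.0.0.1 10.0.0.2" "10.0.0.2 10.0.0.1"; then
  echo "record set differs; would call route53 here"
else
  echo "in sync; skipping update"
fi
```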
As a band-aid, can't we query kubelet itself to cause it to go to the apiserver for a response? If such a query exists, then running it on a timer would let kubelet detect problems sooner.
As a band-aid, can't we query kubelet itself to cause it to go to the apiserver for a response? If such a query exists, then running it on a timer would let kubelet detect problems sooner.
Maybe my example is not very good, but something like etcdctl endpoint health for etcd?
If so, that sounds good to me. Unfortunately, I'm not aware of any concrete kubelet function available for that.
a 50% k8s API error rate persisting for several minutes while one of two controller nodes is being replaced?
I think this affects only other clients like kubectl, not kube-proxy and kubelet. Kubelet and kube-proxy detect that one API server is down, recreate the connection to another controller, and use this connection (and controller) until something happens again with that connection. I tested this some time ago and it worked well; AFAIK they are not even refreshing their connections.
If we go without CloudWatch Events + Lambda, we probably need a systemd timer which periodically triggers a script to update the route53 record set, so that it eventually reflects controller nodes terminated either expectedly or unexpectedly, right?
I think using Lambda is not a solution, since it's not available in every AWS region.
@mumoshu I'm using your periodic elb monitoring and kubelet restarting script for the time being.
@tarvip
I think this affects only other clients like kubectl, not kube-proxy and kubelet. Kubelet and kube-proxy detect that one API server is down, recreate the connection to another controller, and use this connection (and controller) until something happens again with that connection.
Thanks again for the clarification! It does make sense to me. Then, assuming we go ahead with the systemd timer method, there won't be huge problems as long as we set its interval short enough (it could be several seconds at minimum).
Thanks @danielfm,
I think using Lambda is not a solution since it's not available on every AWS region.
Good point 👍
there won't be huge problems as long as we set its interval short enough
I think in a normal situation it can be even longer, let's say once a minute, but for updating CloudFormation it must be shorter, as replacing controller nodes can be quite fast.
Then there is the question of where to run this timer: should all controller nodes execute it? If using 2 or 3 nodes and a short timer interval, hopefully it won't create additional problems if they all try to update the route53 record at the same time.
@tarvip
Then there is the question of where to run this timer: should all controller nodes execute it?
Yes, in the simplest case, I believe so.
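Such a timer could look roughly like the following pair of units. This is a sketch only; the unit names, the updater path, and the one-minute interval are all assumptions, and the updater script itself is hypothetical:

```ini
# /etc/systemd/system/route53-updater.timer (illustrative)
[Unit]
Description=Periodically sync the API endpoint record set with live controllers

[Timer]
OnBootSec=1min
OnUnitActiveSec=1min

[Install]
WantedBy=timers.target

# /etc/systemd/system/route53-updater.service (illustrative)
[Unit]
Description=Sync the API endpoint record set with live controllers

[Service]
Type=oneshot
ExecStart=/opt/bin/route53-updater
```

Installed on every controller node, each would independently converge the record set, which is harmless if the updater only writes when the state differs.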
IMHO we could even introduce some kind of leader election so that only one controller node is responsible for updating record sets. However, it tempts me to make the record-set updater script a dockerized golang app which implements leader election using a k8s client, like cluster-autoscaler and other k8s products do 😃
Hi @tarvip @danielfm,
Regarding the missing health-check thing, just a quick thought, but can we deploy an nginx-ingress controller by default to load-balance against healthy apiserver pods only? nginx-ingress (or even the vanilla open-source edition of Nginx) doesn't support active monitoring, but it does seem to support passive monitoring to remove a failed backend from the list of backends for a while. It isn't a complete solution but better than nothing?
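In vanilla Nginx, that passive monitoring is configured per backend in an upstream block via `max_fails`/`fail_timeout`. A sketch (the addresses are made up):

```nginx
# Passive health checks: after 3 failed attempts within 30s, a backend is
# considered unavailable and skipped for the next 30s. Addresses are made up.
upstream apiservers {
  server 10.0.0.1:443 max_fails=3 fail_timeout=30s;
  server 10.0.0.2:443 max_fails=3 fail_timeout=30s;
}
```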
If it works then why not, but then there is a chicken-and-egg situation when bootstrapping the cluster: kubelet and kube-proxy cannot connect to the API as there is no ingress controller yet, but you cannot deploy the ingress controller because there are no worker nodes available.
@tarvip Thanks! I was blindly thinking about deploying an ingress controller as a static pod per controller node, but obviously it doesn't work, as we have no way to provide an ingress resource for it 😢
I ran into this issue a few weeks ago myself. I changed the CNAME TTL at the time, but it appears that isn't really a fix. If I'm not mistaken, the check/restart script is probably the most reliable solution for the time being. Based on the upstream issue, kubelet keeps the connection open for up to 15 minutes because it never receives any data telling it to close. When the ELB instances are terminated and the old IPs are no longer available, kubelet never receives any data like it normally would if the controller were shut down, the controller-manager process were killed, etc.
DNS round-robin would probably fix most cases, since it gets rid of the ELB, which wasn't sending any data when the connection was closed... But I imagine there is still some possibility that other unexpected connection issues could leave kubelet just sitting there thinking nothing happened. Seems like a lot of work to implement in kube-aws, and it doesn't change the fact that kubelet needs some improvement :).
FYI ALB doesn't seem to fit as a k8s API load balancer. Please see #608 and the corresponding PR.
So we would need to settle on solutions other than ALB anyway.
Hi @roybotnik, thanks for sharing the hard won experience 👍 Good to know the CNAME TTL fix is missing the point.
Seems like a lot of work to implement in kube-aws
Could be 😄
@danielfm @redbaron @camilb @tarvip @roybotnik
In the meantime, would you agree that we should do the following, in order of priority?
I agree, although I'm not using this script at the moment.
At the moment we have only increased --node-monitor-grace-period=90s in kube-controller-manager. I don't like it, as it is not fast enough for detecting whether a node is really down, but I thought that if it happens again I'd set up DNS round-robin; we haven't seen this issue since February.
Also, maybe that script should restart kube-proxy too.
I was thinking about this yesterday, and a simple solution that works around the kubelet connection timeout issue (without caring about ELB specifically) might be to query the API server itself for the node's status.
Since kube-controller-manager won't evict pods until a node has been offline for a while by default (5 minutes, I believe), we can query every 60s or so to see whether the node status is not ready and then restart kubelet. I haven't tested this fully, and the script was kinda thrown together, so it could probably be simplified (I'm not great with bash scripts):
cloud-config-worker:

```yaml
- name: node-status-check.service
  enable: true
  command: start
  runtime: true
  content: |
    [Unit]
    Description=Restart kubelet if node is not ready

    [Service]
    Environment=API_ENDPOINT_DNS_NAME={{.APIEndpoint.DNSName}}
    ExecStart=/opt/bin/node-status-check
    Restart=always
    RestartSec=60
```
/opt/bin/node-status-check:

```bash
#!/usr/bin/env bash

# Get node info from the API server. --fail makes curl exit non-zero on
# HTTP errors (e.g. 401/404), so those are caught below too.
NODE_INFO=$(curl --fail -s \
  --key /etc/kubernetes/ssl/worker-key.pem \
  --cert /etc/kubernetes/ssl/worker.pem --cert-type PEM \
  --cacert /etc/kubernetes/ssl/ca.pem \
  "https://$API_ENDPOINT_DNS_NAME/api/v1/nodes/$HOSTNAME")
if [ $? -ne 0 ]; then
  echo "Unable to query API endpoint. Verify the API endpoint is online and accessible to nodes."
  exit 1
fi

# Select the node conditions from the JSON response
NODE_CONDITIONS=$(jq '.status.conditions[] | {(.type): .status}' -M -c <<< "$NODE_INFO")
if [ $? -ne 0 ]; then
  echo "API returned an invalid JSON response, unable to check status."
  exit 1
fi

echo "Current node conditions:"
echo "$NODE_CONDITIONS"

if grep -q '"Ready":"True"' <<< "$NODE_CONDITIONS"; then
  echo "API server reports node is ready, no actions taken"
else
  echo "API server reports node status not ready, restarting kubelet"
  systemctl restart kubelet.service
fi
```
Hi @danielfm @redbaron @camilb @tarvip @roybotnik, have you encountered this issue since then? I still have had no luck(??) encountering it in my clusters, and therefore am unable to decide whether this is resolved.
It didn't happen to us yet, but the upstream issue has a clean way to reproduce the problem without an ELB or any cloud provider whatsoever.
We haven't seen this since the initial bug report, but we also increased node-monitor-grace-period, so maybe that helps.
Hi @mumoshu, it didn't happen to me since then; it only happened once.
I ran into https://github.com/kubernetes/kubernetes/issues/41916 twice in the last 3 days in my production cluster, with almost 50% of worker nodes transitioning to the NotReady state almost simultaneously on both days, causing a brief downtime in critical services due to Kubernetes' default (and aggressive) eviction policy for failing nodes. I just contacted AWS support to validate the hypothesis of the ELB changing IPs at the time of both incidents, and the answer was yes.
My configuration (multi-node control plane with ELB) matches exactly the one in that issue, and probably most kube-aws users are subject to this.
Has anyone else run into this at some point?