kubernetes-retired / kube-aws

[EOL] A command-line tool to declaratively manage Kubernetes clusters on AWS
Apache License 2.0

master: controller-manager is failing due to timing out while accessing /apis/metrics.k8s.io/v1beta1 after node rolling-update #1039

Closed mumoshu closed 6 years ago

mumoshu commented 6 years ago

Probably after kube-aws render and then kube-aws update, my kube-aws cluster fell into the UPDATE_ROLLBACK_FAILED state.

After logging in and surveying various points, I've verified that controller-manager is logging:

{"log":"E1127 08:25:48.551113       1 controllermanager.go:399] unable to get all supported resources from server: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: an error on the server (\"Error: 'dial tcp 10.3.0.126:443: i/o timeout'\\nTrying to reach: 'https://10.3.0.126:443/apis/metrics.k8s.io/v1beta1'\") has prevented the request from succeeding\n","stream":"stderr","time":"2017-11-27T08:25:48.551355721Z"}

Also, docker logs $apiserver_container_id shows:

I1127 06:54:45.503817       1 autoregister_controller.go:136] Starting autoregister controller
I1127 06:54:45.503867       1 cache.go:32] Waiting for caches to sync for autoregister controller
I1127 06:54:45.604058       1 cache.go:39] Caches are synced for autoregister controller
I1127 06:54:47.975710       1 controller.go:105] OpenAPI AggregationController: Processing item v1beta1.metrics.k8s.io
E1127 06:55:17.976203       1 controller.go:111] loading OpenAPI spec for "v1beta1.metrics.k8s.io" failed with: failed to retrieve openAPI spec, http error: ResponseCode: 503, Body: Error: 'dial tcp 10.3.0.126:443: i/o timeout'
Trying to reach: 'https://10.3.0.126:443/swagger.json', Header: map[]
I1127 06:55:17.976222       1 controller.go:119] OpenAPI AggregationController: action for item v1beta1.metrics.k8s.io: Rate Limited Requeue.
I1127 06:56:17.976460       1 controller.go:105] OpenAPI AggregationController: Processing item v1beta1.metrics.k8s.io
E1127 06:56:47.977085       1 controller.go:111] loading OpenAPI spec for "v1beta1.metrics.k8s.io" failed with: failed to retrieve openAPI spec, http error: ResponseCode: 503, Body: Error: 'dial tcp 10.3.0.126:443: i/o timeout'
Trying to reach: 'https://10.3.0.126:443/swagger.json', Header: map[]
I1127 06:56:47.977106       1 controller.go:119] OpenAPI AggregationController: action for item v1beta1.metrics.k8s.io: Rate Limited Requeue.
I1127 06:58:47.977330       1 controller.go:105] OpenAPI AggregationController: Processing item v1beta1.metrics.k8s.io

cc @camilb

mumoshu commented 6 years ago

Recap:

Context

Failure Scenario

When the sole remaining controller node in your cluster is recreated,

  1. apiserver on the recreated node becomes super slow
  2. controller-manager falls into a crash loop due to apiserver unresponsiveness
  3. the daemonset-backed kube-proxy pod is never created due to the controller-manager failure
  4. as there's no kube-proxy to program iptables, no rules exist, hence no connection from apiserver to metrics-server can be established
  5. apiserver remains super slow because it can't reach metrics-server
  6. loop back to 1

Note that sooner or later this results in cfn-signal timing out, due to one of the following reasons:

Solution

a.k.a. work-around

When the apiservice exists but kube-proxy doesn't, temporarily delete the apiservice (not the metrics-server service) so that apiserver stops trying to communicate with metrics-server. This makes apiserver responsive again, which in turn makes controller-manager happy enough to finally create the daemonset-backed kube-proxy.

Also, in install-kube-system.service, add --request-timeout=1s to every kubectl call so that this work-around can be applied quickly against the slowed apiserver.
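For illustration, the `ks` helper used in the script below is presumably a thin kubectl wrapper scoped to kube-system; a minimal sketch of the idea (the function body is an assumption, only the --request-timeout=1s flag is from this issue):

```shell
# Sketch only -- the real `ks` helper lives in cloud-config-controller.
# Assumed shape: kubectl scoped to the kube-system namespace, with a short
# request timeout so calls against a hung apiserver fail fast instead of
# blocking install-kube-system for minutes.
ks() {
  kubectl --namespace kube-system --request-timeout=1s "$@"
}
```

With a 1-second timeout, each probe against the slowed apiserver returns (or fails) almost immediately, which is what brings the script's runtime down to about a minute.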

mumoshu commented 6 years ago

I did it. With the solution above, the main part of the install-kube-system script now takes only about 1 minute even when the failure condition is present.

      # See https://github.com/kubernetes-incubator/kube-aws/issues/1039#issuecomment-348978375
      if ks get apiservice v1beta1.metrics.k8s.io && ! ps ax | grep '[h]yperkube proxy'; then
        echo "apiserver is up but kube-proxy isn't up. We have likely encountered #1039."
        echo "Temporary deleting the v1beta1.metrics.k8s.io apiservice as a work-around for #1039"
        ks delete apiservice v1beta1.metrics.k8s.io

        echo Waiting until controller-manager stabilizes and it creates a kube-proxy pod.
        until ps ax | grep '[h]yperkube proxy'; do
            echo Sleeping 3 seconds.
            sleep 3
        done
        echo kube-proxy stared. apiserver should be responsive again.
      fi
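As an aside, the `grep '[h]yperkube proxy'` pattern in the script is the usual trick for process checks over `ps` output: the character class still matches the literal text `hyperkube proxy`, but not the pattern string itself, so grep's own command line appearing in the `ps ax` listing can never match and no `grep -v grep` is needed. A self-contained demonstration with fake `ps` output:

```shell
# Two fake `ps ax` lines: the real kube-proxy process, and the grep process
# itself (whose command line contains the literal text '[h]yperkube proxy').
# Only the first line matches, because the pattern requires a bare 'h'
# immediately before 'yperkube', which '[h]yperkube' does not contain.
printf '%s\n' \
  ' 2429 ?     Ssl   0:00 /hyperkube proxy --config=/etc/kubernetes/kube-proxy/kube-proxy-config.yaml' \
  ' 3105 pts/0 S+    0:00 grep [h]yperkube proxy' \
  | grep '[h]yperkube proxy'
# prints only the /hyperkube proxy line
```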
core@ip-10-0-0-156 ~ $ journalctl -u install-kube-system | cat
-- Logs begin at Mon 2017-12-04 16:22:42 UTC, end at Mon 2017-12-04 16:45:19 UTC. --
Dec 04 16:23:09 ip-10-0-0-156.ap-northeast-1.compute.internal systemd[1]: Starting install-kube-system.service...
Dec 04 16:23:09 ip-10-0-0-156.ap-northeast-1.compute.internal bash[1102]: activating
Dec 04 16:23:09 ip-10-0-0-156.ap-northeast-1.compute.internal bash[1102]: waiting until kubelet starts
Dec 04 16:23:19 ip-10-0-0-156.ap-northeast-1.compute.internal bash[1102]: activating
Dec 04 16:23:19 ip-10-0-0-156.ap-northeast-1.compute.internal bash[1102]: waiting until kubelet starts
Dec 04 16:23:29 ip-10-0-0-156.ap-northeast-1.compute.internal bash[1102]: activating
Dec 04 16:23:29 ip-10-0-0-156.ap-northeast-1.compute.internal bash[1102]: waiting until kubelet starts
Dec 04 16:23:39 ip-10-0-0-156.ap-northeast-1.compute.internal bash[1102]: active
Dec 04 16:23:39 ip-10-0-0-156.ap-northeast-1.compute.internal bash[1697]: active
Dec 04 16:23:39 ip-10-0-0-156.ap-northeast-1.compute.internal bash[1701]: waiting until apiserver starts
Dec 04 16:23:49 ip-10-0-0-156.ap-northeast-1.compute.internal bash[1701]: waiting until apiserver starts
Dec 04 16:23:59 ip-10-0-0-156.ap-northeast-1.compute.internal bash[1701]: waiting until apiserver starts
Dec 04 16:24:09 ip-10-0-0-156.ap-northeast-1.compute.internal bash[1701]: waiting until apiserver starts
Dec 04 16:24:19 ip-10-0-0-156.ap-northeast-1.compute.internal bash[1701]: waiting until apiserver starts
Dec 04 16:24:29 ip-10-0-0-156.ap-northeast-1.compute.internal bash[1701]: waiting until apiserver starts
Dec 04 16:24:39 ip-10-0-0-156.ap-northeast-1.compute.internal bash[1701]: waiting until apiserver starts
Dec 04 16:24:49 ip-10-0-0-156.ap-northeast-1.compute.internal bash[1701]: {
Dec 04 16:24:49 ip-10-0-0-156.ap-northeast-1.compute.internal bash[1701]:   "major": "1",
Dec 04 16:24:49 ip-10-0-0-156.ap-northeast-1.compute.internal bash[1701]:   "minor": "8+",
Dec 04 16:24:49 ip-10-0-0-156.ap-northeast-1.compute.internal bash[1701]:   "gitVersion": "v1.8.4+coreos.0",
Dec 04 16:24:49 ip-10-0-0-156.ap-northeast-1.compute.internal bash[1701]:   "gitCommit": "4292f9682595afddbb4f8b1483673449c74f9619",
Dec 04 16:24:49 ip-10-0-0-156.ap-northeast-1.compute.internal bash[1701]:   "gitTreeState": "clean",
Dec 04 16:24:49 ip-10-0-0-156.ap-northeast-1.compute.internal bash[1701]:   "buildDate": "2017-11-21T17:22:25Z",
Dec 04 16:24:49 ip-10-0-0-156.ap-northeast-1.compute.internal bash[1701]:   "goVersion": "go1.8.3",
Dec 04 16:24:49 ip-10-0-0-156.ap-northeast-1.compute.internal bash[1701]:   "compiler": "gc",
Dec 04 16:24:49 ip-10-0-0-156.ap-northeast-1.compute.internal bash[1701]:   "platform": "linux/amd64"
Dec 04 16:24:49 ip-10-0-0-156.ap-northeast-1.compute.internal bash[1701]: }
Dec 04 16:24:56 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: NAME          STATUS    AGE
Dec 04 16:24:56 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: kube-system   Active    11m
Dec 04 16:25:04 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: NAME                     AGE
Dec 04 16:25:04 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: v1beta1.metrics.k8s.io   11m
Dec 04 16:25:05 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: apiserver is up but kube-proxy isn't up. We have likely encountered #1039.
Dec 04 16:25:05 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: Temporary deleting the v1beta1.metrics.k8s.io apiservice as a work-around for #1039
Dec 04 16:25:12 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: apiservice "v1beta1.metrics.k8s.io" deleted
Dec 04 16:25:12 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: Waiting until controller-manager stabilizes and it creates a kube-proxy pod.
Dec 04 16:25:12 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: Sleeping 3 seconds.
Dec 04 16:25:15 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: Sleeping 3 seconds.
Dec 04 16:25:18 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: Sleeping 3 seconds.
Dec 04 16:25:21 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: Sleeping 3 seconds.
Dec 04 16:25:24 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: Sleeping 3 seconds.
Dec 04 16:25:27 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: Sleeping 3 seconds.
Dec 04 16:25:30 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: Sleeping 3 seconds.
Dec 04 16:25:33 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: Sleeping 3 seconds.
Dec 04 16:25:36 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]:  2429 ?        Ssl    0:00 /hyperkube proxy --config=/etc/kubernetes/kube-proxy/kube-proxy-config.yaml
Dec 04 16:25:36 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: kube-proxy stared. apiserver should be responsive again.
Dec 04 16:25:36 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: configmap "calico-config" unchanged
Dec 04 16:25:36 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: secret "calico-etcd-secrets" unchanged
Dec 04 16:25:36 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: daemonset "calico-node" configured
Dec 04 16:25:36 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: deployment "calico-kube-controllers" unchanged
Dec 04 16:25:37 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: Warning: kubectl apply should be used on resource created by either kubectl create --save-config or kubectl apply
Dec 04 16:25:37 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: secret "kubernetes-dashboard-certs" configured
Dec 04 16:25:37 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: configmap "kube-dns" unchanged
Dec 04 16:25:37 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: configmap "kube-proxy-config" unchanged
Dec 04 16:25:38 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: serviceaccount "kube-dns" unchanged
Dec 04 16:25:38 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: serviceaccount "heapster" unchanged
Dec 04 16:25:38 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: serviceaccount "kube-proxy" unchanged
Dec 04 16:25:38 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: serviceaccount "kubernetes-dashboard" unchanged
Dec 04 16:25:38 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: serviceaccount "metrics-server" configured
Dec 04 16:25:38 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: deployment "tiller-deploy" configured
Dec 04 16:25:39 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: service "tiller-deploy" configured
Dec 04 16:25:39 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: daemonset "dnsmasq-node" unchanged
Dec 04 16:25:40 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: deployment "kube-dns" configured
Dec 04 16:25:40 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: deployment "kube-dns-autoscaler" unchanged
Dec 04 16:25:40 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: deployment "kubernetes-dashboard" unchanged
Dec 04 16:25:40 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: deployment "cluster-autoscaler" unchanged
Dec 04 16:25:40 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: deployment "heapster" configured
Dec 04 16:25:40 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: deployment "metrics-server" unchanged
Dec 04 16:25:40 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: daemonset "kube-proxy" unchanged
Dec 04 16:25:41 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: service "kube-dns" unchanged
Dec 04 16:25:41 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: service "heapster" unchanged
Dec 04 16:25:41 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: service "kubernetes-dashboard" unchanged
Dec 04 16:25:41 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: service "metrics-server" unchanged
Dec 04 16:25:41 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: deployment "kube-rescheduler" unchanged
Dec 04 16:25:42 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: clusterrole "kube-aws:node-extensions" configured
Dec 04 16:25:42 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: clusterrole "system:metrics-server" configured
Dec 04 16:25:42 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: clusterrolebinding "kube-aws:admin" configured
Dec 04 16:25:42 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: clusterrolebinding "kube-aws:system-worker" configured
Dec 04 16:25:42 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: clusterrolebinding "kube-aws:node" configured
Dec 04 16:25:42 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: clusterrolebinding "kube-aws:node-proxier" configured
Dec 04 16:25:42 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: clusterrolebinding "kube-aws:node-extensions" configured
Dec 04 16:25:42 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: clusterrolebinding "heapster" configured
Dec 04 16:25:42 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: clusterrolebinding "metrics-server:system:auth-delegator" configured
Dec 04 16:25:42 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: clusterrolebinding "system:metrics-server" configured
Dec 04 16:25:43 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: clusterrolebinding "kubernetes-dashboard" configured
Dec 04 16:25:44 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: role "system:pod-nanny" unchanged
Dec 04 16:25:44 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: role "kubernetes-dashboard-minimal" unchanged
Dec 04 16:25:44 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: rolebinding "heapster-nanny" unchanged
Dec 04 16:25:44 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: rolebinding "kubernetes-dashboard-minimal" unchanged
Dec 04 16:25:44 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: rolebinding "metrics-server-auth-reader" unchanged
Dec 04 16:25:45 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: clusterrole "kube-aws:node-bootstrapper" configured
Dec 04 16:25:45 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: clusterrole "kube-aws:kubelet-certificate-bootstrap" configured
Dec 04 16:25:45 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: clusterrolebinding "kube-aws:node-bootstrapper" configured
Dec 04 16:25:45 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: clusterrolebinding "kube-aws:kubelet-certificate-bootstrap" configured
Dec 04 16:25:46 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: serviceaccount "kube2iam" unchanged
Dec 04 16:25:46 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: clusterrole "kube2iam" configured
Dec 04 16:25:46 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: clusterrolebinding "kube2iam" configured
Dec 04 16:25:47 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: daemonset "kube2iam" unchanged
Dec 04 16:25:47 ip-10-0-0-156.ap-northeast-1.compute.internal systemd[1]: Started install-kube-system.service.

I hope this won't be a long-term solution 😉

Update: Verified that, as expected, the new work-around isn't triggered under normal conditions, e.g. when there is already a running controller node other than the one running install-kube-system.

tsuna commented 6 years ago

Should the apiservice v1beta1.metrics.k8s.io somehow get recreated after applying the workaround?

sohel2020 commented 5 years ago

@mumoshu you have deleted the apiservice v1beta1.metrics.k8s.io, but how did you recreate it after cluster bootstrap?

mumoshu commented 5 years ago

@tsuna @sohel2020 Nope, no manual recreation is needed. The commands in the {{ if .Addons.MetricsServer.Enabled -}} block in the install-kube-system section of cloud-config-controller should recreate it, along with the serviceaccount, daemonset, svc, and so on.
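For reference, the APIService object that those addon manifests recreate looks roughly like the standard metrics-server registration of that era (a sketch, not the exact kube-aws template):

```yaml
# Sketch of the metrics-server APIService registration. Once this object is
# re-applied, the aggregation layer resumes proxying requests for
# /apis/metrics.k8s.io/v1beta1 to the metrics-server service in kube-system.
apiVersion: apiregistration.k8s.io/v1beta1
kind: APIService
metadata:
  name: v1beta1.metrics.k8s.io
spec:
  service:
    name: metrics-server
    namespace: kube-system
  group: metrics.k8s.io
  version: v1beta1
  insecureSkipTLSVerify: true
  groupPriorityMinimum: 100
  versionPriority: 100
```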