Recap:
When the final and single controller node in your cluster is recreated, sooner or later cfn-signal times out due to one of the following reasons:
The solution, a.k.a. the work-around:
When the apiservice exists but kube-proxy doesn't, temporarily delete the apiservice (not the metrics-server service) so that the apiserver stops trying to communicate with metrics-server. This lets the apiserver become responsive again, making controller-manager happy enough to finally create the daemonset-backed kube-proxy.
Also, in install-kube-system.service, add --request-timeout=1s to every kubectl call so that this work-around is applied quickly against the slowed apiserver.
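For illustration, a kubectl call with the timeout added might look like the following; the manifest path here is a placeholder, not the actual contents of install-kube-system:

    kubectl --request-timeout=1s apply -f /srv/kubernetes/manifests/metrics-server.yaml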
I did it. With the solution above, the main part of the install-kube-system script now takes only 1 minute even when the failure condition is present.
# See https://github.com/kubernetes-incubator/kube-aws/issues/1039#issuecomment-348978375
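# Note: `ks` is assumed to be a shorthand wrapper around kubectl (scoped to the kube-system namespace) defined earlier in install-kube-system.
# The [h]yperkube bracket in the grep pattern keeps grep from matching its own entry in the ps output.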
if ks get apiservice v1beta1.metrics.k8s.io && ! ps ax | grep '[h]yperkube proxy'; then
  echo "apiserver is up but kube-proxy isn't up. We have likely encountered #1039."
  echo "Temporarily deleting the v1beta1.metrics.k8s.io apiservice as a work-around for #1039"
  ks delete apiservice v1beta1.metrics.k8s.io
  echo "Waiting until controller-manager stabilizes and it creates a kube-proxy pod."
  until ps ax | grep '[h]yperkube proxy'; do
    echo "Sleeping 3 seconds."
    sleep 3
  done
  echo "kube-proxy started. apiserver should be responsive again."
fi
core@ip-10-0-0-156 ~ $ journalctl -u install-kube-system | cat
-- Logs begin at Mon 2017-12-04 16:22:42 UTC, end at Mon 2017-12-04 16:45:19 UTC. --
Dec 04 16:23:09 ip-10-0-0-156.ap-northeast-1.compute.internal systemd[1]: Starting install-kube-system.service...
Dec 04 16:23:09 ip-10-0-0-156.ap-northeast-1.compute.internal bash[1102]: activating
Dec 04 16:23:09 ip-10-0-0-156.ap-northeast-1.compute.internal bash[1102]: waiting until kubelet starts
Dec 04 16:23:19 ip-10-0-0-156.ap-northeast-1.compute.internal bash[1102]: activating
Dec 04 16:23:19 ip-10-0-0-156.ap-northeast-1.compute.internal bash[1102]: waiting until kubelet starts
Dec 04 16:23:29 ip-10-0-0-156.ap-northeast-1.compute.internal bash[1102]: activating
Dec 04 16:23:29 ip-10-0-0-156.ap-northeast-1.compute.internal bash[1102]: waiting until kubelet starts
Dec 04 16:23:39 ip-10-0-0-156.ap-northeast-1.compute.internal bash[1102]: active
Dec 04 16:23:39 ip-10-0-0-156.ap-northeast-1.compute.internal bash[1697]: active
Dec 04 16:23:39 ip-10-0-0-156.ap-northeast-1.compute.internal bash[1701]: waiting until apiserver starts
Dec 04 16:23:49 ip-10-0-0-156.ap-northeast-1.compute.internal bash[1701]: waiting until apiserver starts
Dec 04 16:23:59 ip-10-0-0-156.ap-northeast-1.compute.internal bash[1701]: waiting until apiserver starts
Dec 04 16:24:09 ip-10-0-0-156.ap-northeast-1.compute.internal bash[1701]: waiting until apiserver starts
Dec 04 16:24:19 ip-10-0-0-156.ap-northeast-1.compute.internal bash[1701]: waiting until apiserver starts
Dec 04 16:24:29 ip-10-0-0-156.ap-northeast-1.compute.internal bash[1701]: waiting until apiserver starts
Dec 04 16:24:39 ip-10-0-0-156.ap-northeast-1.compute.internal bash[1701]: waiting until apiserver starts
Dec 04 16:24:49 ip-10-0-0-156.ap-northeast-1.compute.internal bash[1701]: {
Dec 04 16:24:49 ip-10-0-0-156.ap-northeast-1.compute.internal bash[1701]: "major": "1",
Dec 04 16:24:49 ip-10-0-0-156.ap-northeast-1.compute.internal bash[1701]: "minor": "8+",
Dec 04 16:24:49 ip-10-0-0-156.ap-northeast-1.compute.internal bash[1701]: "gitVersion": "v1.8.4+coreos.0",
Dec 04 16:24:49 ip-10-0-0-156.ap-northeast-1.compute.internal bash[1701]: "gitCommit": "4292f9682595afddbb4f8b1483673449c74f9619",
Dec 04 16:24:49 ip-10-0-0-156.ap-northeast-1.compute.internal bash[1701]: "gitTreeState": "clean",
Dec 04 16:24:49 ip-10-0-0-156.ap-northeast-1.compute.internal bash[1701]: "buildDate": "2017-11-21T17:22:25Z",
Dec 04 16:24:49 ip-10-0-0-156.ap-northeast-1.compute.internal bash[1701]: "goVersion": "go1.8.3",
Dec 04 16:24:49 ip-10-0-0-156.ap-northeast-1.compute.internal bash[1701]: "compiler": "gc",
Dec 04 16:24:49 ip-10-0-0-156.ap-northeast-1.compute.internal bash[1701]: "platform": "linux/amd64"
Dec 04 16:24:49 ip-10-0-0-156.ap-northeast-1.compute.internal bash[1701]: }
Dec 04 16:24:56 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: NAME STATUS AGE
Dec 04 16:24:56 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: kube-system Active 11m
Dec 04 16:25:04 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: NAME AGE
Dec 04 16:25:04 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: v1beta1.metrics.k8s.io 11m
Dec 04 16:25:05 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: apiserver is up but kube-proxy isn't up. We have likely encountered #1039.
Dec 04 16:25:05 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: Temporarily deleting the v1beta1.metrics.k8s.io apiservice as a work-around for #1039
Dec 04 16:25:12 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: apiservice "v1beta1.metrics.k8s.io" deleted
Dec 04 16:25:12 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: Waiting until controller-manager stabilizes and it creates a kube-proxy pod.
Dec 04 16:25:12 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: Sleeping 3 seconds.
Dec 04 16:25:15 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: Sleeping 3 seconds.
Dec 04 16:25:18 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: Sleeping 3 seconds.
Dec 04 16:25:21 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: Sleeping 3 seconds.
Dec 04 16:25:24 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: Sleeping 3 seconds.
Dec 04 16:25:27 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: Sleeping 3 seconds.
Dec 04 16:25:30 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: Sleeping 3 seconds.
Dec 04 16:25:33 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: Sleeping 3 seconds.
Dec 04 16:25:36 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: 2429 ? Ssl 0:00 /hyperkube proxy --config=/etc/kubernetes/kube-proxy/kube-proxy-config.yaml
Dec 04 16:25:36 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: kube-proxy started. apiserver should be responsive again.
Dec 04 16:25:36 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: configmap "calico-config" unchanged
Dec 04 16:25:36 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: secret "calico-etcd-secrets" unchanged
Dec 04 16:25:36 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: daemonset "calico-node" configured
Dec 04 16:25:36 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: deployment "calico-kube-controllers" unchanged
Dec 04 16:25:37 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: Warning: kubectl apply should be used on resource created by either kubectl create --save-config or kubectl apply
Dec 04 16:25:37 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: secret "kubernetes-dashboard-certs" configured
Dec 04 16:25:37 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: configmap "kube-dns" unchanged
Dec 04 16:25:37 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: configmap "kube-proxy-config" unchanged
Dec 04 16:25:38 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: serviceaccount "kube-dns" unchanged
Dec 04 16:25:38 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: serviceaccount "heapster" unchanged
Dec 04 16:25:38 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: serviceaccount "kube-proxy" unchanged
Dec 04 16:25:38 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: serviceaccount "kubernetes-dashboard" unchanged
Dec 04 16:25:38 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: serviceaccount "metrics-server" configured
Dec 04 16:25:38 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: deployment "tiller-deploy" configured
Dec 04 16:25:39 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: service "tiller-deploy" configured
Dec 04 16:25:39 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: daemonset "dnsmasq-node" unchanged
Dec 04 16:25:40 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: deployment "kube-dns" configured
Dec 04 16:25:40 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: deployment "kube-dns-autoscaler" unchanged
Dec 04 16:25:40 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: deployment "kubernetes-dashboard" unchanged
Dec 04 16:25:40 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: deployment "cluster-autoscaler" unchanged
Dec 04 16:25:40 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: deployment "heapster" configured
Dec 04 16:25:40 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: deployment "metrics-server" unchanged
Dec 04 16:25:40 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: daemonset "kube-proxy" unchanged
Dec 04 16:25:41 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: service "kube-dns" unchanged
Dec 04 16:25:41 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: service "heapster" unchanged
Dec 04 16:25:41 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: service "kubernetes-dashboard" unchanged
Dec 04 16:25:41 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: service "metrics-server" unchanged
Dec 04 16:25:41 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: deployment "kube-rescheduler" unchanged
Dec 04 16:25:42 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: clusterrole "kube-aws:node-extensions" configured
Dec 04 16:25:42 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: clusterrole "system:metrics-server" configured
Dec 04 16:25:42 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: clusterrolebinding "kube-aws:admin" configured
Dec 04 16:25:42 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: clusterrolebinding "kube-aws:system-worker" configured
Dec 04 16:25:42 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: clusterrolebinding "kube-aws:node" configured
Dec 04 16:25:42 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: clusterrolebinding "kube-aws:node-proxier" configured
Dec 04 16:25:42 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: clusterrolebinding "kube-aws:node-extensions" configured
Dec 04 16:25:42 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: clusterrolebinding "heapster" configured
Dec 04 16:25:42 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: clusterrolebinding "metrics-server:system:auth-delegator" configured
Dec 04 16:25:42 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: clusterrolebinding "system:metrics-server" configured
Dec 04 16:25:43 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: clusterrolebinding "kubernetes-dashboard" configured
Dec 04 16:25:44 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: role "system:pod-nanny" unchanged
Dec 04 16:25:44 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: role "kubernetes-dashboard-minimal" unchanged
Dec 04 16:25:44 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: rolebinding "heapster-nanny" unchanged
Dec 04 16:25:44 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: rolebinding "kubernetes-dashboard-minimal" unchanged
Dec 04 16:25:44 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: rolebinding "metrics-server-auth-reader" unchanged
Dec 04 16:25:45 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: clusterrole "kube-aws:node-bootstrapper" configured
Dec 04 16:25:45 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: clusterrole "kube-aws:kubelet-certificate-bootstrap" configured
Dec 04 16:25:45 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: clusterrolebinding "kube-aws:node-bootstrapper" configured
Dec 04 16:25:45 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: clusterrolebinding "kube-aws:kubelet-certificate-bootstrap" configured
Dec 04 16:25:46 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: serviceaccount "kube2iam" unchanged
Dec 04 16:25:46 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: clusterrole "kube2iam" configured
Dec 04 16:25:46 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: clusterrolebinding "kube2iam" configured
Dec 04 16:25:47 ip-10-0-0-156.ap-northeast-1.compute.internal retry[2069]: daemonset "kube2iam" unchanged
Dec 04 16:25:47 ip-10-0-0-156.ap-northeast-1.compute.internal systemd[1]: Started install-kube-system.service.
I hope this won't be a long-term solution 😉
Update: Verified that the new work-around isn't triggered under normal conditions, e.g. when there is already a running controller node other than the one running install-kube-system, as expected.
Should the apiservice v1beta1.metrics.k8s.io somehow get recreated after applying the workaround?
@mumoshu you have deleted the apiservice v1beta1.metrics.k8s.io, but how did you recreate it after cluster bootstrap?
@tsuna @sohel2020 Nope. The commands in the {{ if .Addons.MetricsServer.Enabled -}} block in install-kube-system of cloud-config-controller should recreate it, along with the serviceaccount, daemonset, svc, and so on.
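For reference, a minimal sketch of what such a block could look like; the ${mfdir} variable and manifest layout are illustrative, not the exact contents of cloud-config-controller:

    {{ if .Addons.MetricsServer.Enabled -}}
    # re-applies the metrics-server manifests, which include the
    # v1beta1.metrics.k8s.io APIService deleted by the work-around above
    kubectl --request-timeout=1s apply -f "${mfdir}/metrics-server"
    {{- end }}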
Probably after kube-aws render and then kube-aws update, my kube-aws cluster has fallen into the UPDATE_ROLLBACK_FAILED state. Logging in and then surveying various points, I've verified:
- docker ps -a shows that the controller-manager container is failing
- sudo cat /var/log/pods/**/*kube-controller-manager*.log showed that it is failing due to a timeout while accessing the metrics-server endpoint(?)
Also, docker logs $apiserver_container_id shows:
cc @camilb