Closed: jeremyd closed this issue 7 years ago.
Do you have kubectl describe output for a failing pod?
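For example, something along these lines should capture the relevant details (the pod name is just a placeholder):
kubectl --namespace=kube-system get pods
kubectl --namespace=kube-system describe pod <failing-kube-dns-or-dashboard-pod>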
Crap, I just went to find one on my test cluster and the cluster actually had it running... So it must be intermittent behavior. I'll recycle and see if I can collect more info.
@jeremyd Would you mind sharing your cluster.yaml with me so I can reproduce the issue?
kube-dns and kubernetes-dashboard fail to create for me as well on the latest kube-aws 0.9.6-rc1 when I try to use a Kubernetes version older than 1.6. Unlike the situation described by @jeremyd, my pods don't even get created, so I cannot run kubectl describe.
These are the pods I get upon cluster start:
[cbdev@localhost kubernetes]$ kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system kube-apiserver-ip-172-1-200-201.ec2.internal 1/1 Running 0 12m
kube-system kube-apiserver-ip-172-1-201-212.ec2.internal 1/1 Running 0 12m
kube-system kube-apiserver-ip-172-1-202-22.ec2.internal 1/1 Running 0 13m
kube-system kube-controller-manager-ip-172-1-200-201.ec2.internal 1/1 Running 0 13m
kube-system kube-controller-manager-ip-172-1-201-212.ec2.internal 1/1 Running 1 13m
kube-system kube-controller-manager-ip-172-1-202-22.ec2.internal 1/1 Running 1 13m
kube-system kube-proxy-ip-172-1-200-114.ec2.internal 1/1 Running 0 7m
kube-system kube-proxy-ip-172-1-200-147.ec2.internal 1/1 Running 0 6m
kube-system kube-proxy-ip-172-1-200-201.ec2.internal 1/1 Running 0 13m
kube-system kube-proxy-ip-172-1-200-247.ec2.internal 1/1 Running 0 6m
kube-system kube-proxy-ip-172-1-200-252.ec2.internal 1/1 Running 0 6m
kube-system kube-proxy-ip-172-1-200-50.ec2.internal 1/1 Running 0 7m
kube-system kube-proxy-ip-172-1-201-170.ec2.internal 1/1 Running 0 6m
kube-system kube-proxy-ip-172-1-201-212.ec2.internal 1/1 Running 0 13m
kube-system kube-proxy-ip-172-1-201-221.ec2.internal 1/1 Running 0 7m
kube-system kube-proxy-ip-172-1-201-240.ec2.internal 1/1 Running 0 6m
kube-system kube-proxy-ip-172-1-201-38.ec2.internal 1/1 Running 0 6m
kube-system kube-proxy-ip-172-1-201-54.ec2.internal 1/1 Running 0 6m
kube-system kube-proxy-ip-172-1-202-156.ec2.internal 1/1 Running 0 7m
kube-system kube-proxy-ip-172-1-202-169.ec2.internal 1/1 Running 0 7m
kube-system kube-proxy-ip-172-1-202-191.ec2.internal 1/1 Running 0 7m
kube-system kube-proxy-ip-172-1-202-22.ec2.internal 1/1 Running 0 12m
kube-system kube-proxy-ip-172-1-202-65.ec2.internal 1/1 Running 0 7m
kube-system kube-proxy-ip-172-1-202-98.ec2.internal 1/1 Running 0 6m
kube-system kube-scheduler-ip-172-1-200-201.ec2.internal 1/1 Running 0 12m
kube-system kube-scheduler-ip-172-1-201-212.ec2.internal 1/1 Running 1 13m
kube-system kube-scheduler-ip-172-1-202-22.ec2.internal 1/1 Running 1 12m
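In case it helps with debugging why the add-on pods never appear at all, the controller-side objects and recent events can be inspected with plain kubectl (nothing kube-aws specific here):
kubectl --namespace=kube-system get deployments,replicationcontrollers,replicasets
kubectl --namespace=kube-system get events --sort-by='.lastTimestamp'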
I tried 1.5.5 and 1.5.6 and both fail to create the pods. This problem doesn't seem to happen for me when I use Kubernetes 1.6.1.
I have a hard constraint at the moment that prevents me from using Kubernetes 1.6+, so I downgraded to kube-aws 0.9.5, where I can use Kubernetes 1.5.5 with no issues.
This issue is intermittent. We finally got a cluster booted up, DNS stayed up, and it's looking good (after failing three times or so with the same config). I tried 1.6.1 and 1.5.6 and they both had the issue, so I think it's an intermittent problem in both versions. Here's my cluster config @mumoshu: https://gist.github.com/jeremyd/9d6b86aeacbaf7f6044f9497aecfba72
Experienced this twice in a row on master 30dccf548e78fc8e96569fbf680c7098dff3f771. It looks like a couple of separate but possibly interrelated issues. Both the dashboard and some DNS pods are in the CrashLoopBackOff state. I think it's related to the use of apiEndpoints in my case.
✗ stern --namespace=kube-system kubernetes-dashboard
+ kubernetes-dashboard-v1.5.1-99tck › kubernetes-dashboard
kubernetes-dashboard-v1.5.1-99tck kubernetes-dashboard Using HTTP port: 9090
kubernetes-dashboard-v1.5.1-99tck kubernetes-dashboard Creating API server client for https://10.3.0.1:443
kubernetes-dashboard-v1.5.1-99tck kubernetes-dashboard Error while initializing connection to Kubernetes apiserver. This most likely means that the cluster is misconfigured (e.g., it has invalid apiserver certificates or service accounts configuration) or the --apiserver-host param points to a server that does not exist. Reason: the server has asked for the client to provide credentials
kubernetes-dashboard-v1.5.1-99tck kubernetes-dashboard Refer to the troubleshooting guide for more information: https://github.com/kubernetes/dashboard/blob/master/docs/user-guide/troubleshooting.md
✗ stern --namespace=kube-system kube-dns
+ kube-dns-3816048056-lmnrn › dnsmasq-metrics
+ kube-dns-3816048056-lmnrn › healthz
+ kube-dns-3816048056-lmnrn › kubedns
+ kube-dns-autoscaler-1464605019-p8ds3 › autoscaler
[...]
kube-dns-3816048056-lmnrn dnsmasq-metrics ERROR: logging before flag.Parse: I0419 07:36:06.595875 1 main.go:38] dnsmasq-metrics v1.0
kube-dns-3816048056-lmnrn dnsmasq-metrics ERROR: logging before flag.Parse: I0419 07:36:06.596209 1 server.go:44] Starting server (options {DnsMasqPort:53 DnsMasqAddr:127.0.0.1 DnsMasqPollIntervalMs:5000 PrometheusAddr:0.0.0.0 PrometheusPort:10054 PrometheusPath:/metrics PrometheusNamespace:dnsmasq PrometheusSubsystem:cache})
kube-dns-3816048056-lmnrn dnsmasq-metrics ERROR: logging before flag.Parse: W0419 07:36:06.596952 1 server.go:53] Error getting metrics from dnsmasq: read udp 127.0.0.1:36209->127.0.0.1:53: read: connection refused
[...]
kube-dns-3816048056-lmnrn healthz 2017/04/19 07:37:23 Healthz probe on /healthz-kubedns error: Result of last exec: nslookup: can't resolve 'kubernetes.default.svc.cluster.local'
kube-dns-3816048056-lmnrn healthz , at 2017-04-19 07:37:22.670736899 +0000 UTC, error exit status 1
kube-dns-3816048056-lmnrn healthz 2017/04/19 07:37:23 Healthz probe on /healthz-dnsmasq error: Result of last exec: nslookup: can't resolve 'kubernetes.default.svc.cluster.local'
kube-dns-3816048056-lmnrn healthz , at 2017-04-19 07:37:22.670780949 +0000 UTC, error exit status 1
kube-dns-3816048056-lmnrn kubedns I0419 08:08:04.333018 1 dns.go:42] version: v1.6.0-alpha.0.680+3872cb93abf948-dirty
kube-dns-3816048056-lmnrn kubedns I0419 08:08:04.335187 1 server.go:107] Using https://10.3.0.1:443 for kubernetes master, kubernetes API: <nil>
kube-dns-3816048056-lmnrn healthz , at 2017-04-19 07:41:32.670965404 +0000 UTC, error exit status 1
kube-dns-3816048056-lmnrn healthz , at 2017-04-19 07:42:12.671813145 +0000 UTC, error exit status 1
kube-dns-3816048056-lmnrn kubedns I0419 08:08:04.335697 1 server.go:68] Using configuration read from ConfigMap: kube-system:kube-dns
kube-dns-3816048056-lmnrn kubedns I0419 08:08:04.335740 1 server.go:113] FLAG: --alsologtostderr="false"
kube-dns-3816048056-lmnrn kubedns I0419 08:08:04.335760 1 server.go:113] FLAG: --config-map="kube-dns"
kube-dns-3816048056-lmnrn kubedns I0419 08:08:04.335768 1 server.go:113] FLAG: --config-map-namespace="kube-system"
kube-dns-3816048056-lmnrn kubedns I0419 08:08:04.335773 1 server.go:113] FLAG: --dns-bind-address="0.0.0.0"
kube-dns-3816048056-lmnrn kubedns I0419 08:08:04.335778 1 server.go:113] FLAG: --dns-port="10053"
kube-dns-3816048056-lmnrn kubedns I0419 08:08:04.335788 1 server.go:113] FLAG: --domain="cluster.local."
[...]
kube-dns-3816048056-lmnrn kubedns I0419 08:08:04.335798 1 server.go:113] FLAG: --federations=""
kube-dns-3816048056-lmnrn kubedns I0419 08:08:04.335804 1 server.go:113] FLAG: --healthz-port="8081"
kube-dns-3816048056-lmnrn kubedns I0419 08:08:04.335810 1 server.go:113] FLAG: --kube-master-url=""
kube-dns-3816048056-lmnrn kubedns I0419 08:08:04.335818 1 server.go:113] FLAG: --kubecfg-file=""
kube-dns-3816048056-lmnrn kubedns I0419 08:08:04.335823 1 server.go:113] FLAG: --log-backtrace-at=":0"
kube-dns-3816048056-lmnrn kubedns I0419 08:08:04.335831 1 server.go:113] FLAG: --log-dir=""
kube-dns-3816048056-lmnrn kubedns I0419 08:08:04.335838 1 server.go:113] FLAG: --log-flush-frequency="5s"
kube-dns-3816048056-lmnrn kubedns I0419 08:08:04.335846 1 server.go:113] FLAG: --logtostderr="true"
kube-dns-3816048056-lmnrn kubedns I0419 08:08:04.335852 1 server.go:113] FLAG: --stderrthreshold="2"
kube-dns-3816048056-lmnrn kubedns I0419 08:08:04.335859 1 server.go:113] FLAG: --v="2"
kube-dns-3816048056-lmnrn kubedns I0419 08:08:04.335864 1 server.go:113] FLAG: --version="false"
kube-dns-3816048056-lmnrn kubedns I0419 08:08:04.335877 1 server.go:113] FLAG: --vmodule=""
[...]
kube-dns-3816048056-lmnrn dnsmasq-metrics 2017/04/19 07:37:23 Healthz probe on /healthz-kubedns error: Result of last exec: nslookup: can't resolve 'kubernetes.default.svc.cluster.local'
kube-dns-3816048056-lmnrn dnsmasq-metrics 2017/04/19 07:37:23 Healthz probe on /healthz-kubedns error: Result of last exec: nslookup: can't resolve 'kubernetes.default.svc.cluster.local'
kube-dns-3816048056-lmnrn kubedns I0419 08:08:04.336034 1 server.go:155] Starting SkyDNS server (0.0.0.0:10053)
kube-dns-3816048056-lmnrn kubedns I0419 08:08:04.336333 1 server.go:165] Skydns metrics enabled (/metrics:10055)
kube-dns-3816048056-lmnrn kubedns I0419 08:08:04.336349 1 dns.go:144] Starting endpointsController
kube-dns-3816048056-lmnrn kubedns I0419 08:08:04.336354 1 dns.go:147] Starting serviceController
kube-dns-3816048056-lmnrn kubedns I0419 08:08:04.336738 1 logs.go:41] skydns: ready for queries on cluster.local. for tcp://0.0.0.0:10053 [rcache 0]
kube-dns-3816048056-lmnrn kubedns I0419 08:08:04.336760 1 logs.go:41] skydns: ready for queries on cluster.local. for udp://0.0.0.0:10053 [rcache 0]
kube-dns-3816048056-lmnrn kubedns E0419 08:08:04.362716 1 reflector.go:199] pkg/dns/dns.go:145: Failed to list *api.Endpoints: the server has asked for the client to provide credentials (get endpoints)
kube-dns-3816048056-lmnrn kubedns E0419 08:08:04.363270 1 reflector.go:199] pkg/dns/dns.go:148: Failed to list *api.Service: the server has asked for the client to provide credentials (get services)
kube-dns-3816048056-lmnrn kubedns E0419 08:08:04.363661 1 sync.go:105] Error getting ConfigMap kube-system:kube-dns err: the server has asked for the client to provide credentials (get configmaps kube-dns)
kube-dns-3816048056-lmnrn kubedns E0419 08:08:04.363690 1 dns.go:190] Error getting initial ConfigMap: the server has asked for the client to provide credentials (get configmaps kube-dns), starting with default values
kube-dns-3816048056-lmnrn kubedns I0419 08:08:04.363719 1 dns.go:163] Waiting for Kubernetes service
kube-dns-3816048056-lmnrn kubedns I0419 08:08:04.363763 1 dns.go:169] Waiting for service: default/kubernetes
[...]
kube-dns-autoscaler-1464605019-p8ds3 autoscaler I0419 07:36:06.676194 1 autoscaler.go:49] Scaling Namespace: kube-system, Target: deployment/kube-dns, Mode: linear
kube-dns-autoscaler-1464605019-p8ds3 autoscaler E0419 07:36:06.710613 1 autoscaler_server.go:96] Error while getting cluster status: the server has asked for the client to provide credentials (get nodes)
kube-dns-autoscaler-1464605019-p8ds3 autoscaler E0419 07:37:36.678574 1 autoscaler_server.go:96] Error while getting cluster status: the server has asked for the client to provide credentials (get nodes)
I have 2 controllers, 3 etcd nodes, and an apiEndpoints config as per https://github.com/kubernetes-incubator/kube-aws/issues/527#issue-220837426
I've just done a full cluster recreate and it seems OK this time. My errors before came after doing a render of credentials and stack and then an update.
@jeremyd, in your case you said it was a new cluster; just to confirm, was it an entirely new stack created with kube-aws up?
Hello All, I'm also experiencing the same problems and symptoms as @jeremyd after deploying with the latest kube-aws:
kube-aws version
kube-aws version v0.9.6-rc.2
What is strange is that I just deployed a cluster a few days ago and it uses the same CoreOS AMI, kube-aws version, cluster.yaml file, etc., so this is a bit curious.
We are using nodepools, which is new for us with this and the previous release. The cluster I deployed a few days ago was initially deployed as H/A with multi-AZ nodepools. I then updated that cluster by commenting out one of my nodepool sections in cluster.yaml (i.e. disabling the us-west-1b nodepool; a rough sketch of such a section follows the healthy pod listing below) and simply ran kube-aws update on it, and everything worked perfectly, with cfn tearing down the stack, etc. I saw no errors or issues with that cluster, and checking it just now all kube-system pods appear healthy:
DEV / working cluster deployed a few days ago:
export KUBECONFIG=`pwd`/kubeconfig
SHOW_KUBECONFIG
ops-dev.dev.terradatum.com
kk get po
NAME READY STATUS RESTARTS AGE
heapster-v1.3.0-76786035-g270k 2/2 Running 0 1d
kube-apiserver-ip-10-1-17-152.us-west-1.compute.internal 1/1 Running 0 1d
kube-controller-manager-ip-10-1-17-152.us-west-1.compute.internal 1/1 Running 0 1d
kube-dns-3816048056-5kp1m 4/4 Running 0 1d
kube-dns-3816048056-fx3qx 4/4 Running 0 1d
kube-dns-autoscaler-1464605019-nsb7m 1/1 Running 0 1d
kube-proxy-ip-10-1-16-116.us-west-1.compute.internal 1/1 Running 0 1d
kube-proxy-ip-10-1-17-152.us-west-1.compute.internal 1/1 Running 0 1d
kube-scheduler-ip-10-1-17-152.us-west-1.compute.internal 1/1 Running 0 1d
kubernetes-dashboard-v1.5.1-cqr8j 1/1 Running 0 1d
tiller-deploy-1172528075-qr2mv 1/1 Running 0 1d
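For reference, a rough sketch of one of those nodepool sections in cluster.yaml; the field names below are from memory of the kube-aws v0.9.6 schema, so treat them as illustrative rather than exact:
worker:
  nodePools:
    - name: pool-us-west-1a
      count: 2
      instanceType: t2.medium
      subnets:
        - name: workerSubnet1a
    # - name: pool-us-west-1b      # commented out before running kube-aws update
    #   count: 2
    #   instanceType: t2.medium
    #   subnets:
    #     - name: workerSubnet1b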
Latest sick / sadpanda cluster
export KUBECONFIG=`pwd`/kubeconfig
SHOW_KUBECONFIG
aergo-prod.terradatum.com
kk get po
NAME READY STATUS RESTARTS AGE
heapster-v1.3.0-268032834-jc740 2/2 Running 0 2h
kube-apiserver-ip-10-1-14-191.us-west-1.compute.internal 1/1 Running 0 1h
kube-controller-manager-ip-10-1-14-191.us-west-1.compute.internal 1/1 Running 0 1h
kube-dns-3816048056-6c4x5 2/4 CrashLoopBackOff 30 2h
kube-dns-autoscaler-1464605019-q0xlf 1/1 Running 0 2h
kube-proxy-ip-10-1-14-191.us-west-1.compute.internal 1/1 Running 0 1h
kube-proxy-ip-10-1-14-82.us-west-1.compute.internal 1/1 Running 0 53m
kube-proxy-ip-10-1-15-50.us-west-1.compute.internal 1/1 Running 0 53m
kube-scheduler-ip-10-1-14-191.us-west-1.compute.internal 1/1 Running 0 1h
kubernetes-dashboard-v1.5.1-p0m98 0/1 CrashLoopBackOff 15 2h
tiller-deploy-1332173772-bxz2j 0/1 CrashLoopBackOff 13 42m
This cluster deploy started from a blank / clean initial cluster; each deployment attempt since then has used the same directory, with some kube-aws render stack commands as necessary.
Thanks @jeremyd for posting your cluster.yaml file in the above gist, it's very helpful for me/us to see how others are deploying.
I can post our cluster.yaml file, but again note that it was working a few days ago with the previous successful cluster deploy.
Very likely flannel is broken on controller nodes: https://github.com/kubernetes-incubator/kube-aws/pull/558. Not sure if it is related to the problem you see, but it's worth checking.
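A quick way to check that on a controller node, assuming flannel runs as the stock flanneld systemd unit on Container Linux (which is how kube-aws controllers are set up, as far as I know):
ssh core@<controller-ip> 'systemctl status flanneld --no-pager && journalctl -u flanneld --no-pager | tail -n 50'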
Whatever it is, it's only broken when I use 0.9.6+ with Kubernetes 1.5.6 or 1.5.5. I tried again today with 0.9.6-rc.2 and saw the same issue. The issue is not there with 0.9.5.
Thanks @jcastillo-cb / All,
I was trying to run with the most recent versions possible, so I tried bumping back down one RC to the previous release, v0.9.6-rc.1, and as @jcastillo-cb noted above, I can confirm that we're still seeing the problem(s).
kube-aws version
kube-aws version v0.9.6-rc.1
I tried deleting the pods, to no avail...
kk get po
NAME READY STATUS RESTARTS AGE
heapster-v1.3.0-268032834-dnc2g 2/2 Running 0 4h
kube-apiserver-ip-10-1-14-156.us-west-1.compute.internal 1/1 Running 0 1h
kube-controller-manager-ip-10-1-14-156.us-west-1.compute.internal 1/1 Running 0 1h
kube-dns-3816048056-6g9wq 3/4 Running 6 4h
kube-dns-autoscaler-1464605019-20t6q 1/1 Running 0 4h
kube-proxy-ip-10-1-14-156.us-west-1.compute.internal 1/1 Running 0 1h
kube-proxy-ip-10-1-14-31.us-west-1.compute.internal 1/1 Running 0 9m
kube-proxy-ip-10-1-15-87.us-west-1.compute.internal 1/1 Running 0 9m
kube-scheduler-ip-10-1-14-156.us-west-1.compute.internal 1/1 Running 0 1h
kubernetes-dashboard-v1.5.1-nlvb3 0/1 CrashLoopBackOff 2 35s
tiller-deploy-1332173772-smrg1 0/1 CrashLoopBackOff 2 35s
So, I will be reverting to 0.9.5 as well.
Just in case it's helpful, here is a gist with all of the errors/logs for all kube-dns containers:
for i in kubedns dnsmasq dnsmasq-metrics healthz; do kk logs kube-dns-3816048056-6g9wq $i; done
all kube-dns containers error/logs
@cmcconnell1 @jcastillo-cb @jeremyd @c-knowles Sorry for being a bit late to the party. First of all, please let me clarify that starting with v0.9.6-rc.1, k8s older than v1.6.0 is not supported! Would you mind confirming your Kubernetes versions, and if you're using v1.5.x, would using v1.6.x instead solve your issues?
In my case above I'm not customising the kube version so it's the default on master - 1.6.1 I believe.
@cmcconnell1 @jcastillo-cb @jeremyd @c-knowles Would you mind letting me know if you've enabled Calico on your cluster?
Hi @mumoshu / All, I'm not hard-coding any specific Kubernetes (or other dependency) versions in my configurations. I typically take the default and then only pin things, such as an AMI if needed--perhaps when things change and I can no longer deploy a cluster without using an older image, etc.
I tried both of the most recent default kube-aws versions, 0.9.6-rc.{1,2}, and was only able to deploy a healthy / working kube cluster once last week, at 2017-04-17-14:01:53-PDT to be exact. Please note that I haven't been able to deploy another cluster on 0.9.6 since sometime after that.
I create wrapper and helper scripts to assist with things like this, so I can go back to that cluster and extract the details of when it was deployed, etc.
The below details are from when I was able to deploy a cluster on 2017-04-17 with v0.9.6-rc.2
DATE: 2017-04-17-14:01:53-PDT
KUBE-AWS-VERSION: v0.9.6-rc.2
I just validated that when I bump back down to 0.9.5 and strip out some of the newer 0.9.6-specific configuration options, I'm able to deploy a healthy working kube cluster again:
kube-aws version && helm version && kk get po
kube-aws version v0.9.5
Client: &version.Version{SemVer:"v2.3.1", GitCommit:"32562a3040bb5ca690339b9840b6f60f8ce25da4", GitTreeState:"clean"}
Server: &version.Version{SemVer:"v2.3.1", GitCommit:"32562a3040bb5ca690339b9840b6f60f8ce25da4", GitTreeState:"clean"}
NAME READY STATUS RESTARTS AGE
heapster-v1.3.0-567306696-vvlxn 2/2 Running 0 8m
kube-apiserver-ip-10-1-15-60.us-west-1.compute.internal 1/1 Running 0 17m
kube-controller-manager-ip-10-1-15-60.us-west-1.compute.internal 1/1 Running 0 17m
kube-dns-782804071-8pzg8 4/4 Running 0 8m
kube-dns-782804071-jcqgg 4/4 Running 0 17m
kube-dns-autoscaler-2813114833-v9jw3 1/1 Running 0 17m
kube-proxy-ip-10-1-14-164.us-west-1.compute.internal 1/1 Running 0 8m
kube-proxy-ip-10-1-15-201.us-west-1.compute.internal 1/1 Running 0 8m
kube-proxy-ip-10-1-15-60.us-west-1.compute.internal 1/1 Running 0 17m
kube-scheduler-ip-10-1-15-60.us-west-1.compute.internal 1/1 Running 0 16m
kubernetes-dashboard-v1.5.1-b3qbp 1/1 Running 0 17m
tiller-deploy-3067024529-8lfts 1/1 Running 0 1m
@mumoshu I am using flannel. However, as I noted above, my problem only happens if I choose Kubernetes 1.5.x; I don't see the issue with 1.6.x. Based on your comments, it sounds like 1.5.x is not supported in 0.9.6+. This is a bummer since (at the moment) Spinnaker does not support Kubernetes 1.6.x due to the Kubernetes API changes, so I guess I will be stuck using kube-aws 0.9.5 + Kubernetes 1.5.6.
@mumoshu not using Calico. It looks like these are similar but separate issues. I'll open another issue for my scenario once I understand a bit more about what is happening.
@jcastillo-cb Thanks for the confirmation! I haven't tried it myself, but if you're on kube-aws v0.9.6 + k8s v1.5.x + flannel, the only thing that would affect you is that the default etcd version changed from 2.x to 3.x. Would you mind adding the below to your cluster.yaml and seeing if it works?
etcd:
  version: 2
@cmcconnell1 Thanks for the info! Your detailed report, together with my own experience from today, seems to reveal the root cause of our issues.
For me, one of my nodes has been consistently NotReady for days and kube-dns has somehow been stuck in CrashLoopBackOff: https://gist.github.com/mumoshu/2e0b9e6887a85ad542c83c2b9745b9d3 Interestingly, the NotReady node had the private IP of an etcd node ❗️
I guess you've recreated problematic clusters with the same clusterName used before for the successful cluster?
If so, I believe our problems are caused by automatic etcd cluster restoration from etcd snapshots.
Automatic etcd snapshots are persisted under a directory prefixed with the S3 URI specified by --s3-uri.
Can you clear the S3 bucket specified in --s3-uri and try recreating your cluster?
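For example (the bucket and prefix below are placeholders for whatever was passed to --s3-uri; double-check the path before deleting anything):
aws s3 ls --recursive s3://<your-bucket>/<prefix-passed-to-s3-uri>/
aws s3 rm --recursive s3://<your-bucket>/<prefix-passed-to-s3-uri>/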
Kubernetes 1.6.2 was just released with the following fix, which again might have something to do with your kube-dns misbehaviour: https://github.com/kubernetes/kubernetes/pull/44102
TL;DR: the wrong secret might be mounted as a service account key.
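One way to sanity-check which secret a failing pod actually mounted versus what its service account currently references (the pod name is the one from the stern output earlier in the thread; the jsonpath expressions are plain kubectl, nothing kube-aws specific):
kubectl --namespace=kube-system get pod kube-dns-3816048056-lmnrn -o jsonpath='{.spec.serviceAccountName} {.spec.volumes[*].secret.secretName}{"\n"}'
kubectl --namespace=kube-system get sa -o jsonpath='{range .items[*]}{.metadata.name}: {.secrets[*].name}{"\n"}{end}'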
Hey @mumoshu / All,
I tried a couple of slightly different tweaks and re-deploys using the most recent v0.9.6-rc.2, and it seems that, as @mumoshu noted above, cruft in the cluster's S3 bucket was causing blocking issues/failures for me.
Interestingly, when I tried deploying with v0.9.6-rc.2 and configuring etcd to version: 2, I got a strange failure about the controller failing, which resulted in a failed cfn stack, etc.:
Creating AWS resources. Please wait. It may take a few minutes. Error: Error creating cluster: Stack creation failed: CREATE_FAILED : The following resource(s) failed to create: [Controlplane].
the cfn stack failure/barf:
Printing the most recent failed stack events:
CREATE_FAILED AWS::CloudFormation::Stack cmcc-prod The following resource(s) failed to create: [Controlplane].
CREATE_FAILED AWS::CloudFormation::Stack Controlplane Embedded stack arn:aws:cloudformation:us-west-1:076613928512:stack/cmcc-prod-Controlplane-1R446B6CG6PXF/81d62830-2604-11e7-8466-500cf8eeb88d was not successfully created: The following resource(s) failed to create: [Controllers].
Sadly I seem to have missed grabbing the part of the error that meant something; I recall it was due to not being able to maintain the required number of nodes for the ASG. I will post my cluster.yaml file, which is now working for me, in a gist; YMMV.
What is curious is that the above error occurred with the exact same controller ASG settings that I've been using for many previous releases (and the same settings which worked successfully right after this failure once I omitted all etcd version specifications from the cluster.yaml file). It seems likely to be an issue between etcd version 2 and the latest kube version, but I can't really afford to spend more time on this.
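For anyone who hits the same ASG failure, a query along these lines should dump the CREATE_FAILED reasons from the nested stack (stack name taken from the ARN in the cfn output above):
aws cloudformation describe-stack-events --stack-name cmcc-prod-Controlplane-1R446B6CG6PXF --query 'StackEvents[?ResourceStatus==`CREATE_FAILED`].[LogicalResourceId,ResourceStatusReason]' --output table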
So, to summarize: I was not able to deploy with the most recent v0.9.6-rc.2 and etcd version: 2.
But perhaps more importantly, I was able to deploy successfully with the latest v0.9.6-rc.2 by simply commenting out any/all etcd version specifications and starting with a clean S3 bucket and a new cluster name.
What I am not able to explain is that numerous kube-aws destroy and redeploy cycles are now successful. Perhaps something corrupt in the S3 bucket data, once purged, solved the problem; or it was random or intermittent; or something else got fixed that I don't have visibility into? Not sure, but I would recommend that those who have issues do as @mumoshu asked: purge the S3 bucket and redeploy. I'm good with the latest kube-aws version.
On that note, I've deployed multiple times now on the most recent version and cannot reproduce the failure with the crashed pods, etc.
cat cmcc-prod-deployment-stats (shows previous cluster deploys and metadata)
##### Kube Cluster Provision Details #####
KUBE-CLUSTER-NAME: cmcc-prod
DATE: 2017-04-20-12:09:14-PDT
KUBE-AWS-VERSION: v0.9.6-rc.2
AMI-ID:
##########################################
##### Kube Cluster Provision Details #####
KUBE-CLUSTER-NAME: cmcc-prod
DATE: 2017-04-20-13:04:13-PDT
KUBE-AWS-VERSION: v0.9.6-rc.2
AMI-ID:
##########################################
##### Kube Cluster Provision Details #####
KUBE-CLUSTER-NAME: cmcc-prod
DATE: 2017-04-20-16:14:36-PDT
KUBE-AWS-VERSION: v0.9.5
AMI-ID:
##########################################
aws ec2 describe-instances --query 'Reservations[*].Instances[*].[InstanceId,Tags[?Key==`Name`].Value|[0],State.Name]' --output text | column -t | grep -i "${cluster_name}" | grep 'running'
i-0d4bbfc2ef4adaf2c cmcc-prod-control-plane-kube-aws-etcd-1 running
i-0067874089f39e43d cmcc-prod-cmcc-prod-1a-kube-aws-worker running
i-07b1e492f2121adf0 cmcc-prod-control-plane-kube-aws-controller running
i-0d33c9aec724a15cf cmcc-prod-control-plane-kube-aws-etcd-2 running
i-0319789f350feaadc cmcc-prod-cmcc-prod-1b-kube-aws-worker running
i-0998ae82d5f8e3c5a cmcc-prod-control-plane-kube-aws-etcd-0 running
kk version
Client Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.2", GitCommit:"477efc3cbe6a7effca06bd1452fa356e2201e1ee", GitTreeState:"clean", BuildDate:"2017-04-19T22:51:36Z", GoVersion:"go1.8.1", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.1+coreos.0", GitCommit:"9212f77ed8c169a0afa02e58dce87913c6387b3e", GitTreeState:"clean", BuildDate:"2017-04-04T00:32:53Z", GoVersion:"go1.7.5", Compiler:"gc", Platform:"linux/amd64"}
kk get po
NAME READY STATUS RESTARTS AGE
heapster-v1.3.0-76786035-x4s8f 2/2 Running 0 29m
kube-apiserver-ip-10-1-21-46.us-west-1.compute.internal 1/1 Running 0 28m
kube-controller-manager-ip-10-1-21-46.us-west-1.compute.internal 1/1 Running 0 28m
kube-dns-3816048056-ld0mm 4/4 Running 0 29m
kube-dns-3816048056-pb8zf 4/4 Running 0 29m
kube-dns-autoscaler-1464605019-96cvf 1/1 Running 0 29m
kube-proxy-ip-10-1-20-209.us-west-1.compute.internal 1/1 Running 0 22m
kube-proxy-ip-10-1-21-253.us-west-1.compute.internal 1/1 Running 0 22m
kube-proxy-ip-10-1-21-46.us-west-1.compute.internal 1/1 Running 0 30m
kube-scheduler-ip-10-1-21-46.us-west-1.compute.internal 1/1 Running 0 28m
kubernetes-dashboard-v1.5.1-61mnn 1/1 Running 0 29m
tiller-deploy-1332173772-s2ghj 1/1 Running 0 29m
I will scrub my cluster.yaml file and post it in an update as well.
@mumoshu I tried using version: 2 for etcd and the cluster fails to start; the controller's ASG fails to create/signal back.
Using basically the same cluster.yaml, I tried the following scenarios: version: 2 = controller ASG fails to start.
Let me know what logs/info you need to help solve this issue. I'm kinda stuck with 1.5.x at the moment so I can't really use 1.6.x yet.
@jrcast, if you are stuck with 1.5, you might be interested in https://github.com/kubernetes-incubator/kube-aws/issues/599
This looks fixed; hasn't happened to me in a while now. Closing!
I had the same kube-dns problems:
Error getting metrics from dnsmasq: read udp 127.0.0.1:40827->127.0.0.1:53: read: connection refused
A cluster destroy, a purge of the S3 bucket, and a cluster recreation did the trick. The cluster is now finally running. On to the next certificate error...
On the latest master as of right now, my new cluster has kube-dns and kubernetes-dashboard failing to create. The error is that they get access denied when talking to the cluster API; perhaps the service account token was not available when the pod was created, because I can delete the pods and they come up Ready the next time. Same for kubernetes-dashboard. This is easy to reproduce; it happens every time.
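For what it's worth, the workaround is just deleting the stuck pods so they get recreated once the token is present; the label selectors below are the ones used by the stock add-on manifests, so adjust if yours differ:
kubectl --namespace=kube-system delete pods -l k8s-app=kube-dns
kubectl --namespace=kube-system delete pods -l k8s-app=kubernetes-dashboard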