kubernetes-retired / kube-aws

[EOL] A command-line tool to declaratively manage Kubernetes clusters on AWS
Apache License 2.0

kube-dns still not coming up on cluster boot #532

Closed. jeremyd closed this issue 7 years ago.

jeremyd commented 7 years ago

On latest master as of right now, my new cluster has kube-dns and kubernetes-dashboard failing to create. The error is that they get access denied when accessing the cluster API. Perhaps the service account token was not available when the pod was created, because I can delete the pods and they come up ready the next time. Same for kubernetes-dashboard. This is easy to reproduce; it happens every time.
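
A minimal sketch of the workaround described above: delete the failing pods so their controllers recreate them once the token is available. The label selectors are taken from the pod labels shown later in this thread.

```
# Delete the failing kube-system pods; their ReplicationController/ReplicaSet
# will recreate them, picking up the service account token if it now exists.
kubectl delete pod --namespace=kube-system -l k8s-app=kube-dns
kubectl delete pod --namespace=kube-system -l k8s-app=kubernetes-dashboard
```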

redbaron commented 7 years ago

do you have kubectl describe output for a failing pod?

jeremyd commented 7 years ago

Crap, I just went to find one on my test cluster and the cluster actually had it running... so this must be intermittent behavior. I'll recycle and see if I can collect more info.

jeremyd commented 7 years ago

This is strange: it happened again for kube-dashboard, though this time killing the pod didn't let it start. Also, kube-dns looked like it was running but then failed about 10 minutes later with the same problem. Pasting both below.

Describe pod for the dashboard:

```
$ kubectl describe pod kubernetes-dashboard-v1.5.1-0hlz3 --namespace=kube-system
Name:           kubernetes-dashboard-v1.5.1-0hlz3
Namespace:      kube-system
Node:           ip-10-100-102-133.ec2.internal/10.100.102.133
Start Time:     Wed, 12 Apr 2017 15:52:53 -0700
Labels:         k8s-app=kubernetes-dashboard
                kubernetes.io/cluster-service=true
                version=v1.5.1
Status:         Running
IP:             10.2.103.2
Controllers:    ReplicationController/kubernetes-dashboard-v1.5.1
Containers:
  kubernetes-dashboard:
    Container ID:   docker://3dd4dcf255439d41504c13e48e25d10cb70b6e04e673cc33423e4072cae48145
    Image:          gcr.io/google_containers/kubernetes-dashboard-amd64:v1.5.1
    Image ID:       docker-pullable://gcr.io/google_containers/kubernetes-dashboard-amd64@sha256:46a09eb9c611e625e7de3fcf325cf78e629d002e57dc80348e9b0638338206b5
    Port:           9090/TCP
    Limits:
      cpu:      100m
      memory:   50Mi
    Requests:
      cpu:      100m
      memory:   50Mi
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Mon, 01 Jan 0001 00:00:00 +0000
      Finished:     Wed, 12 Apr 2017 15:56:18 -0700
    Ready:          False
    Restart Count:  5
    Liveness:       http-get http://:9090/ delay=30s timeout=30s period=10s #success=1 #failure=3
    Volume Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-1r1j5 (ro)
    Environment Variables:
Conditions:
  Type          Status
  Initialized   True
  Ready         False
  PodScheduled  True
Volumes:
  default-token-1r1j5:
    Type:       Secret (a volume populated by a Secret)
    SecretName: default-token-1r1j5
QoS Class:      Guaranteed
Tolerations:
Events:
  FirstSeen  LastSeen  Count  From  SubObjectPath  Type  Reason  Message
  ---------  --------  -----  ----  -------------  ----  ------  -------
  43m  41m  14  {default-scheduler }  Warning  FailedScheduling  No nodes are available that match all of the following predicates:: PodToleratesNodeTaints (5).
  16m  16m  3   {default-scheduler }  Warning  FailedScheduling  No nodes are available that match all of the following predicates:: PodToleratesNodeTaints (6).
  16m  16m  3   {default-scheduler }  Warning  FailedScheduling  No nodes are available that match all of the following predicates:: PodToleratesNodeTaints (7).
  16m  16m  2   {default-scheduler }  Warning  FailedScheduling  No nodes are available that match all of the following predicates:: PodToleratesNodeTaints (8).
  16m  13m  9   {default-scheduler }  Warning  FailedScheduling  No nodes are available that match all of the following predicates:: PodToleratesNodeTaints (9).
  13m  12m  3   {default-scheduler }  Warning  FailedScheduling  No nodes are available that match all of the following predicates:: PodToleratesNodeTaints (4).
  11m  11m  2   {default-scheduler }  Warning  FailedScheduling  No nodes are available that match all of the following predicates:: PodToleratesNodeTaints (5).
  11m  6m   22  {default-scheduler }  Warning  FailedScheduling  No nodes are available that match all of the following predicates:: PodToleratesNodeTaints (5).
  6m   5m   6   {default-scheduler }  Warning  FailedScheduling  No nodes are available that match all of the following predicates:: PodToleratesNodeTaints (5).
  5m   5m   1   {default-scheduler }  Normal  Scheduled  Successfully assigned kubernetes-dashboard-v1.5.1-0hlz3 to ip-10-100-102-133.ec2.internal
  5m   5m   1   {kubelet ip-10-100-102-133.ec2.internal}  spec.containers{kubernetes-dashboard}  Normal  Pulling  pulling image "gcr.io/google_containers/kubernetes-dashboard-amd64:v1.5.1"
  5m   5m   1   {kubelet ip-10-100-102-133.ec2.internal}  spec.containers{kubernetes-dashboard}  Normal  Pulled  Successfully pulled image "gcr.io/google_containers/kubernetes-dashboard-amd64:v1.5.1"
  5m   5m   1   {kubelet ip-10-100-102-133.ec2.internal}  spec.containers{kubernetes-dashboard}  Normal  Created  Created container with id b843d7ab4637320aca21ab29888eb366bb3067db4bd80e10e81b71c55bd4892f
  5m   5m   1   {kubelet ip-10-100-102-133.ec2.internal}  spec.containers{kubernetes-dashboard}  Normal  Started  Started container with id b843d7ab4637320aca21ab29888eb366bb3067db4bd80e10e81b71c55bd4892f
  5m   5m   1   {kubelet ip-10-100-102-133.ec2.internal}  spec.containers{kubernetes-dashboard}  Normal  Created  Created container with id 46d9252a882a730c44d4c16837611747fe862715b4992b330e2502811da68d2c
  4m   4m   1   {kubelet ip-10-100-102-133.ec2.internal}  spec.containers{kubernetes-dashboard}  Normal  Started  Started container with id 46d9252a882a730c44d4c16837611747fe862715b4992b330e2502811da68d2c
  4m   4m   2   {kubelet ip-10-100-102-133.ec2.internal}  Warning  FailedSync  Error syncing pod, skipping: failed to "StartContainer" for "kubernetes-dashboard" with CrashLoopBackOff: "Back-off 10s restarting failed container=kubernetes-dashboard pod=kubernetes-dashboard-v1.5.1-0hlz3_kube-system(611a2ed0-1fcd-11e7-b42a-0eeacc5dd96c)"
  4m   4m   1   {kubelet ip-10-100-102-133.ec2.internal}  spec.containers{kubernetes-dashboard}  Normal  Created  Created container with id c83d1122efd51d19820716ed05dbe2680c50d158dcb14936fea921ef917c0106
  4m   4m   1   {kubelet ip-10-100-102-133.ec2.internal}  spec.containers{kubernetes-dashboard}  Normal  Started  Started container with id c83d1122efd51d19820716ed05dbe2680c50d158dcb14936fea921ef917c0106
  4m   4m   3   {kubelet ip-10-100-102-133.ec2.internal}  Warning  FailedSync  Error syncing pod, skipping: failed to "StartContainer" for "kubernetes-dashboard" with CrashLoopBackOff: "Back-off 20s restarting failed container=kubernetes-dashboard pod=kubernetes-dashboard-v1.5.1-0hlz3_kube-system(611a2ed0-1fcd-11e7-b42a-0eeacc5dd96c)"
  4m   4m   1   {kubelet ip-10-100-102-133.ec2.internal}  spec.containers{kubernetes-dashboard}  Normal  Started  Started container with id 97df9e423ef1d77d3880fea9d5049c20145cd0f85070c5491fd159066b2a8742
  4m   4m   1   {kubelet ip-10-100-102-133.ec2.internal}  spec.containers{kubernetes-dashboard}  Normal  Created  Created container with id 97df9e423ef1d77d3880fea9d5049c20145cd0f85070c5491fd159066b2a8742
  4m   3m   4   {kubelet ip-10-100-102-133.ec2.internal}  Warning  FailedSync  Error syncing pod, skipping: failed to "StartContainer" for "kubernetes-dashboard" with CrashLoopBackOff: "Back-off 40s restarting failed container=kubernetes-dashboard pod=kubernetes-dashboard-v1.5.1-0hlz3_kube-system(611a2ed0-1fcd-11e7-b42a-0eeacc5dd96c)"
  3m   3m   1   {kubelet ip-10-100-102-133.ec2.internal}  spec.containers{kubernetes-dashboard}  Normal  Started  Started container with id ec52e63917c78c34b10c040a3f57a57201c5f422031e536d77ce0ff670428c2c
  3m   3m   1   {kubelet ip-10-100-102-133.ec2.internal}  spec.containers{kubernetes-dashboard}  Normal  Created  Created container with id ec52e63917c78c34b10c040a3f57a57201c5f422031e536d77ce0ff670428c2c
  3m   2m   6   {kubelet ip-10-100-102-133.ec2.internal}  Warning  FailedSync  Error syncing pod, skipping: failed to "StartContainer" for "kubernetes-dashboard" with CrashLoopBackOff: "Back-off 1m20s restarting failed container=kubernetes-dashboard pod=kubernetes-dashboard-v1.5.1-0hlz3_kube-system(611a2ed0-1fcd-11e7-b42a-0eeacc5dd96c)"
  5m   2m   5   {kubelet ip-10-100-102-133.ec2.internal}  spec.containers{kubernetes-dashboard}  Normal  Pulled  Container image "gcr.io/google_containers/kubernetes-dashboard-amd64:v1.5.1" already present on machine
  2m   2m   1   {kubelet ip-10-100-102-133.ec2.internal}  spec.containers{kubernetes-dashboard}  Normal  Started  Started container with id 3dd4dcf255439d41504c13e48e25d10cb70b6e04e673cc33423e4072cae48145
  2m   2m   1   {kubelet ip-10-100-102-133.ec2.internal}  spec.containers{kubernetes-dashboard}  Normal  Created  Created container with id 3dd4dcf255439d41504c13e48e25d10cb70b6e04e673cc33423e4072cae48145
  4m   1s   26  {kubelet ip-10-100-102-133.ec2.internal}  spec.containers{kubernetes-dashboard}  Warning  BackOff  Back-off restarting failed container
  2m   1s   11  {kubelet ip-10-100-102-133.ec2.internal}  Warning  FailedSync  Error syncing pod, skipping: failed to "StartContainer" for "kubernetes-dashboard" with CrashLoopBackOff: "Back-off 2m40s restarting failed container=kubernetes-dashboard pod=kubernetes-dashboard-v1.5.1-0hlz3_kube-system(611a2ed0-1fcd-11e7-b42a-0eeacc5dd96c)"
```

Pod logs:

```
$ kubectl logs kubernetes-dashboard-v1.5.1-0hlz3 --namespace=kube-system
Using HTTP port: 9090
Creating API server client for https://10.3.0.1:443
Error while initializing connection to Kubernetes apiserver. This most likely means that the cluster is misconfigured (e.g., it has invalid apiserver certificates or service accounts configuration) or the --apiserver-host param points to a server that does not exist. Reason: the server has asked for the client to provide credentials
Refer to the troubleshooting guide for more information: https://github.com/kubernetes/dashboard/blob/master/docs/user-guide/troubleshooting.md
```

kubectl get secrets --namespace=kube-system:

```
NAME                  TYPE                                  DATA      AGE
default-token-1r1j5   kubernetes.io/service-account-token   3         2d
```

DNS pod logs:

```
E0412 23:04:17.907621       1 reflector.go:199] pkg/dns/dns.go:148: Failed to list *api.Service: the server has asked for the client to provide credentials (get services)
E0412 23:04:18.907888       1 reflector.go:199] pkg/dns/dns.go:145: Failed to list *api.Endpoints: the server has asked for the client to provide credentials (get endpoints)
E0412 23:04:18.909538       1 reflector.go:199] pkg/dns/dns.go:148: Failed to list *api.Service: the server has asked for the client to provide credentials (get services)
E0412 23:04:18.909583       1 reflector.go:199] pkg/dns/config/sync.go:114: Failed to list *api.ConfigMap: the server has asked for the client to provide credentials (get configmaps)
E0412 23:04:19.909981       1 reflector.go:199] pkg/dns/dns.go:145: Failed to list *api.Endpoints: the server has asked for the client to provide credentials (get endpoints)
E0412 23:04:19.911762       1 reflector.go:199] pkg/dns/dns.go:148: Failed to list *api.Service: the server has asked for the client to provide credentials (get services)
E0412 23:04:19.911825       1 reflector.go:199] pkg/dns/config/sync.go:114: Failed to list *api.ConfigMap: the server has asked for the client to provide credentials (get configmaps)
E0412 23:04:20.911938       1 reflector.go:199] pkg/dns/dns.go:145: Failed to list *api.Endpoints: the server has asked for the client to provide credentials (get endpoints)
E0412 23:04:20.913616       1 reflector.go:199] pkg/dns/dns.go:148: Failed to list *api.Service: the server has asked for the client to provide credentials (get services)
E0412 23:04:20.913634       1 reflector.go:199] pkg/dns/config/sync.go:114: Failed to list *api.ConfigMap: the server has asked for the client to provide credentials (get configmaps)
E0412 23:04:21.913755       1 reflector.go:199] pkg/dns/dns.go:145: Failed to list *api.Endpoints: the server has asked for the client to provide credentials (get endpoints)
E0412 23:04:21.915503       1 reflector.go:199] pkg/dns/dns.go:148: Failed to list *api.Service: the server has asked for the client to provide credentials (get services)
E0412 23:04:21.915544       1 reflector.go:199] pkg/dns/config/sync.go:114: Failed to list *api.ConfigMap: the server has asked for the client to provide credentials (get configmaps)
E0412 23:04:22.915740       1 reflector.go:199] pkg/dns/dns.go:145: Failed to list *api.Endpoints: the server has asked for the client to provide credentials (get endpoints)
E0412 23:04:22.917447       1 reflector.go:199] pkg/dns/dns.go:148: Failed to list *api.Service: the server has asked for the client to provide credentials (get services)
E0412 23:04:22.917711       1 reflector.go:199] pkg/dns/config/sync.go:114: Failed to list *api.ConfigMap: the server has asked for the client to provide credentials (get configmaps)
```

Describe pod for kube-dns:

```
$ kubectl describe pod kube-dns-3816048056-xcvn4 --namespace=kube-system
Name:           kube-dns-3816048056-xcvn4
Namespace:      kube-system
Node:           ip-10-100-100-205.ec2.internal/10.100.100.205
Start Time:     Wed, 12 Apr 2017 15:52:53 -0700
Labels:         k8s-app=kube-dns
                pod-template-hash=3816048056
Status:         Running
IP:             10.2.27.2
Controllers:    ReplicaSet/kube-dns-3816048056
Containers:
  kubedns:
    Container ID:   docker://5b92587d9ad9017a7630d14fd0eb8e2634a3124e53e3282a904bafa966b7f770
    Image:          gcr.io/google_containers/kubedns-amd64:1.9
    Image ID:       docker-pullable://gcr.io/google_containers/kubedns-amd64@sha256:3d3d67f519300af646e00adcf860b2f380d35ed4364e550d74002dadace20ead
    Ports:          10053/UDP, 10053/TCP, 10055/TCP
    Args:
      --domain=cluster.local.
      --dns-port=10053
      --config-map=kube-dns
      --v=2
    Limits:
      memory:   170Mi
    Requests:
      cpu:      100m
      memory:   70Mi
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    137
      Started:      Mon, 01 Jan 0001 00:00:00 +0000
      Finished:     Wed, 12 Apr 2017 16:06:43 -0700
    Ready:          False
    Restart Count:  6
    Liveness:       http-get http://:8080/healthz-kubedns delay=60s timeout=5s period=10s #success=1 #failure=5
    Readiness:      http-get http://:8081/readiness delay=3s timeout=5s period=10s #success=1 #failure=3
    Volume Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-1r1j5 (ro)
    Environment Variables:
      PROMETHEUS_PORT:  10055
  dnsmasq:
    Container ID:   docker://e61f45aff23e5423f671f6331bfe771488da16e470f28072ebc76a09b72ec156
    Image:          gcr.io/google_containers/kube-dnsmasq-amd64:1.4
    Image ID:       docker-pullable://gcr.io/google_containers/kube-dnsmasq-amd64@sha256:a722df15c0cf87779aad8ba2468cf072dd208cb5d7cfcaedd90e66b3da9ea9d2
    Ports:          53/UDP, 53/TCP
    Args:
      --cache-size=1000
      --no-resolv
      --server=127.0.0.1#10053
      --log-facility=-
    Requests:
      cpu:      150m
      memory:   10Mi
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    137
      Started:      Mon, 01 Jan 0001 00:00:00 +0000
      Finished:     Wed, 12 Apr 2017 16:06:03 -0700
    Ready:          False
    Restart Count:  6
    Liveness:       http-get http://:8080/healthz-dnsmasq delay=60s timeout=5s period=10s #success=1 #failure=5
    Volume Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-1r1j5 (ro)
    Environment Variables:
  dnsmasq-metrics:
    Container ID:   docker://fb164bfa3b8e637d7d5412372d4aa67e85e27b468ea6da1530ca66ea48fc4037
    Image:          gcr.io/google_containers/dnsmasq-metrics-amd64:1.0
    Image ID:       docker-pullable://gcr.io/google_containers/dnsmasq-metrics-amd64@sha256:4063e37fd9b2fd91b7cc5392ed32b30b9c8162c4c7ad2787624306fc133e80a9
    Port:           10054/TCP
    Args:
      --v=2
      --logtostderr
    Requests:
      memory:   10Mi
    State:          Running
      Started:      Wed, 12 Apr 2017 15:53:17 -0700
    Ready:          True
    Restart Count:  0
    Liveness:       http-get http://:10054/metrics delay=60s timeout=5s period=10s #success=1 #failure=5
    Volume Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-1r1j5 (ro)
    Environment Variables:
  healthz:
    Container ID:   docker://74d2395dc35dfe42c2a83f52958a6f373c445d7c57e3e646449ffbfd33272870
    Image:          gcr.io/google_containers/exechealthz-amd64:1.2
    Image ID:       docker-pullable://gcr.io/google_containers/exechealthz-amd64@sha256:503e158c3f65ed7399f54010571c7c977ade7fe59010695f48d9650d83488c0a
    Port:           8080/TCP
    Args:
      --cmd=nslookup kubernetes.default.svc.cluster.local 127.0.0.1 >/dev/null
      --url=/healthz-dnsmasq
      --cmd=nslookup kubernetes.default.svc.cluster.local 127.0.0.1:10053 >/dev/null
      --url=/healthz-kubedns
      --port=8080
      --quiet
    Limits:
      memory:   50Mi
    Requests:
      cpu:      10m
      memory:   50Mi
    State:          Running
      Started:      Wed, 12 Apr 2017 15:53:18 -0700
    Ready:          True
    Restart Count:  0
    Volume Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-1r1j5 (ro)
    Environment Variables:
Conditions:
  Type          Status
  Initialized   True
  Ready         False
  PodScheduled  True
Volumes:
  default-token-1r1j5:
    Type:       Secret (a volume populated by a Secret)
    SecretName: default-token-1r1j5
QoS Class:      Burstable
Tolerations:
Events:
  FirstSeen  LastSeen  Count  From  SubObjectPath  Type  Reason  Message
  ---------  --------  -----  ----  -------------  ----  ------  -------
  54m  51m  14  {default-scheduler }  Warning  FailedScheduling  No nodes are available that match all of the following predicates:: PodToleratesNodeTaints (5).
  26m  26m  3   {default-scheduler }  Warning  FailedScheduling  No nodes are available that match all of the following predicates:: PodToleratesNodeTaints (6).
  26m  26m  3   {default-scheduler }  Warning  FailedScheduling  No nodes are available that match all of the following predicates:: PodToleratesNodeTaints (7).
  26m  26m  2   {default-scheduler }  Warning  FailedScheduling  No nodes are available that match all of the following predicates:: PodToleratesNodeTaints (8).
  26m  24m  9   {default-scheduler }  Warning  FailedScheduling  No nodes are available that match all of the following predicates:: PodToleratesNodeTaints (9).
  23m  22m  3   {default-scheduler }  Warning  FailedScheduling  No nodes are available that match all of the following predicates:: PodToleratesNodeTaints (4).
  21m  21m  2   {default-scheduler }  Warning  FailedScheduling  No nodes are available that match all of the following predicates:: PodToleratesNodeTaints (5).
  21m  16m  21  {default-scheduler }  Warning  FailedScheduling  No nodes are available that match all of the following predicates:: PodToleratesNodeTaints (5).
  16m  16m  6   {default-scheduler }  Warning  FailedScheduling  No nodes are available that match all of the following predicates:: PodToleratesNodeTaints (5).
  15m  15m  1   {default-scheduler }  Normal  Scheduled  Successfully assigned kube-dns-3816048056-xcvn4 to ip-10-100-100-205.ec2.internal
  15m  15m  1   {kubelet ip-10-100-100-205.ec2.internal}  spec.containers{kubedns}  Normal  Pulling  pulling image "gcr.io/google_containers/kubedns-amd64:1.9"
  15m  15m  1   {kubelet ip-10-100-100-205.ec2.internal}  spec.containers{kubedns}  Normal  Pulled  Successfully pulled image "gcr.io/google_containers/kubedns-amd64:1.9"
  15m  15m  1   {kubelet ip-10-100-100-205.ec2.internal}  spec.containers{kubedns}  Normal  Created  Created container with id 66e0665c3ec65a15c0000731bc6f5a307f6ab212937fc914e8a9ee09d7c5437f
  15m  15m  1   {kubelet ip-10-100-100-205.ec2.internal}  spec.containers{dnsmasq}  Normal  Pulling  pulling image "gcr.io/google_containers/kube-dnsmasq-amd64:1.4"
  15m  15m  1   {kubelet ip-10-100-100-205.ec2.internal}  spec.containers{kubedns}  Normal  Started  Started container with id 66e0665c3ec65a15c0000731bc6f5a307f6ab212937fc914e8a9ee09d7c5437f
  15m  15m  1   {kubelet ip-10-100-100-205.ec2.internal}  spec.containers{dnsmasq}  Normal  Pulled  Successfully pulled image "gcr.io/google_containers/kube-dnsmasq-amd64:1.4"
  15m  15m  1   {kubelet ip-10-100-100-205.ec2.internal}  spec.containers{dnsmasq}  Normal  Created  Created container with id 8e520f10279dfc40511dc53de8c72f506d995bf5848b4b2fbe859ea3dc0f117d
  15m  15m  1   {kubelet ip-10-100-100-205.ec2.internal}  spec.containers{dnsmasq}  Normal  Started  Started container with id 8e520f10279dfc40511dc53de8c72f506d995bf5848b4b2fbe859ea3dc0f117d
  15m  15m  1   {kubelet ip-10-100-100-205.ec2.internal}  spec.containers{dnsmasq-metrics}  Normal  Pulling  pulling image "gcr.io/google_containers/dnsmasq-metrics-amd64:1.0"
  15m  15m  1   {kubelet ip-10-100-100-205.ec2.internal}  spec.containers{dnsmasq-metrics}  Normal  Pulled  Successfully pulled image "gcr.io/google_containers/dnsmasq-metrics-amd64:1.0"
  15m  15m  1   {kubelet ip-10-100-100-205.ec2.internal}  spec.containers{dnsmasq-metrics}  Normal  Created  Created container with id fb164bfa3b8e637d7d5412372d4aa67e85e27b468ea6da1530ca66ea48fc4037
  15m  15m  1   {kubelet ip-10-100-100-205.ec2.internal}  spec.containers{dnsmasq-metrics}  Normal  Started  Started container with id fb164bfa3b8e637d7d5412372d4aa67e85e27b468ea6da1530ca66ea48fc4037
  15m  15m  1   {kubelet ip-10-100-100-205.ec2.internal}  spec.containers{healthz}  Normal  Pulling  pulling image "gcr.io/google_containers/exechealthz-amd64:1.2"
  15m  15m  1   {kubelet ip-10-100-100-205.ec2.internal}  spec.containers{healthz}  Normal  Pulled  Successfully pulled image "gcr.io/google_containers/exechealthz-amd64:1.2"
  15m  15m  1   {kubelet ip-10-100-100-205.ec2.internal}  spec.containers{healthz}  Normal  Created  Created container with id 74d2395dc35dfe42c2a83f52958a6f373c445d7c57e3e646449ffbfd33272870
  15m  15m  1   {kubelet ip-10-100-100-205.ec2.internal}  spec.containers{healthz}  Normal  Started  Started container with id 74d2395dc35dfe42c2a83f52958a6f373c445d7c57e3e646449ffbfd33272870
  12m  12m  1   {kubelet ip-10-100-100-205.ec2.internal}  spec.containers{dnsmasq}  Normal  Killing  Killing container with id docker://8e520f10279dfc40511dc53de8c72f506d995bf5848b4b2fbe859ea3dc0f117d:pod "kube-dns-3816048056-xcvn4_kube-system(6121f1a4-1fcd-11e7-b42a-0eeacc5dd96c)" container "dnsmasq" is unhealthy, it will be killed and re-created.
  12m  12m  1   {kubelet ip-10-100-100-205.ec2.internal}  spec.containers{dnsmasq}  Normal  Created  Created container with id 3b96f208e963e25469d80d2068f4485ca20f67707db69519b7407e4a1385e55d
  12m  12m  1   {kubelet ip-10-100-100-205.ec2.internal}  spec.containers{dnsmasq}  Normal  Started  Started container with id 3b96f208e963e25469d80d2068f4485ca20f67707db69519b7407e4a1385e55d
  12m  12m  1   {kubelet ip-10-100-100-205.ec2.internal}  spec.containers{kubedns}  Normal  Created  Created container with id 2a29000b8e13521e5c100164098749430bb843d1e855e25050b08bdaeb37c565
  12m  12m  1   {kubelet ip-10-100-100-205.ec2.internal}  spec.containers{kubedns}  Normal  Started  Started container with id 2a29000b8e13521e5c100164098749430bb843d1e855e25050b08bdaeb37c565
  12m  12m  1   {kubelet ip-10-100-100-205.ec2.internal}  spec.containers{kubedns}  Normal  Killing  Killing container with id docker://66e0665c3ec65a15c0000731bc6f5a307f6ab212937fc914e8a9ee09d7c5437f:pod "kube-dns-3816048056-xcvn4_kube-system(6121f1a4-1fcd-11e7-b42a-0eeacc5dd96c)" container "kubedns" is unhealthy, it will be killed and re-created.
  10m  10m  1   {kubelet ip-10-100-100-205.ec2.internal}  spec.containers{dnsmasq}  Normal  Killing  Killing container with id docker://3b96f208e963e25469d80d2068f4485ca20f67707db69519b7407e4a1385e55d:pod "kube-dns-3816048056-xcvn4_kube-system(6121f1a4-1fcd-11e7-b42a-0eeacc5dd96c)" container "dnsmasq" is unhealthy, it will be killed and re-created.
  10m  10m  1   {kubelet ip-10-100-100-205.ec2.internal}  spec.containers{dnsmasq}  Normal  Created  Created container with id d6071e182d09b956eff01c80e07eeb3a7d2e6faa8eb433391eef6416cb480310
  10m  10m  1   {kubelet ip-10-100-100-205.ec2.internal}  spec.containers{kubedns}  Normal  Killing  Killing container with id docker://2a29000b8e13521e5c100164098749430bb843d1e855e25050b08bdaeb37c565:pod "kube-dns-3816048056-xcvn4_kube-system(6121f1a4-1fcd-11e7-b42a-0eeacc5dd96c)" container "kubedns" is unhealthy, it will be killed and re-created.
  10m  10m  1   {kubelet ip-10-100-100-205.ec2.internal}  spec.containers{kubedns}  Normal  Created  Created container with id 6c4c5a7fe8e3cf5ed108d033edac5d55eae892f53b21980259fcda60bfa7000f
  10m  10m  1   {kubelet ip-10-100-100-205.ec2.internal}  spec.containers{kubedns}  Normal  Started  Started container with id 6c4c5a7fe8e3cf5ed108d033edac5d55eae892f53b21980259fcda60bfa7000f
  10m  10m  1   {kubelet ip-10-100-100-205.ec2.internal}  spec.containers{dnsmasq}  Normal  Started  Started container with id d6071e182d09b956eff01c80e07eeb3a7d2e6faa8eb433391eef6416cb480310
  8m   8m   1   {kubelet ip-10-100-100-205.ec2.internal}  spec.containers{dnsmasq}  Normal  Killing  Killing container with id docker://d6071e182d09b956eff01c80e07eeb3a7d2e6faa8eb433391eef6416cb480310:pod "kube-dns-3816048056-xcvn4_kube-system(6121f1a4-1fcd-11e7-b42a-0eeacc5dd96c)" container "dnsmasq" is unhealthy, it will be killed and re-created.
  8m   8m   1   {kubelet ip-10-100-100-205.ec2.internal}  spec.containers{dnsmasq}  Normal  Started  Started container with id c81d038dee896a6c2642b8dd7d315f0327880fec47664127e47c2703f3ef9268
  8m   8m   1   {kubelet ip-10-100-100-205.ec2.internal}  spec.containers{kubedns}  Normal  Killing  Killing container with id docker://6c4c5a7fe8e3cf5ed108d033edac5d55eae892f53b21980259fcda60bfa7000f:pod "kube-dns-3816048056-xcvn4_kube-system(6121f1a4-1fcd-11e7-b42a-0eeacc5dd96c)" container "kubedns" is unhealthy, it will be killed and re-created.
  8m   8m   1   {kubelet ip-10-100-100-205.ec2.internal}  spec.containers{dnsmasq}  Normal  Created  Created container with id c81d038dee896a6c2642b8dd7d315f0327880fec47664127e47c2703f3ef9268
  6m   6m   1   {kubelet ip-10-100-100-205.ec2.internal}  spec.containers{dnsmasq}  Normal  Killing  Killing container with id docker://c81d038dee896a6c2642b8dd7d315f0327880fec47664127e47c2703f3ef9268:pod "kube-dns-3816048056-xcvn4_kube-system(6121f1a4-1fcd-11e7-b42a-0eeacc5dd96c)" container "dnsmasq" is unhealthy, it will be killed and re-created.
  6m   6m   1   {kubelet ip-10-100-100-205.ec2.internal}  spec.containers{kubedns}  Normal  Killing  Killing container with id docker://10e9884bdb7f186e49c6dc274172347c414ae84060479f4d1d0b436ec40ef25c:pod "kube-dns-3816048056-xcvn4_kube-system(6121f1a4-1fcd-11e7-b42a-0eeacc5dd96c)" container "kubedns" is unhealthy, it will be killed and re-created.
  5m   5m   1   {kubelet ip-10-100-100-205.ec2.internal}  spec.containers{dnsmasq}  Normal  Killing  Killing container with id docker://6536dfd1e93254093624fe764552d024f4ada98a9f3fed0d65e12cb83d756e4d:pod "kube-dns-3816048056-xcvn4_kube-system(6121f1a4-1fcd-11e7-b42a-0eeacc5dd96c)" container "dnsmasq" is unhealthy, it will be killed and re-created.
  12m  3m   6   {kubelet ip-10-100-100-205.ec2.internal}  spec.containers{dnsmasq}  Normal  Pulled  Container image "gcr.io/google_containers/kube-dnsmasq-amd64:1.4" already present on machine
  8m   3m   7   {kubelet ip-10-100-100-205.ec2.internal}  spec.containers{kubedns}  Normal  Created  (events with common reason combined)
  8m   3m   7   {kubelet ip-10-100-100-205.ec2.internal}  spec.containers{kubedns}  Normal  Started  (events with common reason combined)
  12m  3m   6   {kubelet ip-10-100-100-205.ec2.internal}  spec.containers{kubedns}  Normal  Pulled  Container image "gcr.io/google_containers/kubedns-amd64:1.9" already present on machine
  2m   2m   2   {kubelet ip-10-100-100-205.ec2.internal}  Warning  FailedSync  Error syncing pod, skipping: failed to "StartContainer" for "dnsmasq" with CrashLoopBackOff: "Back-off 2m40s restarting failed container=dnsmasq pod=kube-dns-3816048056-xcvn4_kube-system(6121f1a4-1fcd-11e7-b42a-0eeacc5dd96c)"
  1m   1m   1   {kubelet ip-10-100-100-205.ec2.internal}  Warning  FailedSync  Error syncing pod, skipping: [failed to "StartContainer" for "dnsmasq" with CrashLoopBackOff: "Back-off 2m40s restarting failed container=dnsmasq pod=kube-dns-3816048056-xcvn4_kube-system(6121f1a4-1fcd-11e7-b42a-0eeacc5dd96c)" , failed to "StartContainer" for "kubedns" with CrashLoopBackOff: "Back-off 2m40s restarting failed container=kubedns pod=kube-dns-3816048056-xcvn4_kube-system(6121f1a4-1fcd-11e7-b42a-0eeacc5dd96c)" ]
  4m   1m   5   {kubelet ip-10-100-100-205.ec2.internal}  spec.containers{kubedns}  Normal  Killing  (events with common reason combined)
  2m   1s   22  {kubelet ip-10-100-100-205.ec2.internal}  spec.containers{dnsmasq}  Warning  BackOff  Back-off restarting failed container
  1m   1s   9   {kubelet ip-10-100-100-205.ec2.internal}  Warning  FailedSync  Error syncing pod, skipping: [failed to "StartContainer" for "kubedns" with CrashLoopBackOff: "Back-off 2m40s restarting failed container=kubedns pod=kube-dns-3816048056-xcvn4_kube-system(6121f1a4-1fcd-11e7-b42a-0eeacc5dd96c)" , failed to "StartContainer" for "dnsmasq" with CrashLoopBackOff: "Back-off 2m40s restarting failed container=dnsmasq pod=kube-dns-3816048056-xcvn4_kube-system(6121f1a4-1fcd-11e7-b42a-0eeacc5dd96c)" ]
```
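
Both sets of logs show the apiserver rejecting the pods' credentials ("the server has asked for the client to provide credentials", i.e. a 401). A hedged way to check whether the default token itself is accepted, run from a machine with an admin kubeconfig; the secret name comes from the output above, and the apiserver address is a placeholder, since the in-cluster IP https://10.3.0.1:443 is not reachable from outside:

```
# Extract the token from the ServiceAccount secret (use `base64 -D` on macOS).
TOKEN=$(kubectl get secret default-token-1r1j5 --namespace=kube-system \
  -o jsonpath='{.data.token}' | base64 -d)
# Placeholder; substitute your cluster's external API endpoint.
APISERVER=https://my-cluster-apiserver.example.com:443
curl -sk -H "Authorization: Bearer $TOKEN" \
  "$APISERVER/api/v1/namespaces/kube-system/services" | head -n 20
# An "Unauthorized" response here reproduces the failure seen in the pod logs.
```
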
mumoshu commented 7 years ago

@jeremyd Would you mind sharing your cluster.yaml with me so I can reproduce the issue?

jrcast commented 7 years ago

kube-dns and kubernetes-dashboard also fail to create for me on the latest kube-aws 0.9.6-rc1 when I use a Kubernetes version older than 1.6. Unlike the situation described by @jeremyd, my pods don't even get created, so I cannot run kubectl describe.

These are the pods I get upon cluster start:

[cbdev@localhost kubernetes]$ kubectl   get pods --all-namespaces
NAMESPACE     NAME                                                    READY     STATUS    RESTARTS   AGE
kube-system   kube-apiserver-ip-172-1-200-201.ec2.internal            1/1       Running   0          12m
kube-system   kube-apiserver-ip-172-1-201-212.ec2.internal            1/1       Running   0          12m
kube-system   kube-apiserver-ip-172-1-202-22.ec2.internal             1/1       Running   0          13m
kube-system   kube-controller-manager-ip-172-1-200-201.ec2.internal   1/1       Running   0          13m
kube-system   kube-controller-manager-ip-172-1-201-212.ec2.internal   1/1       Running   1          13m
kube-system   kube-controller-manager-ip-172-1-202-22.ec2.internal    1/1       Running   1          13m
kube-system   kube-proxy-ip-172-1-200-114.ec2.internal                1/1       Running   0          7m
kube-system   kube-proxy-ip-172-1-200-147.ec2.internal                1/1       Running   0          6m
kube-system   kube-proxy-ip-172-1-200-201.ec2.internal                1/1       Running   0          13m
kube-system   kube-proxy-ip-172-1-200-247.ec2.internal                1/1       Running   0          6m
kube-system   kube-proxy-ip-172-1-200-252.ec2.internal                1/1       Running   0          6m
kube-system   kube-proxy-ip-172-1-200-50.ec2.internal                 1/1       Running   0          7m
kube-system   kube-proxy-ip-172-1-201-170.ec2.internal                1/1       Running   0          6m
kube-system   kube-proxy-ip-172-1-201-212.ec2.internal                1/1       Running   0          13m
kube-system   kube-proxy-ip-172-1-201-221.ec2.internal                1/1       Running   0          7m
kube-system   kube-proxy-ip-172-1-201-240.ec2.internal                1/1       Running   0          6m
kube-system   kube-proxy-ip-172-1-201-38.ec2.internal                 1/1       Running   0          6m
kube-system   kube-proxy-ip-172-1-201-54.ec2.internal                 1/1       Running   0          6m
kube-system   kube-proxy-ip-172-1-202-156.ec2.internal                1/1       Running   0          7m
kube-system   kube-proxy-ip-172-1-202-169.ec2.internal                1/1       Running   0          7m
kube-system   kube-proxy-ip-172-1-202-191.ec2.internal                1/1       Running   0          7m
kube-system   kube-proxy-ip-172-1-202-22.ec2.internal                 1/1       Running   0          12m
kube-system   kube-proxy-ip-172-1-202-65.ec2.internal                 1/1       Running   0          7m
kube-system   kube-proxy-ip-172-1-202-98.ec2.internal                 1/1       Running   0          6m
kube-system   kube-scheduler-ip-172-1-200-201.ec2.internal            1/1       Running   0          12m
kube-system   kube-scheduler-ip-172-1-201-212.ec2.internal            1/1       Running   1          13m
kube-system   kube-scheduler-ip-172-1-202-22.ec2.internal             1/1       Running   1          12m

I tried 1.5.5 and 1.5.6 and both fail to create. This problem doesn't seem to happen when I use Kubernetes 1.6.1.

I have a hard limitation at the moment, so I cannot use Kubernetes 1.6+. I downgraded to kube-aws 0.9.5 and can use Kubernetes 1.5.5 with no issues.

jeremyd commented 7 years ago

This issue is intermittent. We finally got a cluster booted up, DNS stayed up, and it's looking good (after failing three times or so with the same config). I tried 1.6.1 and 1.5.6 and they both had the issue, so I think it's an intermittent problem in both versions. Here's my cluster config @mumoshu: https://gist.github.com/jeremyd/9d6b86aeacbaf7f6044f9497aecfba72

cknowles commented 7 years ago

Experienced this twice in a row on master 30dccf548e78fc8e96569fbf680c7098dff3f771. It looks like a couple of separate but possibly inter-related issues. Both the dashboard and some DNS pods are in the CrashLoopBackOff state. I think it's related to my usage of apiEndpoints.

✗ stern --namespace=kube-system kubernetes-dashboard
+ kubernetes-dashboard-v1.5.1-99tck › kubernetes-dashboard
kubernetes-dashboard-v1.5.1-99tck kubernetes-dashboard Using HTTP port: 9090
kubernetes-dashboard-v1.5.1-99tck kubernetes-dashboard Creating API server client for https://10.3.0.1:443
kubernetes-dashboard-v1.5.1-99tck kubernetes-dashboard Error while initializing connection to Kubernetes apiserver. This most likely means that the cluster is misconfigured (e.g., it has invalid apiserver certificates or service accounts configuration) or the --apiserver-host param points to a server that does not exist. Reason: the server has asked for the client to provide credentials
kubernetes-dashboard-v1.5.1-99tck kubernetes-dashboard Refer to the troubleshooting guide for more information: https://github.com/kubernetes/dashboard/blob/master/docs/user-guide/troubleshooting.md
✗ stern --namespace=kube-system kube-dns       
+ kube-dns-3816048056-lmnrn › dnsmasq-metrics
+ kube-dns-3816048056-lmnrn › healthz
+ kube-dns-3816048056-lmnrn › kubedns
+ kube-dns-autoscaler-1464605019-p8ds3 › autoscaler
[...]
kube-dns-3816048056-lmnrn dnsmasq-metrics ERROR: logging before flag.Parse: I0419 07:36:06.595875       1 main.go:38] dnsmasq-metrics v1.0
kube-dns-3816048056-lmnrn dnsmasq-metrics ERROR: logging before flag.Parse: I0419 07:36:06.596209       1 server.go:44] Starting server (options {DnsMasqPort:53 DnsMasqAddr:127.0.0.1 DnsMasqPollIntervalMs:5000 PrometheusAddr:0.0.0.0 PrometheusPort:10054 PrometheusPath:/metrics PrometheusNamespace:dnsmasq PrometheusSubsystem:cache})
kube-dns-3816048056-lmnrn dnsmasq-metrics ERROR: logging before flag.Parse: W0419 07:36:06.596952       1 server.go:53] Error getting metrics from dnsmasq: read udp 127.0.0.1:36209->127.0.0.1:53: read: connection refused
[...]
kube-dns-3816048056-lmnrn healthz 2017/04/19 07:37:23 Healthz probe on /healthz-kubedns error: Result of last exec: nslookup: can't resolve 'kubernetes.default.svc.cluster.local'
kube-dns-3816048056-lmnrn healthz , at 2017-04-19 07:37:22.670736899 +0000 UTC, error exit status 1
kube-dns-3816048056-lmnrn healthz 2017/04/19 07:37:23 Healthz probe on /healthz-dnsmasq error: Result of last exec: nslookup: can't resolve 'kubernetes.default.svc.cluster.local'
kube-dns-3816048056-lmnrn healthz , at 2017-04-19 07:37:22.670780949 +0000 UTC, error exit status 1
kube-dns-3816048056-lmnrn kubedns I0419 08:08:04.333018       1 dns.go:42] version: v1.6.0-alpha.0.680+3872cb93abf948-dirty
kube-dns-3816048056-lmnrn kubedns I0419 08:08:04.335187       1 server.go:107] Using https://10.3.0.1:443 for kubernetes master, kubernetes API: <nil>
kube-dns-3816048056-lmnrn healthz , at 2017-04-19 07:41:32.670965404 +0000 UTC, error exit status 1
kube-dns-3816048056-lmnrn healthz , at 2017-04-19 07:42:12.671813145 +0000 UTC, error exit status 1
kube-dns-3816048056-lmnrn kubedns I0419 08:08:04.335697       1 server.go:68] Using configuration read from ConfigMap: kube-system:kube-dns
kube-dns-3816048056-lmnrn kubedns I0419 08:08:04.335740       1 server.go:113] FLAG: --alsologtostderr="false"
kube-dns-3816048056-lmnrn kubedns I0419 08:08:04.335760       1 server.go:113] FLAG: --config-map="kube-dns"
kube-dns-3816048056-lmnrn kubedns I0419 08:08:04.335768       1 server.go:113] FLAG: --config-map-namespace="kube-system"
kube-dns-3816048056-lmnrn kubedns I0419 08:08:04.335773       1 server.go:113] FLAG: --dns-bind-address="0.0.0.0"
kube-dns-3816048056-lmnrn kubedns I0419 08:08:04.335778       1 server.go:113] FLAG: --dns-port="10053"
kube-dns-3816048056-lmnrn kubedns I0419 08:08:04.335788       1 server.go:113] FLAG: --domain="cluster.local."
[...]
kube-dns-3816048056-lmnrn kubedns I0419 08:08:04.335798       1 server.go:113] FLAG: --federations=""
kube-dns-3816048056-lmnrn kubedns I0419 08:08:04.335804       1 server.go:113] FLAG: --healthz-port="8081"
kube-dns-3816048056-lmnrn kubedns I0419 08:08:04.335810       1 server.go:113] FLAG: --kube-master-url=""
kube-dns-3816048056-lmnrn kubedns I0419 08:08:04.335818       1 server.go:113] FLAG: --kubecfg-file=""
kube-dns-3816048056-lmnrn kubedns I0419 08:08:04.335823       1 server.go:113] FLAG: --log-backtrace-at=":0"
kube-dns-3816048056-lmnrn kubedns I0419 08:08:04.335831       1 server.go:113] FLAG: --log-dir=""
kube-dns-3816048056-lmnrn kubedns I0419 08:08:04.335838       1 server.go:113] FLAG: --log-flush-frequency="5s"
kube-dns-3816048056-lmnrn kubedns I0419 08:08:04.335846       1 server.go:113] FLAG: --logtostderr="true"
kube-dns-3816048056-lmnrn kubedns I0419 08:08:04.335852       1 server.go:113] FLAG: --stderrthreshold="2"
kube-dns-3816048056-lmnrn kubedns I0419 08:08:04.335859       1 server.go:113] FLAG: --v="2"
kube-dns-3816048056-lmnrn kubedns I0419 08:08:04.335864       1 server.go:113] FLAG: --version="false"
kube-dns-3816048056-lmnrn kubedns I0419 08:08:04.335877       1 server.go:113] FLAG: --vmodule=""
[...]
kube-dns-3816048056-lmnrn dnsmasq-metrics 2017/04/19 07:37:23 Healthz probe on /healthz-kubedns error: Result of last exec: nslookup: can't resolve 'kubernetes.default.svc.cluster.local'
kube-dns-3816048056-lmnrn dnsmasq-metrics 2017/04/19 07:37:23 Healthz probe on /healthz-kubedns error: Result of last exec: nslookup: can't resolve 'kubernetes.default.svc.cluster.local'
kube-dns-3816048056-lmnrn kubedns I0419 08:08:04.336034       1 server.go:155] Starting SkyDNS server (0.0.0.0:10053)
kube-dns-3816048056-lmnrn kubedns I0419 08:08:04.336333       1 server.go:165] Skydns metrics enabled (/metrics:10055)
kube-dns-3816048056-lmnrn kubedns I0419 08:08:04.336349       1 dns.go:144] Starting endpointsController
kube-dns-3816048056-lmnrn kubedns I0419 08:08:04.336354       1 dns.go:147] Starting serviceController
kube-dns-3816048056-lmnrn kubedns I0419 08:08:04.336738       1 logs.go:41] skydns: ready for queries on cluster.local. for tcp://0.0.0.0:10053 [rcache 0]
kube-dns-3816048056-lmnrn kubedns I0419 08:08:04.336760       1 logs.go:41] skydns: ready for queries on cluster.local. for udp://0.0.0.0:10053 [rcache 0]
kube-dns-3816048056-lmnrn kubedns E0419 08:08:04.362716       1 reflector.go:199] pkg/dns/dns.go:145: Failed to list *api.Endpoints: the server has asked for the client to provide credentials (get endpoints)
kube-dns-3816048056-lmnrn kubedns E0419 08:08:04.363270       1 reflector.go:199] pkg/dns/dns.go:148: Failed to list *api.Service: the server has asked for the client to provide credentials (get services)
kube-dns-3816048056-lmnrn kubedns E0419 08:08:04.363661       1 sync.go:105] Error getting ConfigMap kube-system:kube-dns err: the server has asked for the client to provide credentials (get configmaps kube-dns)
kube-dns-3816048056-lmnrn kubedns E0419 08:08:04.363690       1 dns.go:190] Error getting initial ConfigMap: the server has asked for the client to provide credentials (get configmaps kube-dns), starting with default values
kube-dns-3816048056-lmnrn kubedns I0419 08:08:04.363719       1 dns.go:163] Waiting for Kubernetes service
kube-dns-3816048056-lmnrn kubedns I0419 08:08:04.363763       1 dns.go:169] Waiting for service: default/kubernetes
[...]
kube-dns-autoscaler-1464605019-p8ds3 autoscaler I0419 07:36:06.676194       1 autoscaler.go:49] Scaling Namespace: kube-system, Target: deployment/kube-dns, Mode: linear
kube-dns-autoscaler-1464605019-p8ds3 autoscaler E0419 07:36:06.710613       1 autoscaler_server.go:96] Error while getting cluster status: the server has asked for the client to provide credentials (get nodes)
kube-dns-autoscaler-1464605019-p8ds3 autoscaler E0419 07:37:36.678574       1 autoscaler_server.go:96] Error while getting cluster status: the server has asked for the client to provide credentials (get nodes)
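
The healthz failures above come from nslookup against the cluster DNS, so a quick sanity check is to run the same lookup from a throwaway pod. A hedged sketch, assuming a busybox image can be pulled:

```
# Run a one-off pod that mirrors the healthz probe's lookup.
kubectl run dns-test --rm -i --tty --image=busybox --restart=Never -- \
  nslookup kubernetes.default.svc.cluster.local
```
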
cknowles commented 7 years ago

I have 2 controllers, 3 etcd nodes, and an apiEndpoints config as per https://github.com/kubernetes-incubator/kube-aws/issues/527#issue-220837426.

cknowles commented 7 years ago

I've just done a full cluster recreate and it seems OK this time. My errors before appeared after doing a render of credentials, then a render of the stack, and then an update. @jeremyd, in your case you said it was a new cluster; just to confirm, was it an entirely new stack created with kube-aws up?

cmcconnell1 commented 7 years ago

Hello all, I'm also experiencing the same problems and symptoms as @jeremyd after deploying the latest kube-aws:

kube-aws version
kube-aws version v0.9.6-rc.2

What is strange is that I just deployed a cluster a few days ago and it uses the same CoreOS AMI, kube-aws version, cluster.yaml file, etc., so this is a bit curious.

We are using nodepools, which is new for us with this and the previous release. The cluster I deployed a few days ago was initially deployed as H/A with multi-AZ nodepools. I then updated that cluster by commenting out one of my nodepool sections in cluster.yaml (i.e. disabling the us-west-1b nodepool) and simply ran kube-aws update on it, and everything worked perfectly, with cfn tearing down the stack, etc. I saw no errors or issues with that cluster, and checking it just now, all kube-system pods appear healthy:

DEV / working cluster deployed a few days ago:

export KUBECONFIG=`pwd`/kubeconfig
SHOW_KUBECONFIG
ops-dev.dev.terradatum.com

kk get po
NAME                                                                READY     STATUS    RESTARTS   AGE
heapster-v1.3.0-76786035-g270k                                      2/2       Running   0          1d
kube-apiserver-ip-10-1-17-152.us-west-1.compute.internal            1/1       Running   0          1d
kube-controller-manager-ip-10-1-17-152.us-west-1.compute.internal   1/1       Running   0          1d
kube-dns-3816048056-5kp1m                                           4/4       Running   0          1d
kube-dns-3816048056-fx3qx                                           4/4       Running   0          1d
kube-dns-autoscaler-1464605019-nsb7m                                1/1       Running   0          1d
kube-proxy-ip-10-1-16-116.us-west-1.compute.internal                1/1       Running   0          1d
kube-proxy-ip-10-1-17-152.us-west-1.compute.internal                1/1       Running   0          1d
kube-scheduler-ip-10-1-17-152.us-west-1.compute.internal            1/1       Running   0          1d
kubernetes-dashboard-v1.5.1-cqr8j                                   1/1       Running   0          1d
tiller-deploy-1172528075-qr2mv                                      1/1       Running   0          1d

Latest sick / sadpanda cluster

export KUBECONFIG=`pwd`/kubeconfig
SHOW_KUBECONFIG
aergo-prod.terradatum.com

kk get po
NAME                                                                READY     STATUS             RESTARTS   AGE
heapster-v1.3.0-268032834-jc740                                     2/2       Running            0          2h
kube-apiserver-ip-10-1-14-191.us-west-1.compute.internal            1/1       Running            0          1h
kube-controller-manager-ip-10-1-14-191.us-west-1.compute.internal   1/1       Running            0          1h
kube-dns-3816048056-6c4x5                                           2/4       CrashLoopBackOff   30         2h
kube-dns-autoscaler-1464605019-q0xlf                                1/1       Running            0          2h
kube-proxy-ip-10-1-14-191.us-west-1.compute.internal                1/1       Running            0          1h
kube-proxy-ip-10-1-14-82.us-west-1.compute.internal                 1/1       Running            0          53m
kube-proxy-ip-10-1-15-50.us-west-1.compute.internal                 1/1       Running            0          53m
kube-scheduler-ip-10-1-14-191.us-west-1.compute.internal            1/1       Running            0          1h
kubernetes-dashboard-v1.5.1-p0m98                                   0/1       CrashLoopBackOff   15         2h
tiller-deploy-1332173772-bxz2j                                      0/1       CrashLoopBackOff   13         42m

This cluster deploy started from a blank/clean initial cluster; each deployment attempt since then has used the same directory, with some kube-aws render stack commands as necessary (roughly the flow sketched below).
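
For context, a hedged sketch of the render-and-update flow being described; the S3 URI is a placeholder, and the exact flags should be checked against your kube-aws version:

```
# After editing cluster.yaml:
kube-aws render stack
kube-aws validate --s3-uri s3://my-kube-aws-assets/prefix
kube-aws update --s3-uri s3://my-kube-aws-assets/prefix
```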

Thanks @jeremyd for posting your cluster.yaml file in the above gist; it's very helpful for me/us to see how others are deploying.

I can post our cluster.yaml file too, but again, note that it was working a few days ago with the previous successful cluster deploy.

kube-dns error logs

redbaron commented 7 years ago

Very likely flannel is broken on controller nodes: https://github.com/kubernetes-incubator/kube-aws/pull/558. Not sure if it is related to the problem you see, but worth checking.
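
A quick hedged check of flannel health on a controller node, assuming SSH access and the flanneld systemd unit shipped with Container Linux:

```
# On a controller node:
systemctl status flanneld
journalctl -u flanneld --no-pager -n 50   # last 50 log lines
ip addr show flannel.1                    # device name assumes the vxlan backend
```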

jrcast commented 7 years ago

Whatever it is, it's only broken when I use 0.9.6+ with Kubernetes 1.5.5 or 1.5.6. I tried again today with 0.9.6-rc.2 and saw the same issue. The issue is not there with 0.9.5.

cmcconnell1 commented 7 years ago

Thanks @jcastillo-cb / all. I was trying to run with the most recent versions possible, so I tried bumping down one RC to the previous release, v0.9.6-rc.1, and as @jcastillo-cb noted above, I can confirm that we're still seeing the problems.

kube-aws version
kube-aws version v0.9.6-rc.1

I tried deleting the pods, to no avail...

kk get po
NAME                                                                READY     STATUS             RESTARTS   AGE
heapster-v1.3.0-268032834-dnc2g                                     2/2       Running            0          4h
kube-apiserver-ip-10-1-14-156.us-west-1.compute.internal            1/1       Running            0          1h
kube-controller-manager-ip-10-1-14-156.us-west-1.compute.internal   1/1       Running            0          1h
kube-dns-3816048056-6g9wq                                           3/4       Running            6          4h
kube-dns-autoscaler-1464605019-20t6q                                1/1       Running            0          4h
kube-proxy-ip-10-1-14-156.us-west-1.compute.internal                1/1       Running            0          1h
kube-proxy-ip-10-1-14-31.us-west-1.compute.internal                 1/1       Running            0          9m
kube-proxy-ip-10-1-15-87.us-west-1.compute.internal                 1/1       Running            0          9m
kube-scheduler-ip-10-1-14-156.us-west-1.compute.internal            1/1       Running            0          1h
kubernetes-dashboard-v1.5.1-nlvb3                                   0/1       CrashLoopBackOff   2          35s
tiller-deploy-1332173772-smrg1                                      0/1       CrashLoopBackOff   2          35s

So I will be reverting to 0.9.5 as well. In case it's helpful, here is a gist with the errors/logs for all kube-dns containers, collected with for i in kubedns dnsmasq dnsmasq-metrics healthz; do kk logs kube-dns-3816048056-6g9wq $i; done: all kube-dns containers error/logs

mumoshu commented 7 years ago

@cmcconnell1 @jcastillo-cb @jeremyd @c-knowles Sorry for being a bit late to the party. First, let me clarify that starting with v0.9.6-rc.1, k8s older than v1.6.0 is not supported! Would you mind confirming your Kubernetes versions? If you're using v1.5.x, would using v1.6.x instead solve your issues?

cknowles commented 7 years ago

In my case above I'm not customising the kube version, so it's the default on master, which I believe is 1.6.1.

mumoshu commented 7 years ago

@cmcconnell1 @jcastillo-cb @jeremyd @c-knowles Would you mind letting me know if you've enabled Calico on your cluster?

cmcconnell1 commented 7 years ago

Hi @mumoshu / All, I'm not hard coding any specific kubernetes (or other dependencies) versions in my configurations. I typically take the default and then only pin things, such as an AMI if needed--perhaps when things change and I can no longer deploy a cluster without using an older image, etc.

I tried both default most recent kube-aws versions 0.9.6-rc{1,2} and was only able to deploy a healthy / working kube cluster once last week at 2017-04-17-14:01:53-PDT to be exact. But, please note that haven't been able to deploy another cluster on 0.9.6 since sometime after that.
I create wrapper and helper scripts to assist with things like this, so I can go back to that cluster and extract the details of when it was deployed, etc. The below details are from when I was able to deploy a cluster on 2017-04-17 with v0.9.6-rc.2

DATE: 2017-04-17-14:01:53-PDT
KUBE-AWS-VERSION: v0.9.6-rc.2

I just validated that when I bump back down to 0.9.5 and strip out some of the newer 0.9.6-specific configuration options, I'm able to deploy a healthy working kube cluster again:

kube-aws version && helm version && kk get po
kube-aws version v0.9.5
Client: &version.Version{SemVer:"v2.3.1", GitCommit:"32562a3040bb5ca690339b9840b6f60f8ce25da4", GitTreeState:"clean"}
Server: &version.Version{SemVer:"v2.3.1", GitCommit:"32562a3040bb5ca690339b9840b6f60f8ce25da4", GitTreeState:"clean"}
NAME                                                               READY     STATUS    RESTARTS   AGE
heapster-v1.3.0-567306696-vvlxn                                    2/2       Running   0          8m
kube-apiserver-ip-10-1-15-60.us-west-1.compute.internal            1/1       Running   0          17m
kube-controller-manager-ip-10-1-15-60.us-west-1.compute.internal   1/1       Running   0          17m
kube-dns-782804071-8pzg8                                           4/4       Running   0          8m
kube-dns-782804071-jcqgg                                           4/4       Running   0          17m
kube-dns-autoscaler-2813114833-v9jw3                               1/1       Running   0          17m
kube-proxy-ip-10-1-14-164.us-west-1.compute.internal               1/1       Running   0          8m
kube-proxy-ip-10-1-15-201.us-west-1.compute.internal               1/1       Running   0          8m
kube-proxy-ip-10-1-15-60.us-west-1.compute.internal                1/1       Running   0          17m
kube-scheduler-ip-10-1-15-60.us-west-1.compute.internal            1/1       Running   0          16m
kubernetes-dashboard-v1.5.1-b3qbp                                  1/1       Running   0          17m
tiller-deploy-3067024529-8lfts                                     1/1       Running   0          1m

jrcast commented 7 years ago

@mumoshu I am using flannel. However, as I noted above, my problem only happens if I choose Kubernetes 1.5.x; I don't see the issue with 1.6.x. Based on your comments, it sounds like 1.5.x is not supported in 0.9.6+. This is a bummer since (at the moment) Spinnaker does not support Kubernetes 1.6.x thanks to the Kubernetes API changes. So I guess I will be stuck using kube-aws 0.9.5 + Kubernetes 1.5.6.

cknowles commented 7 years ago

@mumoshu not using Calico. It looks like these are similar but separate issues. I'll open another issue for my scenario once I understand a bit more about what is happening.

mumoshu commented 7 years ago

@jcastillo-cb Thanks for the confirmation! I haven't tried it myself, but if you're on kube-aws v0.9.6 + k8s v1.5.x + flannel, the only thing that would affect you is the default etcd version changing from 2.x to 3.x. Would you mind adding the below to your cluster.yaml and seeing if it works?

etcd:
  version: 2

mumoshu commented 7 years ago

@cmcconnell1 Thanks for the info! Your detailed report, together with my own experience from today, seems to reveal the root cause of our issues.

For me, one of the nodes had been consistently NotReady for days and kube-dns was somehow trapped in CrashLoopBackOff: https://gist.github.com/mumoshu/2e0b9e6887a85ad542c83c2b9745b9d3 Interestingly, the NotReady node had the private IP of an etcd node ❗️

I guess you've recreated the problematic clusters with the same clusterName used before for the successful cluster? If so, I believe our problems are caused by automatic etcd cluster restoration from etcd snapshots. Automatic etcd snapshots are persisted under a directory prefixed with the S3 URI specified by --s3-uri.

Can you clear your S3 bucket specified in --s3-uri and try recreating your cluster?
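
A hedged sketch of that cleanup; the bucket and prefix are placeholders for whatever you passed as --s3-uri, so double-check the listing before deleting anything:

```
# Inspect what kube-aws persisted under the --s3-uri prefix.
aws s3 ls --recursive s3://my-kube-aws-assets/prefix/
# Remove it so the next `kube-aws up` starts from a clean slate.
aws s3 rm --recursive s3://my-kube-aws-assets/prefix/
```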

redbaron commented 7 years ago

kube 1.6.2 was just released with the following fix, which again might have something to do with your kube-dns misbehaviour: https://github.com/kubernetes/kubernetes/pull/44102

TL;DR: a wrong secret might be mounted as a service account key.
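
A hedged way to test that hypothesis on an affected pod, using the pod and secret names from earlier in this thread (assumes the container image ships cat):

```
# Token stored in the ServiceAccount secret (use `base64 -D` on macOS).
kubectl get secret default-token-1r1j5 --namespace=kube-system \
  -o jsonpath='{.data.token}' | base64 -d > /tmp/secret.token
# Token actually mounted into the pod.
kubectl exec --namespace=kube-system kube-dns-3816048056-xcvn4 -c kubedns -- \
  cat /var/run/secrets/kubernetes.io/serviceaccount/token > /tmp/mounted.token
diff /tmp/secret.token /tmp/mounted.token && echo "tokens match"
```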

cmcconnell1 commented 7 years ago

Hey @mumoshu / All

I tried a couple of slight tweaks and re-deploys using the most recent v0.9.6-rc.2, and it seems that, as @mumoshu noted above, cruft in the cluster's S3 bucket was causing blocking issues/failures for me.

Interestingly for me, when I tried deploying with v0.9.6-rc.2 and configuring etcd to version: 2, I got a strange failure about the controller, which resulted in a failed cfn stack, etc.:

Creating AWS resources. Please wait. It may take a few minutes. Error: Error creating cluster: Stack creation failed: CREATE_FAILED : The following resource(s) failed to create: [Controlplane].

the cfn stack failure/barf:

Printing the most recent failed stack events:
CREATE_FAILED AWS::CloudFormation::Stack cmcc-prod The following resource(s) failed to create: [Controlplane].
CREATE_FAILED AWS::CloudFormation::Stack Controlplane Embedded stack arn:aws:cloudformation:us-west-1:076613928512:stack/cmcc-prod-Controlplane-1R446B6CG6PXF/81d62830-2604-11e7-8466-500cf8eeb88d was not successfully created: The following resource(s) failed to create: [Controllers].

Sadly I seem to have missed grabbing the part of the error that meant something; I recall it was due to not being able to maintain the required number of nodes for the ASG. I will post my cluster.yaml file (which is working for me now) in a gist; YMMV.

But what is curious is that the above error occurred with the exact same controller ASG settings that I've used for many previous releases (and the same settings that worked right after this failure, simply by omitting all etcd version specifications from cluster.yaml). It seems likely to be an issue between etcd version 2 and the latest kube version, but I can't really afford to spend more time on this. So, to summarize: I was not able to deploy with the most recent v0.9.6-rc.2 and etcd version: 2.

But perhaps more importantly, I was able to deploy successfully with the latest v0.9.6-rc.2 simply by commenting out any/all etcd version specifications and starting with a clean S3 bucket and a new cluster name. What I am not able to explain is that numerous kube-aws destroy and redeploy cycles are now successful. Perhaps something got corrupted in the S3 bucket data that, once purged, solved the problem; or it was random or intermittent; or something else got fixed that I don't have visibility into. Not sure, but I would recommend that those who have issues do as @mumoshu asked and purge the S3 bucket and redeploy, as I'm good with the latest kube-aws version.

On that note, I've deployed multiple times now on the most recent version and cannot reproduce the failure with the crashed pods, etc.

cat cmcc-prod-deployment-stats (shows previous cluster deploys and metadata)

##### Kube Cluster Provision Details #####
KUBE-CLUSTER-NAME: cmcc-prod
DATE: 2017-04-20-12:09:14-PDT
KUBE-AWS-VERSION: v0.9.6-rc.2
AMI-ID:
##########################################
##### Kube Cluster Provision Details #####
KUBE-CLUSTER-NAME: cmcc-prod
DATE: 2017-04-20-13:04:13-PDT
KUBE-AWS-VERSION: v0.9.6-rc.2
AMI-ID:
##########################################
##### Kube Cluster Provision Details #####
KUBE-CLUSTER-NAME: cmcc-prod
DATE: 2017-04-20-16:14:36-PDT
KUBE-AWS-VERSION: v0.9.5
AMI-ID:
##########################################
aws ec2 describe-instances --query 'Reservations[*].Instances[*].[InstanceId,Tags[?Key==`Name`].Value|[0],State.Name]' --output text | column -t | grep -i "${cluster_name}" | grep 'running'
i-0d4bbfc2ef4adaf2c  cmcc-prod-control-plane-kube-aws-etcd-1       running
i-0067874089f39e43d  cmcc-prod-cmcc-prod-1a-kube-aws-worker        running
i-07b1e492f2121adf0  cmcc-prod-control-plane-kube-aws-controller   running
i-0d33c9aec724a15cf  cmcc-prod-control-plane-kube-aws-etcd-2       running
i-0319789f350feaadc  cmcc-prod-cmcc-prod-1b-kube-aws-worker        running
i-0998ae82d5f8e3c5a  cmcc-prod-control-plane-kube-aws-etcd-0       running

kk version
Client Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.2", GitCommit:"477efc3cbe6a7effca06bd1452fa356e2201e1ee", GitTreeState:"clean", BuildDate:"2017-04-19T22:51:36Z", GoVersion:"go1.8.1", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.1+coreos.0", GitCommit:"9212f77ed8c169a0afa02e58dce87913c6387b3e", GitTreeState:"clean", BuildDate:"2017-04-04T00:32:53Z", GoVersion:"go1.7.5", Compiler:"gc", Platform:"linux/amd64"}

kk get po
NAME                                                               READY     STATUS    RESTARTS   AGE
heapster-v1.3.0-76786035-x4s8f                                     2/2       Running   0          29m
kube-apiserver-ip-10-1-21-46.us-west-1.compute.internal            1/1       Running   0          28m
kube-controller-manager-ip-10-1-21-46.us-west-1.compute.internal   1/1       Running   0          28m
kube-dns-3816048056-ld0mm                                          4/4       Running   0          29m
kube-dns-3816048056-pb8zf                                          4/4       Running   0          29m
kube-dns-autoscaler-1464605019-96cvf                               1/1       Running   0          29m
kube-proxy-ip-10-1-20-209.us-west-1.compute.internal               1/1       Running   0          22m
kube-proxy-ip-10-1-21-253.us-west-1.compute.internal               1/1       Running   0          22m
kube-proxy-ip-10-1-21-46.us-west-1.compute.internal                1/1       Running   0          30m
kube-scheduler-ip-10-1-21-46.us-west-1.compute.internal            1/1       Running   0          28m
kubernetes-dashboard-v1.5.1-61mnn                                  1/1       Running   0          29m
tiller-deploy-1332173772-s2ghj                                     1/1       Running   0          29m

will scrub my cluster.yaml file and post as well in an update.

jrcast commented 7 years ago

@mumoshu I tried using version: 2 for etcd, and the cluster fails to start; the controller's ASG fails to create/signal back.

Using basically the same cluster.yml I tried the following scenarios:

Let me know what logs/info you need to help solve this issue. I'm kinda stuck with 1.5.x at the moment so can't really use 1.6.x yet.

redbaron commented 7 years ago

@jrcast , if you are stuck with 1.5 you might be interested in https://github.com/kubernetes-incubator/kube-aws/issues/599

jeremyd commented 7 years ago

This looks fixed; it hasn't happened to me in a while now. Closing!

baconalot commented 6 years ago

I had the same kube-dns problems (Error getting metrics from dnsmasq: read udp 127.0.0.1:40827->127.0.0.1:53: read: connection refused). A cluster destroy, a purge of the S3 bucket, and a cluster recreation did the trick. The cluster is now finally running. On to the next certificate error...
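
For anyone landing here later, a hedged sketch of that destroy/purge/recreate sequence; the bucket and prefix are placeholders for whatever --s3-uri the cluster was created with:

```
kube-aws destroy
aws s3 rm --recursive s3://my-kube-aws-assets/prefix/
kube-aws up --s3-uri s3://my-kube-aws-assets/prefix
```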