Closed · cknowles closed this issue 6 years ago
Thanks for reporting!
Not sure this is affecting everyone's cluster, but per our observations in the referenced issue, let's improve the node drainer to use curl
instead and see if that works.
@c-knowles @mumoshu How can I solve this? I am new to K8s 😄
@c-knowles @mumoshu This happens for me on new nodes or nodes that recently pulled the aws-cli container image.
bash-4.3# wget -O - -q http://169.254.169.254/2016-09-02/meta-data/instance-id
wget: error getting response
bash-4.3# apk update
fetch http://dl-cdn.alpinelinux.org/alpine/v3.6/main/x86_64/APKINDEX.tar.gz
fetch http://dl-cdn.alpinelinux.org/alpine/v3.6/community/x86_64/APKINDEX.tar.gz
v3.6.2-248-g3f8eeb3ea1 [http://dl-cdn.alpinelinux.org/alpine/v3.6/main]
v3.6.2-250-g51a3714b5e [http://dl-cdn.alpinelinux.org/alpine/v3.6/community]
OK: 8437 distinct packages available
bash-4.3# apk add wget
(1/1) Installing wget (1.19.2-r0)
Executing busybox-1.26.2-r9.trigger
OK: 75 MiB in 35 packages
bash-4.3# wget -O - -q http://169.254.169.254/2016-09-02/meta-data/instance-id
i-0fccf94cf2be4de7a
bash-4.3#
Not sure if mine is the same issue; if not, I'm happy to raise a separate issue. What I have seen is that, starting from a working k8s cluster spun up with kube-aws, where node-drainer is working fine and the pods have all started fine, IF I change the instance type of the worker nodes to the new c5 range (I actually only tried c5.4xlarge) and apply, so that my worker nodes are now the new instance type, the node-drainers start to CrashLoopBackOff.
The instance type is the ONLY thing I have changed between working and non-working node-drainers on my worker nodes. i.e. Working:
kube-system kube-node-drainer-ds-6xd9r 1/1 Running 0 8m
kube-system kube-node-drainer-ds-kbpgv 1/1 Running 0 8m
kube-system kube-node-drainer-ds-kmr9r 1/1 Running 0 8m
Non-Working:
kube-system kube-node-drainer-asg-status-updater-f9f67c9c7-w7gwg 0/1 CrashLoopBackOff 5 5m
kube-system kube-node-drainer-ds-58vnq 0/1 CrashLoopBackOff 6 8m
kube-system kube-node-drainer-ds-qlg76 0/1 CrashLoopBackOff 6 8m
kube-system kube-node-drainer-ds-scsxl 0/1 CrashLoopBackOff 6 8m
I have also included the node-drainer-asg-status-updater because that is also failing now on the new c5 instance types with logs showing me:
kubectl logs kube-node-drainer-asg-status-updater-f9f67c9c7-w7gwg -n kube-system
+ metadata dynamic/instance-identity/document
+ wget -O - -q http://169.254.169.254/2016-09-02/dynamic/instance-identity/document
+ jq -r .region
wget: error getting response
+ REGION=
+ [ -n ]
The node-drainer pods also give me the same error message.
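The trace above suggests the failure mode: wget gets no response, so REGION ends up empty, the `[ -n ]` check fails, the script exits non-zero, and kubelet keeps restarting the pod until it backs off. A minimal sketch of that mechanism (hypothetical helper name, not the actual kube-aws script):

```shell
#!/bin/sh
# Hypothetical reconstruction of the failing step seen in the trace above:
# an empty REGION makes the [ -n ] check fail, the script exits non-zero,
# and kubelet restarts the container until it hits CrashLoopBackOff.
check_region() {
  if [ -n "$1" ]; then
    echo "region: $1"
  else
    echo "error: metadata fetch returned nothing, REGION is empty" >&2
    return 1
  fi
}

check_region us-west-2    # prints "region: us-west-2"
check_region "" || echo "script would exit here and the pod would crash-loop" >&2
```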
It does sound like a different problem so let me know if a new issue is required. I have found https://github.com/coreos/bugs/issues/2331 which is kind of related.
FYI, I just spun up a test container on the same worker node (c5.4xlarge) that has the failing node-drainer pod, and from that test container the same wget that fails from node-drainer works fine.
@vaibhavsingh97 a quick way to resolve this is to change the wget calls in the controller config to use curl and then update your cluster with that. Longer term, I think we should change the default in kube-aws so it does not always pull master.
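A hedged sketch of what that swap could look like, reusing the `metadata` helper name and endpoint paths visible in the trace above (the exact kube-aws config may differ):

```shell
#!/bin/sh
# Illustrative sketch: the metadata helper from the trace above, with
# busybox wget swapped for curl.
# -s: silent, -f: exit non-zero on HTTP errors such as the 502s seen here.
metadata() {
  curl -sf "http://169.254.169.254/2016-09-02/$1"
}

# On a real node this would yield the region, e.g.:
#   REGION=$(metadata dynamic/instance-identity/document | jq -r .region)
```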
Thanks @c-knowles for pointing it out. I will make a PR 👍
Looks like there is no good alternative tag - https://quay.io/repository/coreos/awscli?tag=latest&tab=tags. @mumoshu perhaps we should build an AWS CLI image ourselves, or use a different one from Docker Hub, so we can pin the version a bit better?
Yeah, let's build one ourselves
@c-knowles Just forked coreos/awscli to https://github.com/kube-aws/docker-awscli.
Would you mind sending a PR for switching to curl?
The docker repo is also available at https://hub.docker.com/r/kubeaws/awscli/ with automated build enabled.
@mumoshu yeah ok, I will pick this item up now.
@mumoshu PR done. I haven't swapped to the new image yet; I see there are some issues on the coreos repo about getting versions pinned.
Oops!! Looks like I am late 😄. Is there any other beginner issue I can solve?
@vaibhavsingh97 sorry! Lots of good first issues, I'd suggest one of https://github.com/kubernetes-incubator/kube-aws/issues/950, https://github.com/kubernetes-incubator/kube-aws/issues/1085 or https://github.com/kubernetes-incubator/kube-aws/issues/1063.
@c-knowles Thanks for the PR! I'll take a look soon.
Regarding the awscli image pinning, I've just pushed kubeaws/awscli:0.9.0 via automated build. I'd appreciate it if you could submit PRs to change the awscli image used by kube-aws to that one! @vaibhavsingh97 @c-knowles
// Btw, it works like this: as soon as a git tag is pushed, an automated build is triggered for the image tag with the same value as the git tag.
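To illustrate that flow with a throwaway local repo (directory and author details here are placeholders; only the tag value comes from the comment above):

```shell
#!/bin/sh
# Demonstration of the tag-driven automated build described above, using a
# throwaway local repo. Pushing the tag (commented out) is what would
# trigger the Docker Hub automated build of kubeaws/awscli:0.9.0.
repo=$(mktemp -d)
git -C "$repo" init -q
git -C "$repo" -c user.email=you@example.com -c user.name=you \
    commit -q --allow-empty -m "initial commit"
git -C "$repo" tag 0.9.0
# git -C "$repo" push origin 0.9.0   # would trigger the kubeaws/awscli:0.9.0 build

git -C "$repo" tag --list            # prints 0.9.0
```

The pinned tag then goes in the manifest in place of `:latest`.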
@mumoshu I would be happy to submit a PR. Can you please point me to the resources?
Extracted from https://github.com/kubernetes-incubator/kube-aws/issues/1105#issuecomment-363316491.
The current node drainer scripting uses wget but it seems to have some problems with 502s:
On a Running node-drainer pod, I've exec'd into the pod and run the commands shown above.
Probably related issue with wget in busybox: http://svn.dd-wrt.com/ticket/5771
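Beyond swapping clients, one way to harden against the transient 502s mentioned above is an explicit retry loop around the fetch. This is a sketch under my own naming (`fetch_with_retry` is not from kube-aws):

```shell
#!/bin/sh
# Assumed hardening sketch (not from the kube-aws source): retry the
# metadata fetch a few times so a transient 502 does not immediately
# crash the script.
fetch_with_retry() {
  url=$1
  tries=${2:-3}
  i=0
  while [ "$i" -lt "$tries" ]; do
    # -s: silent, -f: non-zero exit on HTTP errors like 502
    if out=$(curl -sf "$url"); then
      printf '%s\n' "$out"
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  echo "giving up after $tries attempts: $url" >&2
  return 1
}

# Usage on a node (commented out; needs the EC2 metadata endpoint):
#   fetch_with_retry http://169.254.169.254/2016-09-02/meta-data/instance-id
```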