kubernetes-retired / kube-aws

[EOL] A command-line tool to declaratively manage Kubernetes clusters on AWS
Apache License 2.0

wget in node drainer script has stopped working #1125

Closed cknowles closed 6 years ago

cknowles commented 6 years ago

Extracted from https://github.com/kubernetes-incubator/kube-aws/issues/1105#issuecomment-363316491.

The current node drainer script uses wget, but it seems to have some problems with 502s:

wget -O - -q http://169.254.169.254/2016-09-02/meta-data/instance-id
wget: server returned error: HTTP/1.1 502 Bad Gateway 

On a Running node drainer pod, I've exec'd into the pod and run this:

/ # wget -O - -q http://169.254.169.254/latest/meta-data/
wget: server returned error: HTTP/1.1 502 Bad Gateway

/ # curl http://169.254.169.254/latest/meta-data/
ami-id
ami-launch-index
ami-manifest-path
block-device-mapping/
hostname
[...]

Probably a related issue with wget in busybox: http://svn.dd-wrt.com/ticket/5771

mumoshu commented 6 years ago

Thanks for reporting! I'm not sure this is affecting everyone's cluster, but based on our observations in the referenced issue, let's improve the node drainer to use curl instead and see if that works.
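For example, something like this could work (just a sketch, assuming the script wraps the IMDS call in a small helper; the exact shape of the real script may differ):

# sketch: query instance metadata with curl instead of busybox wget
# -s silences progress output, -f makes curl fail on HTTP errors such as the 502s above,
# --retry 3 retries transient failures
metadata() {
  curl -sf --retry 3 "http://169.254.169.254/2016-09-02/${1}"
}
INSTANCE_ID=$(metadata meta-data/instance-id)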

vaibhavsingh97 commented 6 years ago

@c-knowles @mumoshu How can I solve this? I am new to K8s 😄

camilb commented 6 years ago

@c-knowles @mumoshu This happens for me on new nodes or nodes that recently pulled the aws-cli container image.

bash-4.3# wget -O - -q http://169.254.169.254/2016-09-02/meta-data/instance-id
wget: error getting response
bash-4.3# apk update
fetch http://dl-cdn.alpinelinux.org/alpine/v3.6/main/x86_64/APKINDEX.tar.gz
fetch http://dl-cdn.alpinelinux.org/alpine/v3.6/community/x86_64/APKINDEX.tar.gz
v3.6.2-248-g3f8eeb3ea1 [http://dl-cdn.alpinelinux.org/alpine/v3.6/main]
v3.6.2-250-g51a3714b5e [http://dl-cdn.alpinelinux.org/alpine/v3.6/community]
OK: 8437 distinct packages available
bash-4.3# apk add wget
(1/1) Installing wget (1.19.2-r0)
Executing busybox-1.26.2-r9.trigger
OK: 75 MiB in 35 packages
bash-4.3# wget -O - -q http://169.254.169.254/2016-09-02/meta-data/instance-id
i-0fccf94cf2be4de7a
bash-4.3#

kiich commented 6 years ago

Not sure if mine is the same issue - if not, I'm happy to raise a separate one. What I've seen is this: starting from a working k8s cluster spun up with kube-aws, where the node-drainer pods have all started and are working fine, IF I change the instance type of the worker nodes to the new c5 range (I've actually only tried c5.4xlarge) and apply, so that my worker nodes are now the new instance type, the node-drainers start to CrashLoopBackOff.

The instance type is the ONLY thing I have changed between working and non-working node-drainers on my worker nodes. i.e. Working:

kube-system           kube-node-drainer-ds-6xd9r                                           1/1       Running             0          8m
kube-system           kube-node-drainer-ds-kbpgv                                           1/1       Running             0          8m
kube-system           kube-node-drainer-ds-kmr9r                                           1/1       Running             0          8m

Non-Working:

kube-system           kube-node-drainer-asg-status-updater-f9f67c9c7-w7gwg                 0/1       CrashLoopBackOff    5          5m
kube-system           kube-node-drainer-ds-58vnq                                           0/1       CrashLoopBackOff    6          8m
kube-system           kube-node-drainer-ds-qlg76                                           0/1       CrashLoopBackOff    6          8m
kube-system           kube-node-drainer-ds-scsxl                                           0/1       CrashLoopBackOff    6          8m

I have also included the node-drainer-asg-status-updater because it is also failing now on the new c5 instance types, with logs showing:

kubectl logs kube-node-drainer-asg-status-updater-f9f67c9c7-w7gwg -n kube-system
+ metadata dynamic/instance-identity/document
+ wget -O - -q http://169.254.169.254/2016-09-02/dynamic/instance-identity/document
+ jq -r .region
wget: error getting response
+ REGION=
+ [ -n  ]

The node-drainer pods also give me the same error message.

It does sound like a different problem, so let me know if a new issue is required. I have found https://github.com/coreos/bugs/issues/2331, which is kind of related.
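For reference, the failing region lookup from the log above would look roughly like this with curl swapped in (a sketch only, not the actual script):

# sketch: same lookup as in the log, using curl instead of busybox wget
REGION=$(curl -sf http://169.254.169.254/2016-09-02/dynamic/instance-identity/document | jq -r .region)
# the script presumably bails out when REGION is empty, which would explain the CrashLoopBackOff
[ -n "$REGION" ] || exit 1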

kiich commented 6 years ago

FYI, I just spun up a test container on the same worker node (c5.4xlarge) that has the failing node-drainer pod, and from that test container the same wget call that fails in the node-drainer works fine.

cknowles commented 6 years ago

@vaibhavsingh97 a quick way to resolve this is to change the wget calls in the controller config to use curl and then update your cluster with that. Longer term, I think we should change the default in kube-aws so it does not always pull master.
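The kind of change needed is roughly this (illustrative only; I'm not listing the exact places in the controller config here):

# before: busybox wget from the awscli image, intermittently returns 502
wget -O - -q http://169.254.169.254/2016-09-02/meta-data/instance-id
# after: roughly equivalent curl invocation
curl -s --fail http://169.254.169.254/2016-09-02/meta-data/instance-id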

vaibhavsingh97 commented 6 years ago

Thanks @c-knowles for pointing it out. I will make a PR 👍

cknowles commented 6 years ago

Looks like there is no good alternative tag - https://quay.io/repository/coreos/awscli?tag=latest&tab=tags. @mumoshu perhaps we should build an AWS CLI image ourselves, or use a different one from Docker Hub, so we can pin the version a bit better?

mumoshu commented 6 years ago

Yeah, let's build one ourselves

mumoshu commented 6 years ago

@c-knowles Just forked coreos/awscli to https://github.com/kube-aws/docker-awscli.

Would you mind sending a PR for switching to curl?

The docker repo is also available at https://hub.docker.com/r/kubeaws/awscli/ with automated build enabled.

cknowles commented 6 years ago

@mumoshu yeah ok, I will pick this item up now.

cknowles commented 6 years ago

@mumoshu PR done. I haven't swapped to the new image yet; I see there are some issues on the coreos repo about getting versions pinned.

vaibhavsingh97 commented 6 years ago

Oops!! Looks like I am late 😄. Any other beginner issue I can solve?

cknowles commented 6 years ago

@vaibhavsingh97 sorry! There are lots of good first issues; I'd suggest one of https://github.com/kubernetes-incubator/kube-aws/issues/950, https://github.com/kubernetes-incubator/kube-aws/issues/1085 or https://github.com/kubernetes-incubator/kube-aws/issues/1063.

mumoshu commented 6 years ago

@c-knowles Thanks for the PR! I'll take a look soon.

Regarding the awscli image pinning, I've just pushed kubeaws/awscli:0.9.0 via an automated build. I'd appreciate it if you could submit PRs to change the awscli image used by kube-aws to that one! @vaibhavsingh97 @c-knowles

// Btw, it works like this: as soon as a git tag is pushed, an automated build is triggered for the image tag with the same value as the git tag.
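For example, this is roughly how the 0.9.0 image above was produced (illustrative commands, run in a kube-aws/docker-awscli checkout):

# sketch: pushing a git tag triggers the Docker Hub automated build for kubeaws/awscli:<tag>
git tag 0.9.0
git push origin 0.9.0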

vaibhavsingh97 commented 6 years ago

@mumoshu I would be happy to submit a PR. Can you please point me to the relevant resources?