kubernetes-retired / kube-aws

[EOL] A command-line tool to declaratively manage Kubernetes clusters on AWS
Apache License 2.0

kube-node-drainer crash-loops on m5 instances due to wget error #1369

Closed whereisaaron closed 5 years ago

whereisaaron commented 6 years ago

Using kube-aws 0.9.9 with k8s 1.8.13, kube-node-drainer works fine on various instance types. But if I add a node pool of m5.xlarge instances, the kube-node-drainer pods on those nodes go into a crash loop.

The logs of the crashed containers show that wget failed:

+ metadata meta-data/instance-id
+ wget -O - -q http://169.254.169.254/2016-09-02/meta-data/instance-id
wget: error getting response
+ INSTANCE_ID=

But if I log on to one of the m5.xlarge nodes and run the same command myself, there is no problem.

$ wget -O - -q http://169.254.169.254/2016-09-02/meta-data/instance-id
i-0497e12345e5c7592

Perhaps there is something about the Alpine-based awscli image that makes its wget fail on m5 instances? I am using this one:

awsCliImage:
  repo: quay.io/coreos/awscli
  tag: master
  rktPullDocker: false

https://github.com/coreos/awscli/blob/master/Dockerfile

FROM alpine:3.6
MAINTAINER colin.hom@coreos.com

RUN apk --no-cache --update add bash curl less groff jq python py-pip && \
  pip install --no-cache-dir --upgrade pip && \
  pip install --no-cache-dir awscli==1.11.167 s3cmd==2.0.0 https://s3.amazonaws.com/cloudformation-examples/aws-cfn-bootstrap-1.4-24.tar.gz && \
  mkdir /root/.aws && \
  aws --version && \
  s3cmd --version

ENTRYPOINT []
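Notably, that Dockerfile installs curl alongside the Alpine base image, so curl is already available in the container. A minimal sketch of a metadata helper that sidesteps the busybox wget by using curl instead (the `metadata` function name mirrors the drainer's log output above, but the helper itself is an illustrative assumption, not the actual kube-aws script; `METADATA_URL` is made overridable purely for local testing):

```shell
#!/bin/sh
# Sketch: query the EC2 instance metadata service with curl, which the
# quay.io/coreos/awscli image already ships, instead of busybox wget.
# The default URL matches the endpoint shown in the crash logs above.
METADATA_URL="${METADATA_URL:-http://169.254.169.254/2016-09-02}"

metadata() {
  # -s: silent, -f: fail on server errors (mirrors `wget -O - -q`)
  curl -sf "${METADATA_URL}/$1"
}

# On an EC2 node this would print the instance id, e.g.:
# metadata meta-data/instance-id
```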

whereisaaron commented 6 years ago

Could be related to known problems with the busybox imitation wget in Alpine 3.6. There are patches from a while back, but they appear not to have been merged. One person reported it is still broken with the busybox in Alpine 3.7; the busybox wget in Alpine 3.5 apparently does not have the problem.

https://github.com/gliderlabs/docker-alpine/issues/292
https://github.com/gliderlabs/docker-alpine/issues/344
https://git.busybox.net/busybox/commit/?id=a6f8651911716d1d1624712eb19e4f3608767c7e
https://github.com/apache/incubator-openwhisk/pull/3715

The suggested workaround is to install the full wget package: `apk add --no-cache wget`.
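That workaround could be baked into a derived image rather than run at container start. A sketch (the base image and tag are taken from the awsCliImage config above; the derived image name is up to you):

```dockerfile
# Sketch: derived awscli image that adds the full GNU wget package,
# working around the broken busybox wget in Alpine 3.6.
FROM quay.io/coreos/awscli:master
RUN apk add --no-cache wget
```

Pointing `awsCliImage` at the rebuilt image would then avoid the busybox wget entirely.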

fejta-bot commented 5 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot commented 5 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle rotten

fejta-bot commented 5 years ago

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /close

k8s-ci-robot commented 5 years ago

@fejta-bot: Closing this issue.

In response to [this](https://github.com/kubernetes-incubator/kube-aws/issues/1369#issuecomment-504723367):

>Rotten issues close after 30d of inactivity.
>Reopen the issue with `/reopen`.
>Mark the issue as fresh with `/remove-lifecycle rotten`.
>
>Send feedback to sig-testing, kubernetes/test-infra and/or [fejta](https://github.com/fejta).
>/close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.