kubernetes-retired / kube-aws

[EOL] A command-line tool to declaratively manage Kubernetes clusters on AWS
Apache License 2.0

Kube node drainer service fails to start. #40

Closed: camilb closed this issue 7 years ago

camilb commented 7 years ago

I think the line Restart=on-failure from kube-node-drainer.service should be removed.

Getting these errors:

    Failed to restart kubelet.service: Unit kube-node-drainer.service is not loaded properly: Invalid argument.
    kube-node-drainer.service: Service has Restart= setting other than no, which isn't allowed for Type=oneshot services. Refusing.
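
Concretely, something like this (just a sketch of the change; everything else in the unit stays as it is today):

    # kube-node-drainer.service, [Service] section (excerpt)
    [Service]
    Type=oneshot
    # Restart=on-failure   <- drop this line: systemd refuses any Restart=
    #                         value other than "no" for Type=oneshot units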

camilb commented 7 years ago

#41

pieterlange commented 7 years ago

Easy mistake to slip in.

I think I've spotted another one:

        ExecStop=/bin/sh -c '/usr/bin/docker run --rm -v /etc/kubernetes:/etc/kubernetes {{.HyperkubeImageRepo}}:{{.K8sVer}} \
          /hyperkube kubectl \
          --server=https://{{.ExternalDNSName}}:443 \
          --kubeconfig=/etc/kubernetes/worker-kubeconfig.yaml \
          drain $$(hostname) \
          --ignore-daemonsets \
          --force'

Unless I'm missing something here, the double $$ seems to be an error. Is this some escaping trick? I think we should just be able to run $(hostname), right?

camilb commented 7 years ago

@pieterlange I see that it's working both with the double $$ and with a single $.
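
If I understand the systemd escaping correctly (just my reading of the docs, not authoritative): $$ is the documented way to pass a literal $ through to the shell, and a bare $( is not a valid systemd variable reference, which would explain why both forms appear to work here:

    # With $$, systemd hands a literal "$" to /bin/sh, so the shell itself
    # performs the $(hostname) command substitution:
    ExecStop=/bin/sh -c '... drain $$(hostname) ...'
    # systemd only expands $VAR / ${VAR}; "$(" is not a valid variable
    # reference, so a single $ also reaches the shell unchanged here.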

camilb commented 7 years ago

Related to this, but it looks like it's affecting other services too: using ExecStartPre=/usr/bin/systemctl is-active kubelet.service will check whether the service is active and then exit (code=exited, status=3). This being a oneshot service, it will not be restarted. I also observed this on install-calico-system.service and install-kube-system.service. Using ExecStartPre=/usr/bin/systemctl is-active service.name generates a lot of errors before kubelet.service becomes active.

In this case I think it's better to use something like:

After=multi-user.target

[Install]
WantedBy=node-drain.target

This way we make sure that all the services are running before we start this one.

Here is a proposal:

     [Unit]
     Description=drain this k8s node to make running pods time to gracefully shut down before stopping kubelet
     After=multi-user.target
     Wants=decrypt-tls-assets.service kubelet.service docker.service

     [Service]
     Type=oneshot
     RemainAfterExit=true
     ExecStart=/bin/sh -c '/usr/bin/docker run --rm -v /etc/kubernetes:/etc/kubernetes {{.HyperkubeImageRepo}}:{{.K8sVer}} \
       /hyperkube kubectl \
       --server=https://{{.ExternalDNSName}}:443 \
       --kubeconfig=/etc/kubernetes/worker-kubeconfig.yaml \
       uncordon $(hostname)'
     ExecStop=/bin/sh -c '/usr/bin/docker run --rm -v /etc/kubernetes:/etc/kubernetes {{.HyperkubeImageRepo}}:{{.K8sVer}} \
       /hyperkube kubectl \
       --server=https://{{.ExternalDNSName}}:443 \
       --kubeconfig=/etc/kubernetes/worker-kubeconfig.yaml \
       drain $(hostname) \
       --ignore-daemonsets \
       --force'

     [Install]
     WantedBy=node-drain.target

mumoshu commented 7 years ago

@pieterlange replied to you in https://github.com/coreos/kube-aws/pull/41#issuecomment-259003781 about $$

mumoshu commented 7 years ago

@camilb Thanks for your feedback and proposal!

I don't intend to just stick with the is-active method, but anyway it seems we have three issues here:

  1. a lot of errors before starting kubelet, flanneld, etc.
  2. the node-drainer isn't coming up at all(!)
  3. un-cordoning nodes

For 1, I believe we can use RestartSec as used in https://github.com/coreos/coreos-baremetal/blob/master/examples/ignition/bootkube-controller.yaml#L66 to alleviate the issue. Should we start with RestartSec=10?
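
Something like this is what I have in mind (a sketch only, and it assumes the unit doing the is-active check is not Type=oneshot, since, as the error above shows, oneshot units only accept Restart=no):

    [Service]
    # Fail fast until kubelet is up; Restart=on-failure retries the whole unit.
    ExecStartPre=/usr/bin/systemctl is-active kubelet.service
    Restart=on-failure
    # Wait 10 seconds between retries instead of flooding the journal with errors.
    RestartSec=10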

For 2, part of the issue is resolved thanks to your PR #41. To tackle the other part, I've come to believe your proposal, which uses After=multi-user.target to control the service startup order, is the only way to go!

For 3, though I'm rather looking forward to it, I'm not familiar with its use case!

mumoshu commented 7 years ago

Added to the known-issues list https://github.com/coreos/kube-aws/releases/tag/v0.9.1-rc.1

pieterlange commented 7 years ago

Uncordoning will be necessary when a node is restarted due to manual operator intervention rather than a rolling upgrade. Not sure how often we'd see that in practice (who reboots nodes in an ASG?), but it might cause some 'funny' side effects.

camilb commented 7 years ago

@mumoshu

  1. I will do some testing today with RestartSec and also try setting an explicit ordering for the services to see which is more effective.
  2. I'm already using After=multi-user.target and it works fine.
  3. We could stick with ExecStart=/bin/true in this case. @pieterlange I actually had to reboot the nodes several times on a staging cluster due to the Docker 10.3 problems on overload. I know that you can just lose the node in an ASG, but rebooting is faster.

mumoshu commented 7 years ago

@camilb Thanks for your cooperation here.

Just a quick reply to 1: FYI, we've already been hit by the 51,200-byte CloudFormation template size limit in the master branch. If you are going to start testing on top of it, I'd encourage you to try https://github.com/coreos/kube-aws/pull/45 as the base branch for testing!

camilb commented 7 years ago

@mumoshu Thanks, I will try it. I've hit the limit several times and was working around it with kube-aws up --export plus S3.
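
Roughly what that workaround looks like (a sketch; the bucket, file, and stack names are placeholders, and the stack will likely need the usual IAM capability flag):

    # Render the CloudFormation template locally instead of submitting it directly
    kube-aws up --export
    # Upload it to S3 so it isn't subject to the 51,200-byte template-body limit
    aws s3 cp ./stack-template.json s3://my-bucket/kube-aws/stack-template.json
    # Update the stack from the S3 URL
    aws cloudformation update-stack \
      --stack-name my-cluster \
      --template-url https://s3.amazonaws.com/my-bucket/kube-aws/stack-template.json \
      --capabilities CAPABILITY_IAM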

camilb commented 7 years ago

@mumoshu Finished testing.

  1. kube-aws up --s3-uri s3://my_bucket/kube-aws works fine
  2. kube-aws update --s3-uri s3://my_bucket/kube-aws always shows this error:

    Error: Error updating cluster: error updating cloudformation stack: ValidationError: Template    format error: unsupported structure.
    status code: 400, request id: 0d04f84f-a607-11e6-b3c5-2548ad1d19cd

But it works fine with aws cloudformation update-stack.

  3. Regarding kube-node-drainer.service

docker.service is stopped at the same time as kube-node-drainer.service, so it worked less than 10% of the time. I tried several configurations like WantedBy= or RequiredBy=poweroff.target reboot.target halt.target, etc. Then I switched to rkt, and after a few attempts I found a solution that works fine with rolling updates, manual shutdown, reboot, etc. (see the unit below). I tested it several times and it didn't fail.

- name: kube-node-drainer.service
  enable: true
  command: start
  runtime: true
  content: |
    [Unit]
    Description=drain this k8s node to make running pods time to gracefully shut down before stopping kubelet
    After=multi-user.target

    [Service]
    Type=oneshot
    RemainAfterExit=true
    ExecStart=/bin/true
    TimeoutStopSec=30s
    ExecStop=/bin/sh -c '/usr/bin/rkt run \
    --volume=kube,kind=host,source=/etc/kubernetes,readOnly=true \
    --mount=volume=kube,target=/etc/kubernetes \
    --net=host \
    quay.io/coreos/hyperkube:v1.4.5_coreos.0 \
      --exec=/kubectl -- \
      --server=https://{{.ExternalDNSName}}:443 \
      --kubeconfig=/etc/kubernetes/worker-kubeconfig.yaml \
      drain $(hostname) \
      --ignore-daemonsets \
      --force'

    [Install]
    WantedBy=multi-user.target
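
A quick way to sanity-check the stop path by hand (since the unit is RemainAfterExit=true, stopping it runs the ExecStop drain):

    # On the node:
    sudo systemctl stop kube-node-drainer.service
    # From a machine with cluster access, the node should now show SchedulingDisabled:
    kubectl get nodes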

mumoshu commented 7 years ago

@camilb

mumoshu commented 7 years ago

FYI, I have just merged #48 which, in combination with #41, fixes what is stated in the title of this issue.

mumoshu commented 7 years ago

Revisiting & thinking about @pieterlange's comment at https://github.com/coreos/kube-aws/issues/40#issuecomment-259049021 and @camilb's comment at https://github.com/coreos/kube-aws/issues/40#issuecomment-259056123.

In addition to what you've mentioned, would the uncordon feature plus the node drainer allow us to automatically upgrade the CoreOS version, which implies automatic rebooting, hopefully without downtime? If that is the case, since there are several possible use cases already, I think it would be nice to keep discussing it in another issue.

camilb commented 7 years ago

@mumoshu I will close this for now. It's still not perfect, but it works better now. The request to drain the node is properly sent and the pods are started on other nodes. The problem is that the containers are killed quickly on the drained node, and for pods that use bigger images or need a longer time to start/stop, there isn't enough time for them to be started on other nodes. I'm looking for a good method to delay stopping some services on shutdown or reboot; I saw some examples from Red Hat and want to test them. I will open another issue with a proposal for improvements soon.
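
One pattern I want to test (just a sketch, not something I've verified yet; the drain script path is a placeholder for the actual drain command from the unit above): since systemd stops units in the reverse of their start order, adding After=docker.service kubelet.service should make the drainer's ExecStop run before those services are shut down, and TimeoutStopSec bounds how long the drain can delay shutdown:

    [Unit]
    # Units stop in reverse start order, so ExecStop here runs before
    # docker.service and kubelet.service are stopped at shutdown/reboot.
    After=multi-user.target docker.service kubelet.service

    [Service]
    Type=oneshot
    RemainAfterExit=true
    ExecStart=/bin/true
    # Upper bound on how long the drain may delay shutdown.
    TimeoutStopSec=120s
    # Placeholder: substitute the real drain command used above.
    ExecStop=/opt/bin/drain-this-node.sh

    [Install]
    WantedBy=multi-user.target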