kubernetes-retired / contrib

[EOL] This is a place for various components in the Kubernetes ecosystem that aren't part of the Kubernetes core.
Apache License 2.0
2.46k stars 1.68k forks source link

restart policy should be 'always' #2990

Closed llamahunter closed 5 years ago

llamahunter commented 5 years ago

The 'on-failure' restart policy means that if the kublet exits 'normally', it will not be restarted. Problem is that systemd considers processes that are killed due to unhandled SIGPIPE as 'clean exits'. See the 'Restart=' section of https://www.freedesktop.org/software/systemd/man/systemd.service.html, in particular: "clean exit means an exit code of 0, or one of the signals SIGHUP, SIGINT, SIGTERM or SIGPIPE" "If set to on-failure, the service will be restarted when the process .. is terminated by a signal (including on core dump, but excluding the aforementioned four signals)" I saw this happen on my own cluster. Kubelet got a SIGPIPE somehow, and exited, but because the systemd restart policy was 'on-failure', it didn't restart.

k8s-ci-robot commented 5 years ago

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: llamahunter To fully approve this pull request, please assign additional approvers. We suggest the following additional approver: mikedanese

If they are not already assigned, you can assign the PR to them by writing /assign @mikedanese in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files: - **[init/OWNERS](https://github.com/kubernetes/contrib/blob/master/init/OWNERS)** Approvers can indicate their approval by writing `/approve` in a comment Approvers can cancel approval by writing `/approve cancel` in a comment
llamahunter commented 5 years ago

Here was the state I found it:

[ec2-user@ip-10-243-243-111 ~]$ systemctl status kubelet
● kubelet.service - Kubernetes Kubelet
   Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/kubelet.service.d
           └─10-kubelet-args.conf
   Active: inactive (dead) since Mon 2019-01-07 22:20:37 UTC; 2h 21min ago
     Docs: https://github.com/kubernetes/kubernetes
  Process: 3657 ExecStart=/usr/bin/kubelet --cloud-provider aws --config /etc/kubernetes/kubelet/kubelet-config.json --allow-privileged=true --kubeconfig /var/lib/kubelet/kubeconfig --container-runtime docker --network-plugin cni $KUBELET_ARGS $KUBELET_EXTRA_ARGS (code=killed, signal=PIPE)
  Process: 3638 ExecStartPre=/sbin/iptables -P FORWARD ACCEPT (code=exited, status=0/SUCCESS)
 Main PID: 3657 (code=killed, signal=PIPE)
    Tasks: 0
   Memory: 5.3M
   CGroup: /system.slice/kubelet.service

Killed and not restarting.

rootfs commented 5 years ago

what caused SIGPIPE in your setup? I suggest we fix that problem instead of changing restart mode (which may cause premature restarts if that error persists)

llamahunter commented 5 years ago

The SIGPIPE is caused by journald restarting. Note that a similar fix was made recently to the AWS EKS AMI for running kubernetes in EC2. They chose to treat SIGPIPE as a failure, but still leave the restart policy as on-failure. I guess it's less of a delta, but I can't figure out why you would want kubelet to not be running if you've told systemd that you want it running. Seems to me that the restart policy should be 'always'.

llamahunter commented 5 years ago

fwiw, here's the AWS AMI change. https://github.com/awslabs/amazon-eks-ami/commit/e13d401e531e10f1a66a62d7b54ce109f2375859

llamahunter commented 5 years ago

See also https://github.com/awslabs/amazon-eks-ami/pull/110

llamahunter commented 5 years ago

and https://github.com/awslabs/amazon-eks-ami/issues/122

travcunn commented 5 years ago

@llamahunter I think the restart policy should look like this:

Restart=on-failure
RestartForceExitStatus=SIGPIPE

In my opinion, this is better because if a user runs SIGKILL against the process, the user intends to kill the process and they won't want systemd to restart it. Here is the final change that made its way into the amazon-eks-ami: https://github.com/awslabs/amazon-eks-ami/pull/137

llamahunter commented 5 years ago

Regardless, the current systemd unit file needs to change.