scholzj / terraform-aws-kubernetes

Terraform module for Kubernetes setup on AWS
Apache License 2.0

Nodes regularly failing after a day #13

Closed sarge closed 6 years ago

sarge commented 6 years ago

Hi there,

I understand this is not a bug with your implementation.

We are seeing an issue where nodes fail to start pods with the following error:

MountVolume.SetUp failed for volume "default-token-hrpk9" : mount failed: exit status 1
Mounting command: systemd-run
Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/f570149c-5462-11e8-a72b-02a9a02d1274/volumes/kubernetes.io~secret/default-token-hrpk9 --scope -- mount -t tmpfs tmpfs /var/lib/kubelet/pods/f570149c-5462-11e8-a72b-02a9a02d1274/volumes/kubernetes.io~secret/default-token-hrpk9
Output: Failed to start transient scope unit: Connection timed out
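As the log shows, the kubelet hands the tmpfs mount to systemd-run, so the timeout comes from systemd's D-Bus manager interface rather than from the mount itself. A minimal way to exercise the same path by hand (a sketch only; the description string is arbitrary):

# Ask systemd to run a trivial command in a transient scope, the same
# mechanism the kubelet uses above.
sudo systemd-run --scope --description="manual transient scope test" -- true

On a healthy node this returns immediately; on an affected node it should hang and hit the same "Failed to start transient scope unit" timeout.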

systemd has become unresponsive.

We have a few CronJobs running, which I suspect are causing systemd to eventually become unable to mount new secrets. Increasing the CronJob rate hasn't had much effect.
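For reference, even a throwaway CronJob like this (name and image are just placeholders) creates a pod every minute, and every pod start mounts its service-account token through a fresh transient scope:

# Hypothetical repro: one pod per minute, each start triggering a
# transient scope for the default-token secret mount.
kubectl create cronjob scope-churn --image=busybox --schedule="*/1 * * * *" -- /bin/true

So if the node's systemd leaks abandoned scopes, a handful of frequent CronJobs steadily accumulates them on whichever nodes the pods land on.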

The closest issue I have found is https://github.com/kubernetes/kubernetes/issues/57345, but the symptoms vary slightly: after remoting into the machine, systemd is completely unresponsive.

systemctl list-units --all | wc -l
Failed to list units: Connection timed out
0
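Since systemctl cannot reach the manager at all, one way to gauge a possible scope leak is to look at the filesystem directly (a sketch, assuming the standard location systemd uses for transient unit files):

# Count transient scope units left on disk; systemctl itself is
# unresponsive, so inspect /run/systemd/transient directly.
ls /run/systemd/transient/ | grep -c '\.scope$'

A count in the thousands would fit the abandoned-scope leak described in the linked issue.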

What I know

  1. A system reboot fixes the issue (a crude automated version of this is sketched below).
  2. Systemd is an older version and is unlikely to be updated anytime soon.
  3. Possibly not a Kubernetes issue.
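Until the leak itself is addressed, a crude self-healing hack is to reboot a node automatically once the manager stops answering. This is only a sketch: it assumes cron invokes the script every few minutes (cron keeps running while systemd's manager is wedged) and that a forced reboot is acceptable. It reuses the same list-units probe as above:

#!/bin/sh
# If listing units gets no answer within 10 seconds, assume the manager
# is hung. reboot -f calls reboot(2) directly instead of asking the
# (unresponsive) manager to do it.
if ! timeout 10 systemctl list-units --no-pager >/dev/null 2>&1; then
    /sbin/reboot -f
fi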

Any suggestions on where to hunt?

scholzj commented 6 years ago

Sorry, I have never seen this problem, so I'm afraid I cannot help with it :-(. In the past I saw a lot of strange issues when disk space ran out, but I'm not sure if that is your case.

sarge commented 6 years ago

Thanks Jakub, I appreciate the quick response.

artemyarulin commented 6 years ago

Just for the record: I got the same issue with managed k8s on GCP, also because of heavy use of CronJobs. I guess the only solution is to get systemd version 237, which has a fix.
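For anyone checking their nodes: systemctl --version does not go through the manager's D-Bus interface, so it still works even on a wedged node:

# Prints e.g. "systemd 232"; anything below 237 presumably still
# carries the abandoned-scope leak.
systemctl --version | head -n1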

Jason7602 commented 5 years ago

I encountered the same problem; rebooting the system is the fastest way to fix it.

eugenestarchenko commented 5 years ago

Same case here on Azure AKS; restarting all the nodes helped.
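For reference, if the cluster uses scale-set-backed node pools, AKS puts the nodes in a scale set inside the auto-generated MC_* resource group, so the restart can be scripted roughly like this (resource group and scale-set names are placeholders for your cluster):

# Restart every instance in the AKS node scale set.
az vmss restart --resource-group MC_myrg_mycluster_westeurope --name aks-nodepool1-vmss --instance-ids '*'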

KrustyHack commented 5 years ago

Same problem on GKE with COS nodes (systemd 232). Switching to Ubuntu nodes with systemd 237 to see if it solves the problem.
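A rough sketch of that migration (pool, cluster, and node-count values are just examples) is to add an Ubuntu pool alongside the COS one:

# Create a replacement node pool using the Ubuntu node image.
gcloud container node-pools create ubuntu-pool --cluster=my-cluster --image-type=UBUNTU --num-nodes=3

then drain the COS nodes with kubectl drain and delete the old pool once the workloads have moved over.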