metral / corekube

CoreOS + Kubernetes + OpenStack - The simplest way to deploy a POC Kubernetes cluster using a Heat template
Apache License 2.0

Overlord doesn't finish completion due to failed unit in fleet #3

Closed: metral closed this issue 9 years ago

metral commented 9 years ago

Overlord waits for a node to finish running its service, but it sits there waiting indefinitely and never continues.

The reason is that one or more units failed on a node, as shown by running fleetctl list-units.
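To spot the failed units quickly, the list can be filtered. The following is only a rough sketch: failed_units is a hypothetical helper name, and it assumes that fleetctl list-units prints a header line followed by one whitespace-separated row per unit with the unit name in the first column. It flags any row where a later column reads exactly "failed", so it does not depend on the exact column order.

```shell
# Hypothetical helper (not part of corekube or fleet).
# Reads `fleetctl list-units` output on stdin and prints the names of
# units that have a column equal to "failed".
# Assumption: line 1 is a header and column 1 is the unit name.
failed_units() {
    awk 'NR > 1 {
        for (i = 2; i <= NF; i++) {
            if ($i == "failed") { print $1; break }
        }
    }'
}

# Intended use on a machine that can reach the cluster:
#   fleetctl list-units | failed_units
```

An empty result would suggest the hang has a different cause than the one described here.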

When you log into the failed node and run fleetctl status <unit_name>, it reports an error because the unit could not pull its binaries from the Internet. This happens because systemd-networkd was restarted too many times, so systemd gave up trying to restart the service, as shown by systemctl status systemd-networkd. The underlying problem is that each of the multiple network devices being set up requires a restart of systemd-networkd, and together they restart it more often than systemd tolerates.

This issue happens intermittently, not on every deployment.

The simple workaround, unfortunately, is either to destroy the stack altogether via Heat and create a new one, or to restart the systemd-networkd unit along with any other units on the failed nodes, and then watch the overlord's logs to make sure it completes.
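The restart variant of the workaround could be scripted roughly as below and run on each failed node. This is a sketch under assumptions, not a tested fix: recover_node and the SYSTEMCTL override are hypothetical names introduced here, and the reset-failed step is an added guess that the failed state needs clearing before systemd will start the unit again.

```shell
# Hypothetical helper (not part of corekube). Run on a failed node.
# Usage: recover_node <unit_name>...
# Set SYSTEMCTL="sudo systemctl" if root is needed, or SYSTEMCTL=echo
# for a dry run that only prints the commands' arguments.
recover_node() {
    ctl="${SYSTEMCTL:-systemctl}"

    # Assumption: clear the failed/rate-limited state first so that
    # systemd accepts a new start request for systemd-networkd.
    $ctl reset-failed systemd-networkd || return 1
    $ctl restart systemd-networkd || return 1

    # Restart every unit named on the command line.
    for unit in "$@"; do
        $ctl restart "$unit" || return 1
    done
}
```

After running it with the failed unit names as arguments, the overlord's logs still need to be checked by hand to confirm that it moved on.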

metral commented 9 years ago

Closing this issue: the manual, hard-coded networking of all network devices and configs for the overlay has been replaced by CoreOS Flannel, so this problem no longer exists thanks to commit https://github.com/metral/corekube/commit/2ab5885d6925eedc5c08593ef8904c48b7d0f893