rancher / k3os

Purpose-built OS for Kubernetes, fully managed by Kubernetes.
https://k3os.io
Apache License 2.0

K3OS single node cluster broken on-its-own #718

Open kallex opened 3 years ago

kallex commented 3 years ago

I'm currently running K3OS as a single-node cluster on Hyper-V: kubehost and one master node, on a default installation. This setup has been running without issues for quite a while.

Today, after a normal reboot required by Windows, the cluster suddenly failed to come up.

The errors I can dig out are:

My usually working port-forward now fails with:

error: error upgrading connection: error dialing backend: x509: certificate is valid for kubehost, localhost, not k3os-12063
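
One way to confirm the mismatch is to dump the Subject Alternative Names of the certificate the kubelet on that node presents. This is a generic sketch (the default kubelet port 10250 and the <node-ip> placeholder are assumptions about the setup):

  # Print the SANs of the serving certificate on the kubelet port.
  # <node-ip> is a placeholder for the affected node's address.
  echo | openssl s_client -connect <node-ip>:10250 2>/dev/null \
    | openssl x509 -noout -text \
    | grep -A1 "Subject Alternative Name"

If the SAN list only contains kubehost and localhost, the node has registered under a name the certificate was never issued for.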

kubectl get nodes =>

NAME         STATUS     ROLES    AGE    VERSION
k3os-12063   NotReady   master   493d   v1.19.5+k3s2
kubehost     Ready      master   40m    v1.19.5+k3s2

I cannot find any way to check what's failing in the server's "embedded" agent.
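
For what it's worth, a couple of generic ways to poke at this from the working kubectl context, plus a hedged guess at where the logs live on the k3os host (the service name and log path are assumptions and may differ between k3os versions):

  # Kubernetes-side view of the broken node and recent events:
  kubectl describe node k3os-12063
  kubectl get events --all-namespaces --sort-by=.metadata.creationTimestamp

  # On the k3os host itself (service name and log path assumed):
  sudo rc-service k3s-service status
  sudo tail -n 200 /var/log/k3s-service.log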

I tried running a manual upgrade on the cluster to see if it would help. I could get the Kubernetes Dashboard to run at that point (now it's erroring too), but the agent still stayed NotReady.

If someone could point me in the right direction, that would be great.

kallex commented 3 years ago

I removed the NotReady master node and the dashboard works now (the remaining kubehost master seems to behave better).

Now I'm trying to find the proper documentation on how to (re-)attach the installed agent node, which runs on the same host as the master.
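
At the plain k3s level, the usual pattern is to read the server's join token and point the agent at the server via the k3os config; the sketch below uses the documented k3os config keys, but the hostname, URL, and token values are placeholders for this setup:

  # On the server, the join token is at the standard k3s location:
  #   sudo cat /var/lib/rancher/k3s/server/node-token
  #
  # On the agent, /k3os/system/config.yaml (values are placeholders):
  hostname: agent-node
  k3os:
    server_url: https://kubehost:6443
    token: <contents of node-token>

followed by a reboot so the boot-time config is applied.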

kz159 commented 3 years ago

How did you remove the NotReady node? Mine is stuck with:

Conditions:
  Type                 Status    LastHeartbeatTime                 LastTransitionTime                Reason              Message
  ----                 ------    -----------------                 ------------------                ------              -------
  NetworkUnavailable   False     Fri, 18 Jun 2021 22:45:02 +0000   Fri, 18 Jun 2021 22:45:02 +0000   FlannelIsUp         Flannel is running on this node
  MemoryPressure       Unknown   Fri, 18 Jun 2021 23:00:44 +0000   Fri, 18 Jun 2021 23:04:30 +0000   NodeStatusUnknown   Kubelet stopped posting node status.
  DiskPressure         Unknown   Fri, 18 Jun 2021 23:00:44 +0000   Fri, 18 Jun 2021 23:04:30 +0000   NodeStatusUnknown   Kubelet stopped posting node status.
  PIDPressure          Unknown   Fri, 18 Jun 2021 23:00:44 +0000   Fri, 18 Jun 2021 23:04:30 +0000   NodeStatusUnknown   Kubelet stopped posting node status.
  Ready                Unknown   Fri, 18 Jun 2021 23:00:44 +0000   Fri, 18 Jun 2021 23:04:30 +0000   NodeStatusUnknown   Kubelet stopped posting node status.

kallex commented 3 years ago

NOTE! It might not improve things; it might make them worse! But I don't have anything important on my cluster, so I'm more or less experimenting. I also have data on a PV, which I expect to survive the node removal (but it might not, because it was a master node).

I was preparing to reinstall the node (hoping that reinstalling would bring the workloads back onto it), so I just deleted it.

kubectl delete node node-name-here
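
If the node is still reachable, draining it first is the gentler path before deleting; this is standard kubectl, shown as a sketch (on this cluster's v1.19 kubectl the flag is --delete-local-data; newer releases renamed it to --delete-emptydir-data):

  kubectl drain node-name-here --ignore-daemonsets --delete-local-data
  kubectl delete node node-name-here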

thanhtoan1196 commented 3 years ago

same here

[screenshot attached]

dweomer commented 3 years ago

I haven't worked much with Hyper-V, but it looks as if k3OS failed to detect a hostname override via config (assuming that you have something like hostname: kubehost in your /k3os/system/config.yaml).
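
If that is the cause, pinning the hostname in the boot config should keep the node name stable across reboots; a minimal sketch of /k3os/system/config.yaml (the value kubehost is just this thread's example):

  # /k3os/system/config.yaml
  hostname: kubehost

and then reboot so the config is applied at boot.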

patrik-upspot commented 2 years ago

I have the same problem when I restart my netcup VServer (https://www.netcup.de/vserver/vps.php -> VPS 4000 G9). I have a ticket -> #734. After running "sudo k3os config" I can delete the wrong node, and the old one comes up correctly.

rickard-von-essen commented 2 years ago

I can confirm this too.

After the first reboot the hostname changes from k3os to k3os-21898, and K8s thinks there are two master nodes, one unavailable. I didn't have any hostname specified in my config.yaml.

rickard-von-essen commented 2 years ago

The hostname seems to come from this line: boot#L132
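
(For context, that line appears to fall back to a generated name when no hostname is configured. The snippet below is a paraphrased sketch of that kind of fallback, not the verbatim boot script:)

  # Paraphrased sketch: if no hostname has been configured, derive one as
  # k3os-<random 16-bit number>, matching names like k3os-12063 and k3os-21898.
  if [ -z "$(hostname)" ] || [ "$(hostname)" = "localhost" ]; then
      SUFFIX=$(head -c 2 /dev/urandom | od -An -tu2 | tr -d ' ')
      hostname "k3os-${SUFFIX}"
  fi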