I suspect there's something unique to your environment or configuration; all the core Kubernetes controllers are lease-locked and should migrate over to a new node within a minute or so of a server node being stopped. Can you attach the logs from all three servers, as well as the output of kubectl get lease -A when one of the servers is stopped?
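For reference, a minimal way to watch the handover yourself (a sketch; the lease names below are the standard ones for the core control-plane components):
kubectl -n kube-system get lease kube-controller-manager -o jsonpath='{.spec.holderIdentity}{"\n"}'
kubectl -n kube-system get lease kube-scheduler -o jsonpath='{.spec.holderIdentity}{"\n"}'
After a server is stopped, the holderIdentity should switch to one of the surviving servers once the old lease expires.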
Well, the issue is solved. It was indeed configuration-specific.
All 3 masters running:
➜ kubernetes-test kubectl get lease -A
NAMESPACE         NAME                                             HOLDER                                          AGE
kube-node-lease   master-1                                         master-1                                        69m
kube-node-lease   master-2                                         master-2                                        68m
kube-node-lease   master-3                                         master-3                                        67m
kube-node-lease   node-1                                           node-1                                          35m
kube-node-lease   node-2                                           node-2                                          34m
kube-system       kube-controller-manager                          master-1_af86ce8f-1741-431b-a149-520d665b3553   69m
kube-system       kube-scheduler                                   master-1_386799f4-ff44-46f2-8758-3af88284eff7   69m
longhorn-system   driver-longhorn-io                               csi-provisioner-669c8cc698-pbwjg                104s
longhorn-system   external-attacher-leader-driver-longhorn-io      csi-attacher-75588bff58-tj4ls                   104s
longhorn-system   external-resizer-driver-longhorn-io              csi-resizer-5c88bfd4cf-q4lpk                    104s
longhorn-system   external-snapshotter-leader-driver-longhorn-io   csi-snapshotter-69f8bc8dcf-qk9ss                103s
longhorn-system   longhorn-manager-upgrade-lock                                                                    2m35s
master-3 down, the rest is up:
➜ kubernetes-test kubectl get nodes
NAME       STATUS     ROLES                       AGE   VERSION
master-1   Ready      control-plane,etcd,master   86m   v1.21.6+k3s1
master-2   Ready      control-plane,etcd,master   84m   v1.21.6+k3s1
master-3   NotReady   control-plane,etcd,master   84m   v1.21.6+k3s1
node-1     Ready      <none>                      52m   v1.21.6+k3s1
node-2     Ready      <none>                      51m   v1.21.6+k3s1
➜ kubernetes-test kubectl get lease -A
NAMESPACE         NAME                                             HOLDER                                          AGE
kube-node-lease   master-1                                         master-1                                        86m
kube-node-lease   master-2                                         master-2                                        85m
kube-node-lease   master-3                                         master-3                                        84m
kube-node-lease   node-1                                           node-1                                          52m
kube-node-lease   node-2                                           node-2                                          51m
kube-system       kube-controller-manager                          master-1_af86ce8f-1741-431b-a149-520d665b3553   86m
kube-system       kube-scheduler                                   master-1_386799f4-ff44-46f2-8758-3af88284eff7   86m
longhorn-system   driver-longhorn-io                               csi-provisioner-669c8cc698-pbwjg                18m
longhorn-system   external-attacher-leader-driver-longhorn-io      csi-attacher-75588bff58-tj4ls                   18m
longhorn-system   external-resizer-driver-longhorn-io              csi-resizer-5c88bfd4cf-q4lpk                    18m
longhorn-system   external-snapshotter-leader-driver-longhorn-io   csi-snapshotter-69f8bc8dcf-4m6n6                18m
longhorn-system   longhorn-manager-upgrade-lock                                                                    19m
master-2 down, the rest is up:
➜ kubernetes-test kubectl get nodes
NAME       STATUS     ROLES                       AGE   VERSION
master-1   Ready      control-plane,etcd,master   95m   v1.21.6+k3s1
master-2   NotReady   control-plane,etcd,master   94m   v1.21.6+k3s1
master-3   Ready      control-plane,etcd,master   93m   v1.21.6+k3s1
node-1     Ready      <none>                      61m   v1.21.6+k3s1
node-2     Ready      <none>                      60m   v1.21.6+k3s1
➜ kubernetes-test kubectl get lease -A
NAMESPACE         NAME                                             HOLDER                                          AGE
kube-node-lease   master-1                                         master-1                                        95m
kube-node-lease   master-2                                         master-2                                        94m
kube-node-lease   master-3                                         master-3                                        93m
kube-node-lease   node-1                                           node-1                                          61m
kube-node-lease   node-2                                           node-2                                          60m
kube-system       kube-controller-manager                          master-1_af86ce8f-1741-431b-a149-520d665b3553   95m
kube-system       kube-scheduler                                   master-1_386799f4-ff44-46f2-8758-3af88284eff7   95m
longhorn-system   driver-longhorn-io                               csi-provisioner-669c8cc698-pbwjg                27m
longhorn-system   external-attacher-leader-driver-longhorn-io      csi-attacher-75588bff58-tj4ls                   27m
longhorn-system   external-resizer-driver-longhorn-io              csi-resizer-5c88bfd4cf-q4lpk                    27m
longhorn-system   external-snapshotter-leader-driver-longhorn-io   csi-snapshotter-69f8bc8dcf-4m6n6                27m
longhorn-system   longhorn-manager-upgrade-lock
master-1 down (it was problematic before), the rest is up:
➜ kubernetes-test kubectl get nodes
NAME       STATUS     ROLES                       AGE    VERSION
master-1   NotReady   control-plane,etcd,master   101m   v1.21.6+k3s1
master-2   Ready      control-plane,etcd,master   100m   v1.21.6+k3s1
master-3   Ready      control-plane,etcd,master   100m   v1.21.6+k3s1
node-1     Ready      <none>                      67m    v1.21.6+k3s1
node-2     Ready      <none>                      67m    v1.21.6+k3s1
➜ kubernetes-test kubectl get lease -A
NAMESPACE         NAME                                             HOLDER                                          AGE
kube-node-lease   master-1                                         master-1                                        102m
kube-node-lease   master-2                                         master-2                                        100m
kube-node-lease   master-3                                         master-3                                        100m
kube-node-lease   node-1                                           node-1                                          67m
kube-node-lease   node-2                                           node-2                                          67m
kube-system       kube-controller-manager                          master-3_e427a3c2-c5b1-486c-89c2-66e00623b104   101m
kube-system       kube-scheduler                                   master-3_972c5363-3afd-4ea6-8ec0-ca99694f499c   101m
longhorn-system   driver-longhorn-io                               csi-provisioner-669c8cc698-pbwjg                33m
longhorn-system   external-attacher-leader-driver-longhorn-io      csi-attacher-75588bff58-tj4ls                   33m
longhorn-system   external-resizer-driver-longhorn-io              csi-resizer-5c88bfd4cf-q4lpk                    33m
longhorn-system   external-snapshotter-leader-driver-longhorn-io   csi-snapshotter-69f8bc8dcf-pgzrb                33m
longhorn-system   longhorn-manager-upgrade-lock                                                                    34m
It means you were right and everything works as expected; the issue was configuration-specific. For everyone hitting this problem, I'm posting how to correctly set up k3s on Hetzner Cloud. My previous setup was incorrect.
hcloud context create kubernetes-test
hcloud network create --name network-kubernetes --ip-range 10.0.0.0/16
hcloud network add-subnet network-kubernetes --network-zone eu-central --type server --ip-range 10.0.0.0/16
# Attach all your machines to this network now
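# For example (a sketch; server names and private IPs are placeholders,
# repeat for every master and worker):
hcloud server attach-to-network master-1 --network network-kubernetes --ip 10.0.0.2
hcloud server attach-to-network master-2 --network network-kubernetes --ip 10.0.0.3
hcloud server attach-to-network master-3 --network network-kubernetes --ip 10.0.0.4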
cat <<-EOF > /tmp/firewall
[
  {
    "direction": "in",
    "protocol": "tcp",
    "port": "22",
    "source_ips": ["0.0.0.0/0", "::/0"],
    "destination_ips": []
  },
  {
    "direction": "in",
    "protocol": "icmp",
    "port": null,
    "source_ips": ["0.0.0.0/0", "::/0"],
    "destination_ips": []
  },
  {
    "direction": "in",
    "protocol": "tcp",
    "port": "6443",
    "source_ips": ["0.0.0.0/0", "::/0"],
    "destination_ips": []
  },
  {
    "direction": "in",
    "protocol": "tcp",
    "port": "any",
    "source_ips": ["10.0.0.0/16"],
    "destination_ips": []
  },
  {
    "direction": "in",
    "protocol": "udp",
    "port": "any",
    "source_ips": ["10.0.0.0/16"],
    "destination_ips": []
  }
]
EOF
hcloud firewall create --name firewall-kubernetes --rules-file /tmp/firewall
# Attach firewall to all your machines by some label selector now
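# For example (a sketch; assumes you've put a common label such as
# cluster=kubernetes-test on all machines beforehand):
hcloud firewall apply-to-resource firewall-kubernetes --type label_selector --label-selector cluster=kubernetes-test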
hcloud load-balancer create --type lb11 --location nbg1 --name lb-kubernetes-api
hcloud load-balancer attach-to-network --network network-kubernetes --ip 10.0.0.10 lb-kubernetes-api
hcloud load-balancer add-target lb-kubernetes-api --label-selector role=master --use-private-ip
hcloud load-balancer add-service lb-kubernetes-api --protocol tcp --listen-port 6443 --destination-port 6443
# Add label role=master to all VPS hosting your masters
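# For example (a sketch; repeat for each master VPS):
hcloud server add-label master-1 role=master
hcloud server add-label master-2 role=master
hcloud server add-label master-3 role=master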
# On the first master:
apt update && apt upgrade -y && apt install apparmor apparmor-utils -y
# Configure variables first
export K3S_TOKEN="[secret]"
export K3S_VERSION="v1.21.6+k3s1"
export LB_EXTERNAL_IP="1.2.3.4" # adjust
export LB_INTERNAL_IP="10.0.0.10" # adjust
# Install k3s on the first master (initializes the cluster)
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=$K3S_VERSION K3S_TOKEN=$K3S_TOKEN sh -s - server \
--cluster-init \
--disable-cloud-controller \
--disable metrics-server \
--write-kubeconfig-mode=644 \
--node-ip=$(hostname -I | awk '{print $2}') \
--node-external-ip=$(hostname -I | awk '{print $1}') \
--node-name="$(hostname -f)" \
--cluster-cidr="10.244.0.0/16" \
--etcd-expose-metrics=true \
--kube-controller-manager-arg="address=0.0.0.0" \
--kube-controller-manager-arg="bind-address=0.0.0.0" \
--kube-proxy-arg="metrics-bind-address=0.0.0.0" \
--kube-scheduler-arg="address=0.0.0.0" \
--kube-scheduler-arg="bind-address=0.0.0.0" \
--kubelet-arg="cloud-provider=external" \
--node-taint CriticalAddonsOnly=true:NoExecute \
--flannel-iface=ens10 \
--tls-san="$(hostname -I | awk '{print $1}')" \
--tls-san="$(hostname -I | awk '{print $2}')" \
--tls-san="$LB_EXTERNAL_IP" --tls-san="$LB_INTERNAL_IP"
kubectl -n kube-system create secret generic hcloud --from-literal=token=[secret] --from-literal=network=network-kubernetes
kubectl apply -f https://github.com/hetznercloud/hcloud-cloud-controller-manager/releases/latest/download/ccm-networks.yaml
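# Verify the cloud controller manager rolled out (deployment name as used by
# the upstream manifest; adjust if your release names it differently):
kubectl -n kube-system rollout status deployment/hcloud-cloud-controller-manager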
I'm using a Hetzner Load Balancer so the Kubernetes API stays reachable from outside no matter which masters are alive.
# On each additional master:
apt update && apt upgrade -y && apt install apparmor apparmor-utils -y
export K3S_TOKEN="[secret]"
export K3S_VERSION="v1.21.6+k3s1"
export FIRST_MASTER_PRIVATE_IP="10.0.0.2" # adjust
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=$K3S_VERSION K3S_TOKEN=$K3S_TOKEN sh -s - server \
--disable-cloud-controller \
--disable metrics-server \
--server https://$FIRST_MASTER_PRIVATE_IP:6443 \
--write-kubeconfig-mode=644 \
--node-name="$(hostname -f)" \
--cluster-cidr="10.244.0.0/16" \
--etcd-expose-metrics=true \
--kube-controller-manager-arg="address=0.0.0.0" \
--kube-controller-manager-arg="bind-address=0.0.0.0" \
--kube-proxy-arg="metrics-bind-address=0.0.0.0" \
--kube-scheduler-arg="address=0.0.0.0" \
--kube-scheduler-arg="bind-address=0.0.0.0" \
--node-taint CriticalAddonsOnly=true:NoExecute \
--kubelet-arg="cloud-provider=external" \
--node-ip=$(hostname -I | awk '{print $2}') \
--node-external-ip=$(hostname -I | awk '{print $1}') \
--flannel-iface=ens10 \
--tls-san="$(hostname -I | awk '{print $1}')" \
--tls-san="$(hostname -I | awk '{print $2}')"
# On each worker node (agent):
apt update && apt upgrade -y && apt install apparmor apparmor-utils -y
export K3S_TOKEN="[secret]"
export K3S_VERSION="v1.21.6+k3s1"
export MASTER_PRIVATE_IP="10.0.0.2" # adjust
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=$K3S_VERSION K3S_TOKEN=$K3S_TOKEN sh -s - agent \
--server https://$MASTER_PRIVATE_IP:6443 \
--node-name="$(hostname -f)" \
--kubelet-arg="cloud-provider=external" \
--node-ip=$(hostname -I | awk '{print $2}') \
--node-external-ip=$(hostname -I | awk '{print $1}') \
--flannel-iface=ens10
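A quick verification sketch to finish (LB_EXTERNAL_IP as defined earlier; even a 401 reply here proves the load-balancer path works):
kubectl get nodes   # node-1 and node-2 should appear with ROLES <none>
curl -k https://$LB_EXTERNAL_IP:6443/version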
You're welcome! :)
Environmental Info:
K3s Version: v1.21.6+k3s1
Node(s) CPU architecture, OS, and Version:
Cluster Configuration: 3 servers (embedded etcd), 2 agents
Describe the bug: I'm experimenting with k3s and high availability, so I created a cluster based on https://rancher.com/docs/k3s/latest/en/installation/ha-embedded/ and noticed that bringing down the master that was used to create the cluster breaks the cluster, despite its high availability. Pods stop being managed, and the status of this master (only this one is problematic) remains Ready even though it is powered off.
Steps To Reproduce:
Created the first master (master-1) with --cluster-init, per the HA embedded guide above.
Then joined 2 additional masters.
Later joined 2 nodes.
The problem is that the cluster can easily survive a failure of master-2 or master-3 and still spawn pods etc.; however, bringing down master-1, the node used to initialize the cluster, makes things stop working. Pods are no longer managed, and the dead master-1 stays in Ready status forever.
kubectl get cs
occasionally shows errors about components being unhealthy as well, which isn't the case when master-2 or master-3 is down.
Expected behavior:
master-1 can be down and everything stays up and running, since quorum is maintained by master-2 and master-3.
Actual behavior:
The Kubernetes cluster is no longer managed, and node master-1 stays in the Ready state.
Additional context / logs: Will add when I know which logs can help.