brandond closed this issue 2 years ago.
Nice, dude!
So... the fix will be available in the next version, which will be released in a few days, I assume? :D
1.21 most likely, unless we opt to backport it to the next 1.20 patch release.
Great! I'll just install the new version when it becomes available. Thank you for the hard work.
I'm unable to reproduce this issue with the devices I have at home, so I can't give a good read on validation yet unfortunately. I'll see what I can do about recreating and validating before we release this, but @nirui if you have a test system that you want to give it a try on, you can install from commit id like:
curl -sfL https://get.k3s.io | INSTALL_K3S_COMMIT=8ace8975d293bf6eb46e27d207fb667a47d282a5 sh -
@rancher-max I don't have a good way to reproduce this on demand either, since it appears to be a flake caused by slower IO and CPU on the node. I think I'm OK with just closing it out for the moment; we can reopen if someone is able to reproduce it with the current fix applied.
Yeah, you can close this for now if it troubles you. I'll test the fix later this week and post my findings if there are any.
Also, the hardware is indeed slow, but it boots fine with --no-flannel, which is why I think it's weird and worth a report.
So... I just want to come back and report that the fix did not work...
I've rebuilt the test environment with two of the aforementioned devices. Both nodes in the cluster were initialized with the parameter INSTALL_K3S_COMMIT=8ace8975d293bf6eb46e27d207fb667a47d282a5 as indicated in the comment above, and the agent node was initialized with two additional parameters, K3S_URL and K3S_TOKEN.
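For reference, the install commands looked roughly like this (the control-node address and token below are placeholders, not my actual values):
# on the control node:
curl -sfL https://get.k3s.io | INSTALL_K3S_COMMIT=8ace8975d293bf6eb46e27d207fb667a47d282a5 sh -
# on the agent node:
curl -sfL https://get.k3s.io | INSTALL_K3S_COMMIT=8ace8975d293bf6eb46e27d207fb667a47d282a5 K3S_URL=https://<control-node-ip>:6443 K3S_TOKEN=<node-token> sh -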
After confirming that the agent node had joined the cluster and the cluster was fully up and operational, I performed systemctl restart k3s on the control node. The command failed and produced the following error logs:
k3s[15660]: I0312 19:42:25.613035 15660 trace.go:205] Trace[1541134339]: "List etcd3" key:/traefik.containo.us/traefikservices,resourceVersion:0,resourceVersionMatch:,limit:500,continue: (12-Mar-2021 19:42:24.878) (total time: 734ms):
k3s[15660]: Trace[1541134339]: [734.623511ms] [734.623511ms] END
k3s[15660]: I0312 19:42:25.620330 15660 trace.go:205] Trace[1848616586]: "List" url:/apis/traefik.containo.us/v1alpha1/traefikservices,user-agent:traefik/2.4.2 (linux/arm) kubernetes/crd,client:10.42.0.2 (12-Mar-2021 19:42:24.877) (total time: 742ms):
k3s[15660]: Trace[1848616586]: ---"Listing from storage done" 741ms (19:42:00.619)
k3s[15660]: Trace[1848616586]: [742.340319ms] [742.340319ms] END
k3s[15660]: I0312 19:42:25.656805 15660 trace.go:205] Trace[1149487375]: "List etcd3" key:/traefik.containo.us/traefikservices,resourceVersion:,resourceVersionMatch:,limit:10000,continue: (12-Mar-2021 19:42:24.888) (total time: 767ms):
k3s[15660]: Trace[1149487375]: [767.912176ms] [767.912176ms] END
k3s[15660]: E0312 19:42:26.858698 15660 controller.go:156] Unable to remove old endpoints from kubernetes service: no master IPs were listed in storage, refusing to erase all endpoints for the kubernetes service
k3s[15660]: E0312 19:42:28.153220 15660 available_controller.go:508] v1beta1.metrics.k8s.io failed with: failing or missing response from https://10.43.247.207:443/apis/metrics.k8s.io/v1beta1: Get "https://10.43.247.207:443/apis/metrics.k8s.io/v1beta1": dial tcp 10.43.247.207:443: connect: no route to host
k3s[15660]: E0312 19:42:31.240932 15660 available_controller.go:508] v1beta1.metrics.k8s.io failed with: failing or missing response from https://10.43.247.207:443/apis/metrics.k8s.io/v1beta1: Get "https://10.43.247.207:443/apis/metrics.k8s.io/v1beta1": dial tcp 10.43.247.207:443: connect: no route to host
k3s[15660]: F0312 19:42:32.031358 15660 controllermanager.go:168] error building controller context: failed to wait for apiserver being healthy: timed out waiting for the condition: failed to get apiserver /healthz status: an error on the server ("[+]ping ok\n[+]log ok\n[+]etcd ok\n[+]poststarthook/start-kube-apiserver-admission-initializer ok\n[+]poststarthook/generic-apiserver-start-informers ok\n[+]poststarthook/priority-and-fairness-config-consumer ok\n[+]poststarthook/priority-and-fairness-filter ok\n[+]poststarthook/start-apiextensions-informers ok\n[+]poststarthook/start-apiextensions-controllers ok\n[+]poststarthook/crd-informer-synced ok\n[+]poststarthook/bootstrap-controller ok\n[-]poststarthook/rbac/bootstrap-roles failed: reason withheld\n[+]poststarthook/scheduling/bootstrap-system-priority-classes ok\n[+]poststarthook/priority-and-fairness-config-producer ok\n[+]poststarthook/start-cluster-authentication-info-controller ok\n[+]poststarthook/aggregator-reload-proxy-client-cert ok\n[+]poststarthook/start-kube-aggregator-informers ok\n[-]poststarthook/apiservice-registration-controller failed: reason withheld\n[+]poststarthook/apiservice-status-available-controller ok\n[+]poststarthook/kube-apiserver-autoregistration ok\n[+]autoregister-completion ok\n[+]poststarthook/apiservice-openapi-controller ok\nhealthz check failed") has prevented the request from succeeding
k3s[15660]: goroutine 7019 [running]:
k3s[15660]: github.com/rancher/k3s/vendor/k8s.io/klog/v2.stacks(0x59e4e01, 0x0, 0x56d, 0x5b4)
k3s[15660]: /go/src/github.com/rancher/k3s/vendor/k8s.io/klog/v2/klog.go:1026 +0x94
k3s[15660]: github.com/rancher/k3s/vendor/k8s.io/klog/v2.(*loggingT).output(0x59ceab8, 0x3, 0x0, 0x0, 0xcec8a20, 0x56bdc7e, 0x14, 0xa8, 0x0)
k3s[15660]: /go/src/github.com/rancher/k3s/vendor/k8s.io/klog/v2/klog.go:975 +0x110
<Stack trace follows>
Based on the log, I suspected the new error was caused by the hardcoded timeout introduced by commit f970e49b7d37f642150bcdcbbc4c7da63ea0eb8f being too short. So I cloned this repository (at 8ace8975d293bf6eb46e27d207fb667a47d282a5), changed the timeout to 1 hour, then recompiled it by following the build instructions. systemctl stop k3s and k3s-killall.sh were called to shut down the failing control node, the newly compiled binary was deployed, and then systemctl restart k3s was run. But the same error still occurred.
(Well, the detail is: I built 3 binaries, one directly from the unmodified source, one with a 1-hour timeout, and one that wraps the wait call in an infinite retry loop. Each was built by invoking SKIP_VALIDATE=true make after the modified code was saved. The 1-hour-timeout binary is the one that was actually useful for the test. One weird thing: all 3 binaries have exactly the same file size but different binary content.)
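For completeness, the rebuild steps were roughly as follows (the actual timeout edit is omitted, and paths may differ on your setup):
git clone https://github.com/k3s-io/k3s.git && cd k3s
git checkout 8ace8975d293bf6eb46e27d207fb667a47d282a5
# ... edit the timeout in the source here ...
SKIP_VALIDATE=true make
# then copy the resulting binary over /usr/local/bin/k3s on the control node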
Fetching the cluster info with kubectl repeatedly returns the following:
~# kubectl get all --all-namespaces
E0312 20:22:34.674438 21761 request.go:1011] Unexpected error when reading response body: unexpected EOF
unexpected error when reading response body. Please retry. Original error: unexpected EOF
The connection to the server 127.0.0.1:6443 was refused - did you specify the right host or port?
The connection to the server 127.0.0.1:6443 was refused - did you specify the right host or port?
The connection to the server 127.0.0.1:6443 was refused - did you specify the right host or port?
The connection to the server 127.0.0.1:6443 was refused - did you specify the right host or port?
The connection to the server 127.0.0.1:6443 was refused - did you specify the right host or port?
The connection to the server 127.0.0.1:6443 was refused - did you specify the right host or port?
The connection to the server 127.0.0.1:6443 was refused - did you specify the right host or port?
The connection to the server 127.0.0.1:6443 was refused - did you specify the right host or port?
The connection to the server 127.0.0.1:6443 was refused - did you specify the right host or port?
~# kubectl get all --all-namespaces
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get pods)
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get replicationcontrollers)
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get services)
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get daemonsets.apps)
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get deployments.apps)
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get replicasets.apps)
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get statefulsets.apps)
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get horizontalpodautoscalers.autoscaling)
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get jobs.batch)
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get cronjobs.batch)
~# kubectl get all --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system pod/helm-install-traefik-crd-6fqnp 0/1 Completed 0 131m
kube-system pod/helm-install-traefik-lgq9d 0/1 Completed 2 131m
kube-system pod/svclb-traefik-dgncc 2/2 Running 0 119m
kube-system pod/local-path-provisioner-5ff76fc89d-7brdn 0/1 Error 9 131m
kube-system pod/metrics-server-86cbb8457f-5m2b6 0/1 Error 5 131m
kube-system pod/traefik-8469c8586b-fvdnx 0/1 Unknown 2 125m
kube-system pod/coredns-854c77959c-s6gzb 0/1 Unknown 2 131m
kube-system pod/svclb-traefik-54wzp 0/2 Unknown 4 124m
NAMESPACE NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
default service/kubernetes ClusterIP 10.43.0.1 <none> 443/TCP 132m
kube-system service/kube-dns ClusterIP 10.43.0.10 <none> 53/UDP,53/TCP,9153/TCP 131m
kube-system service/metrics-server ClusterIP 10.43.247.207 <none> 443/TCP 131m
kube-system service/traefik LoadBalancer 10.43.16.86 10.220.179.140,10.220.179.253 80:30399/TCP,443:31096/TCP 125m
~# kubectl get all --all-namespaces
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get pods)
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get replicationcontrollers)
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get services)
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get daemonsets.apps)
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get deployments.apps)
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get replicasets.apps)
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get statefulsets.apps)
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get horizontalpodautoscalers.autoscaling)
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get jobs.batch)
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get cronjobs.batch)
Of course, the --no-flannel trick no longer works, because the boot-up is now stopped by the apiserver waiter.
Here is some additional information which may or may not be useful for this case:
~# cat /etc/systemd/system/k3s.service
[Unit]
Description=Lightweight Kubernetes
Documentation=https://k3s.io
Wants=network-online.target
After=network-online.target
[Install]
WantedBy=multi-user.target
[Service]
Type=notify
EnvironmentFile=/etc/systemd/system/k3s.service.env
KillMode=process
Delegate=yes
# Having non-zero Limit*s causes performance problems due to accounting overhead
# in the kernel. We recommend using cgroups to do container-local accounting.
LimitNOFILE=1048576
LimitNPROC=infinity
LimitCORE=infinity
TasksMax=infinity
TimeoutStartSec=0
Restart=always
RestartSec=5s
ExecStartPre=-/sbin/modprobe br_netfilter
ExecStartPre=-/sbin/modprobe overlay
ExecStart=/usr/local/bin/k3s \
server \
~# crictl --runtime-endpoint unix:///var/run/k3s/containerd/containerd.sock images
IMAGE TAG IMAGE ID SIZE
docker.io/rancher/coredns-coredns 1.8.0 a0ce6ab869a69 11.9MB
docker.io/rancher/klipper-helm v0.4.3 0bdabf617c29a 47.7MB
docker.io/rancher/klipper-lb v0.1.2 7d23a14d38d24 2.58MB
docker.io/rancher/library-traefik 2.4.2 253e6b02a96a7 27.1MB
docker.io/rancher/local-path-provisioner v0.0.19 1e695755cc09d 12.7MB
docker.io/rancher/metrics-server v0.3.6 d24dd28770a36 10.2MB
docker.io/rancher/pause 3.1 e11a8cbeda868 231kB
~# crictl --runtime-endpoint unix:///var/run/k3s/containerd/containerd.sock ps -a
CONTAINER IMAGE CREATED STATE NAME ATTEMPT POD ID
91e1cd4428fb9 d24dd28770a36 Less than a second ago Created metrics-server 7 29403bfaa3bd9
2eeaa1dc81238 253e6b02a96a7 2 minutes ago Running traefik 3 00456951ebfe4
5ec982f2a8bfa 1e695755cc09d 2 minutes ago Exited local-path-provisioner 10 38357e8dd61a5
3c1adbeafaaec 7d23a14d38d24 2 minutes ago Running lb-port-443 3 7bec24357cb31
c1d6c2f9a9556 7d23a14d38d24 2 minutes ago Running lb-port-80 3 7bec24357cb31
f6f68a4a980e2 d24dd28770a36 2 minutes ago Exited metrics-server 6 29403bfaa3bd9
fa8aeb910f29e a0ce6ab869a69 2 minutes ago Running coredns 3 467757cf44fb1
7d5c21df93ecd 7d23a14d38d24 44 minutes ago Exited lb-port-443 2 06a3b1553fedc
a8db82089690b 7d23a14d38d24 44 minutes ago Exited lb-port-80 2 06a3b1553fedc
d1823cf986ede 253e6b02a96a7 44 minutes ago Exited traefik 2 a8d5c4f1fb017
1f29f7922d8c1 a0ce6ab869a69 44 minutes ago Exited coredns 2 ffc95c04b46a1
70bf79328349d 0bdabf617c29a 2 hours ago Exited helm 2 9fc123776a44f
2cee92c5db469 0bdabf617c29a 2 hours ago Exited helm 0 2536ab7078e44
Now, I appreciate the effort that has been put into this. As I understand it, this problem only impacts devices that are too old and slow to run Kubernetes anyway. Add in the fact that you guys don't have the exact device, and that there are only 2 reports related to this (and the other reporter stopped responding long ago), and maybe it's not a problem worth fixing. I mean, I won't feel pissed if it comes to that.
So... that's all from me so far. Again, thank you!
[-]poststarthook/rbac/bootstrap-roles failed: reason withheld
I'm not sure why this particular item blocks on slow nodes. It's basically responsible for ensuring that all the core RBAC exists every time the cluster starts up; my guess is that it fails in some non-recoverable way if etcd is running slowly.
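If you want to see which hooks are unhappy while it's in that state, the apiserver's verbose healthz endpoint should report them individually, something like:
kubectl get --raw '/healthz?verbose'
# or query just the one hook:
kubectl get --raw '/healthz/poststarthook/rbac/bootstrap-roles'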
Is there anything for me to do/test in order to help the rbac situation? Before I completely tear down the test cluster setup as well as my broken little heart? 🙃
If you start with increased verbosity (--v=2 should do it, I think) it will tell you why the rbac hook is not ready when it gets to that point. This is all deep core Kubernetes code though, and not likely to be something we can fix directly.
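On a systemd-managed install, one way to bump the verbosity is to edit the ExecStart line in the unit file shown earlier (adjust the path and args to match your install), then reload and watch the logs:
#   ExecStart=/usr/local/bin/k3s server --v 2
systemctl daemon-reload
systemctl restart k3s
journalctl -u k3s -f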
Understood.
I don't really have enough knowledge to debug the inner workings of Kubernetes. So now I just assume it's either that the hardware can't run Kubernetes, or that I'm too dumb to operate it.
I'll continue experimenting on this cluster to figure things out a little bit more, but I don't think there is anything worth reporting anymore.
Here's a little more information that I collected during my tests. I'll just paste it here to help future humans learn of my struggles and stupidities for their own entertainment, should they visit the North Pole to see the artifacts, because it probably won't help others much. Anyways...
The log that I captured during the crash cycles: k3s-service-crashes.log
Here's me calling kubectl get nodes repeatedly; it seems the API server worked for a few minutes before being shut down:
root@cubie0:~# kubectl get nodes
NAME STATUS ROLES AGE VERSION
cubie1 Ready <none> 35m v1.20.4+k3s-8ace8975
cubie0 Ready control-plane,master 142m v1.20.4+k3s-8ace8975
root@cubie0:~# kubectl get nodes
The connection to the server 127.0.0.1:6443 was refused - did you specify the right host or port?
root@cubie0:~# kubectl get nodes
The connection to the server 127.0.0.1:6443 was refused - did you specify the right host or port?
root@cubie0:~# kubectl get nodes
Error from server (ServiceUnavailable): the server is currently unable to handle the request
root@cubie0:~# kubectl get nodes
Error from server (ServiceUnavailable): the server is currently unable to handle the request
root@cubie0:~# kubectl get nodes
NAME STATUS ROLES AGE VERSION
cubie1 Ready <none> 36m v1.20.4+k3s-8ace8975
cubie0 Ready control-plane,master 143m v1.20.4+k3s-8ace8975
root@cubie0:~# kubectl get nodes
NAME STATUS ROLES AGE VERSION
cubie1 Ready <none> 36m v1.20.4+k3s-8ace8975
cubie0 Ready control-plane,master 143m v1.20.4+k3s-8ace8975
root@cubie0:~# kubectl get nodes
NAME STATUS ROLES AGE VERSION
cubie1 Ready <none> 36m v1.20.4+k3s-8ace8975
cubie0 Ready control-plane,master 143m v1.20.4+k3s-8ace8975
root@cubie0:~# kubectl get nodes
NAME STATUS ROLES AGE VERSION
cubie1 Ready <none> 36m v1.20.4+k3s-8ace8975
cubie0 Ready control-plane,master 143m v1.20.4+k3s-8ace8975
root@cubie0:~# kubectl get nodes
NAME STATUS ROLES AGE VERSION
cubie1 Ready <none> 36m v1.20.4+k3s-8ace8975
cubie0 Ready control-plane,master 143m v1.20.4+k3s-8ace8975
root@cubie0:~# kubectl get nodes
NAME STATUS ROLES AGE VERSION
cubie1 Ready <none> 36m v1.20.4+k3s-8ace8975
cubie0 Ready control-plane,master 143m v1.20.4+k3s-8ace8975
root@cubie0:~# kubectl get nodes
NAME STATUS ROLES AGE VERSION
cubie1 Ready <none> 37m v1.20.4+k3s-8ace8975
cubie0 Ready control-plane,master 143m v1.20.4+k3s-8ace8975
root@cubie0:~# kubectl get nodes
NAME STATUS ROLES AGE VERSION
cubie1 Ready <none> 37m v1.20.4+k3s-8ace8975
cubie0 Ready control-plane,master 144m v1.20.4+k3s-8ace8975
root@cubie0:~# kubectl get nodes
NAME STATUS ROLES AGE VERSION
cubie1 Ready <none> 37m v1.20.4+k3s-8ace8975
cubie0 Ready control-plane,master 144m v1.20.4+k3s-8ace8975
root@cubie0:~# kubectl get nodes
NAME STATUS ROLES AGE VERSION
cubie1 Ready <none> 37m v1.20.4+k3s-8ace8975
cubie0 Ready control-plane,master 144m v1.20.4+k3s-8ace8975
root@cubie0:~# kubectl get nodes
NAME STATUS ROLES AGE VERSION
cubie1 Ready <none> 37m v1.20.4+k3s-8ace8975
cubie0 Ready control-plane,master 144m v1.20.4+k3s-8ace8975
root@cubie0:~# kubectl get nodes
NAME STATUS ROLES AGE VERSION
cubie1 Ready <none> 37m v1.20.4+k3s-8ace8975
cubie0 Ready control-plane,master 144m v1.20.4+k3s-8ace8975
root@cubie0:~# kubectl get nodes
NAME STATUS ROLES AGE VERSION
cubie1 Ready <none> 37m v1.20.4+k3s-8ace8975
cubie0 Ready control-plane,master 144m v1.20.4+k3s-8ace8975
root@cubie0:~# kubectl get nodes
NAME STATUS ROLES AGE VERSION
cubie1 Ready <none> 37m v1.20.4+k3s-8ace8975
cubie0 Ready control-plane,master 144m v1.20.4+k3s-8ace8975
root@cubie0:~# kubectl get nodes
NAME STATUS ROLES AGE VERSION
cubie1 Ready <none> 37m v1.20.4+k3s-8ace8975
cubie0 Ready control-plane,master 144m v1.20.4+k3s-8ace8975
root@cubie0:~# kubectl get nodes
NAME STATUS ROLES AGE VERSION
cubie1 Ready <none> 37m v1.20.4+k3s-8ace8975
cubie0 Ready control-plane,master 144m v1.20.4+k3s-8ace8975
root@cubie0:~# kubectl get nodes
NAME STATUS ROLES AGE VERSION
cubie1 Ready <none> 38m v1.20.4+k3s-8ace8975
cubie0 Ready control-plane,master 145m v1.20.4+k3s-8ace8975
root@cubie0:~# kubectl get nodes
NAME STATUS ROLES AGE VERSION
cubie0 Ready control-plane,master 145m v1.20.4+k3s-8ace8975
cubie1 Ready <none> 38m v1.20.4+k3s-8ace8975
root@cubie0:~# kubectl get nodes
NAME STATUS ROLES AGE VERSION
cubie0 Ready control-plane,master 145m v1.20.4+k3s-8ace8975
cubie1 Ready <none> 38m v1.20.4+k3s-8ace8975
root@cubie0:~# kubectl get nodes
The connection to the server 127.0.0.1:6443 was refused - did you specify the right host or port?
root@cubie0:~# kubectl get nodes
The connection to the server 127.0.0.1:6443 was refused - did you specify the right host or port?
root@cubie0:~# kubectl get nodes
The connection to the server 127.0.0.1:6443 was refused - did you specify the right host or port?
root@cubie0:~#
And the output of kubectl describe nodes when it works:
root@cubie0:~# kubectl describe nodes
Name: cubie1
Roles: <none>
Labels: beta.kubernetes.io/arch=arm
beta.kubernetes.io/instance-type=k3s
beta.kubernetes.io/os=linux
k3s.io/hostname=cubie1
k3s.io/internal-ip=10.220.179.140
kubernetes.io/arch=arm
kubernetes.io/hostname=cubie1
kubernetes.io/os=linux
node.kubernetes.io/instance-type=k3s
Annotations: flannel.alpha.coreos.com/backend-data: {"VtepMAC":"d2:1b:14:37:80:91"}
flannel.alpha.coreos.com/backend-type: vxlan
flannel.alpha.coreos.com/kube-subnet-manager: true
flannel.alpha.coreos.com/public-ip: 10.220.179.140
k3s.io/node-args: ["agent"]
k3s.io/node-config-hash: N624NCONM6NMLAYPK4RLWKY52UNBEJFLO2JFY7Q6NP6QEBN6GZYA====
k3s.io/node-env:
{"K3S_DATA_DIR":"/var/lib/rancher/k3s/data/3f87137d37e82a44b14b1e280186ecc6b29bf888a9730cc6c907a4c68426b5d4","K3S_TOKEN":"********","K3S_U...
node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Sun, 14 Mar 2021 21:46:11 +0800
Taints: <none>
Unschedulable: false
Lease:
HolderIdentity: cubie1
AcquireTime: <unset>
RenewTime: Sun, 14 Mar 2021 22:57:07 +0800
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
NetworkUnavailable False Sun, 14 Mar 2021 21:46:21 +0800 Sun, 14 Mar 2021 21:46:21 +0800 FlannelIsUp Flannel is running on this node
MemoryPressure False Sun, 14 Mar 2021 22:52:56 +0800 Sun, 14 Mar 2021 21:46:07 +0800 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Sun, 14 Mar 2021 22:52:56 +0800 Sun, 14 Mar 2021 21:46:07 +0800 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Sun, 14 Mar 2021 22:52:56 +0800 Sun, 14 Mar 2021 21:46:07 +0800 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Sun, 14 Mar 2021 22:52:56 +0800 Sun, 14 Mar 2021 21:46:16 +0800 KubeletReady kubelet is posting ready status
Addresses:
InternalIP: 10.220.179.140
Hostname: cubie1
Capacity:
cpu: 2
ephemeral-storage: 7458672Ki
memory: 1022940Ki
pods: 110
Allocatable:
cpu: 2
ephemeral-storage: 7255796116
memory: 1022940Ki
pods: 110
System Info:
Machine ID: 59a23be25c384437ad3f08c9585a8d94
System UUID: 59a23be25c384437ad3f08c9585a8d94
Boot ID: e4867dfb-de3f-4eb7-91d2-f2eb47635e81
Kernel Version: 5.10.16-sunxi
OS Image: Armbian 21.02.2 Buster
Operating System: linux
Architecture: arm
Container Runtime Version: containerd://1.4.3-k3s3
Kubelet Version: v1.20.4+k3s-8ace8975
Kube-Proxy Version: v1.20.4+k3s-8ace8975
PodCIDR: 10.42.1.0/24
PodCIDRs: 10.42.1.0/24
ProviderID: k3s://cubie1
Non-terminated Pods: (1 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE
--------- ---- ------------ ---------- --------------- ------------- ---
kube-system svclb-traefik-l6z64 0 (0%) 0 (0%) 0 (0%) 0 (0%) 72m
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 0 (0%) 0 (0%)
memory 0 (0%) 0 (0%)
ephemeral-storage 0 (0%) 0 (0%)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Starting 72m kubelet Starting kubelet.
Warning InvalidDiskCapacity 72m kubelet invalid capacity 0 on image filesystem
Normal NodeHasSufficientMemory 72m (x2 over 72m) kubelet Node cubie1 status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 72m (x2 over 72m) kubelet Node cubie1 status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientPID 72m (x2 over 72m) kubelet Node cubie1 status is now: NodeHasSufficientPID
Normal NodeAllocatableEnforced 72m kubelet Updated Node Allocatable limit across pods
Normal Starting 72m kube-proxy Starting kube-proxy.
Normal NodeReady 72m kubelet Node cubie1 status is now: NodeReady
Name: cubie0
Roles: control-plane,master
Labels: beta.kubernetes.io/arch=arm
beta.kubernetes.io/instance-type=k3s
beta.kubernetes.io/os=linux
k3s.io/hostname=cubie0
k3s.io/internal-ip=10.220.179.253
kubernetes.io/arch=arm
kubernetes.io/hostname=cubie0
kubernetes.io/os=linux
node-role.kubernetes.io/control-plane=true
node-role.kubernetes.io/master=true
node.kubernetes.io/instance-type=k3s
Annotations: flannel.alpha.coreos.com/backend-data: {"VtepMAC":"9a:7f:74:c6:35:76"}
flannel.alpha.coreos.com/backend-type: vxlan
flannel.alpha.coreos.com/kube-subnet-manager: true
flannel.alpha.coreos.com/public-ip: 10.220.179.253
k3s.io/node-args: ["server","--v","2"]
k3s.io/node-config-hash: VKQF3MXOHEN3R6NWURA5F3LPB262NQABQCUONGKU3RRZWIRVTJTQ====
k3s.io/node-env: {"K3S_DATA_DIR":"/var/lib/rancher/k3s/data/3f87137d37e82a44b14b1e280186ecc6b29bf888a9730cc6c907a4c68426b5d4"}
node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Sun, 14 Mar 2021 19:59:17 +0800
Taints: <none>
Unschedulable: false
Lease:
HolderIdentity: cubie0
AcquireTime: <unset>
RenewTime: Sun, 14 Mar 2021 22:57:02 +0800
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
NetworkUnavailable False Sun, 14 Mar 2021 22:56:36 +0800 Sun, 14 Mar 2021 22:56:36 +0800 FlannelIsUp Flannel is running on this node
MemoryPressure False Sun, 14 Mar 2021 22:56:55 +0800 Sun, 14 Mar 2021 20:04:44 +0800 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Sun, 14 Mar 2021 22:56:55 +0800 Sun, 14 Mar 2021 20:04:44 +0800 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Sun, 14 Mar 2021 22:56:55 +0800 Sun, 14 Mar 2021 20:04:44 +0800 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Sun, 14 Mar 2021 22:56:55 +0800 Sun, 14 Mar 2021 20:04:44 +0800 KubeletReady kubelet is posting ready status
Addresses:
InternalIP: 10.220.179.253
Hostname: cubie0
Capacity:
cpu: 2
ephemeral-storage: 60191424Ki
memory: 1022940Ki
pods: 110
Allocatable:
cpu: 2
ephemeral-storage: 58554217222
memory: 1022940Ki
pods: 110
System Info:
Machine ID: 59a23be25c384437ad3f08c9585a8d94
System UUID: 59a23be25c384437ad3f08c9585a8d94
Boot ID: d02739f8-8ee0-4d4d-befa-73d70d405b4e
Kernel Version: 5.10.16-sunxi
OS Image: Armbian 21.02.2 Buster
Operating System: linux
Architecture: arm
Container Runtime Version: containerd://1.4.3-k3s3
Kubelet Version: v1.20.4+k3s-8ace8975
Kube-Proxy Version: v1.20.4+k3s-8ace8975
PodCIDR: 10.42.0.0/24
PodCIDRs: 10.42.0.0/24
ProviderID: k3s://cubie0
Non-terminated Pods: (5 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE
--------- ---- ------------ ---------- --------------- ------------- ---
kube-system svclb-traefik-7ffdx 0 (0%) 0 (0%) 0 (0%) 0 (0%) 171m
kube-system metrics-server-86cbb8457f-x9bxt 0 (0%) 0 (0%) 0 (0%) 0 (0%) 177m
kube-system coredns-854c77959c-4kg9l 100m (5%) 0 (0%) 70Mi (7%) 170Mi (17%) 177m
kube-system traefik-8469c8586b-cj7c8 0 (0%) 0 (0%) 0 (0%) 0 (0%) 172m
kube-system local-path-provisioner-5ff76fc89d-v58mg 0 (0%) 0 (0%) 0 (0%) 0 (0%) 177m
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 100m (5%) 0 (0%)
memory 70Mi (7%) 170Mi (17%)
ephemeral-storage 0 (0%) 0 (0%)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning InvalidDiskCapacity 64m kubelet invalid capacity 0 on image filesystem
Normal NodeHasSufficientMemory 64m kubelet Node cubie0 status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 64m kubelet Node cubie0 status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientPID 64m kubelet Node cubie0 status is now: NodeHasSufficientPID
Normal NodeAllocatableEnforced 64m kubelet Updated Node Allocatable limit across pods
Normal Starting 63m kube-proxy Starting kube-proxy.
Normal NodeHasNoDiskPressure 61m kubelet Node cubie0 status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientMemory 61m kubelet Node cubie0 status is now: NodeHasSufficientMemory
Normal NodeHasSufficientPID 61m kubelet Node cubie0 status is now: NodeHasSufficientPID
Normal Starting 60m kube-proxy Starting kube-proxy.
Normal NodeHasSufficientMemory 59m kubelet Node cubie0 status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 59m kubelet Node cubie0 status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientPID 59m kubelet Node cubie0 status is now: NodeHasSufficientPID
Normal Starting 59m kube-proxy Starting kube-proxy.
Warning InvalidDiskCapacity 57m kubelet invalid capacity 0 on image filesystem
Normal NodeHasSufficientMemory 57m kubelet Node cubie0 status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 57m kubelet Node cubie0 status is now: NodeHasNoDiskPressure
Normal Starting 57m kube-proxy Starting kube-proxy.
Normal NodeHasNoDiskPressure 56m kubelet Node cubie0 status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientPID 56m kubelet Node cubie0 status is now: NodeHasSufficientPID
Normal NodeHasSufficientMemory 56m kubelet Node cubie0 status is now: NodeHasSufficientMemory
Warning InvalidDiskCapacity 56m kubelet invalid capacity 0 on image filesystem
Normal NodeAllocatableEnforced 56m kubelet Updated Node Allocatable limit across pods
Normal Starting 55m kube-proxy Starting kube-proxy.
Warning InvalidDiskCapacity 54m kubelet invalid capacity 0 on image filesystem
Warning InvalidDiskCapacity 51m kubelet invalid capacity 0 on image filesystem
Normal NodeHasSufficientMemory 51m kubelet Node cubie0 status is now: NodeHasSufficientMemory
Normal Starting 50m kube-proxy Starting kube-proxy.
Normal NodeHasSufficientMemory 49m kubelet Node cubie0 status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 49m kubelet Node cubie0 status is now: NodeHasNoDiskPressure
Warning InvalidDiskCapacity 49m kubelet invalid capacity 0 on image filesystem
Normal Starting 49m kube-proxy Starting kube-proxy.
Warning InvalidDiskCapacity 47m kubelet invalid capacity 0 on image filesystem
Normal NodeHasSufficientMemory 47m kubelet Node cubie0 status is now: NodeHasSufficientMemory
Normal Starting 47m kube-proxy Starting kube-proxy.
Normal NodeHasSufficientMemory 46m kubelet Node cubie0 status is now: NodeHasSufficientMemory
Normal NodeHasSufficientPID 46m kubelet Node cubie0 status is now: NodeHasSufficientPID
Normal NodeHasNoDiskPressure 46m kubelet Node cubie0 status is now: NodeHasNoDiskPressure
Normal NodeAllocatableEnforced 45m kubelet Updated Node Allocatable limit across pods
Normal NodeHasSufficientMemory 44m kubelet Node cubie0 status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 44m kubelet Node cubie0 status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientPID 44m kubelet Node cubie0 status is now: NodeHasSufficientPID
Warning InvalidDiskCapacity 44m kubelet invalid capacity 0 on image filesystem
Normal Starting 43m kube-proxy Starting kube-proxy.
Normal NodeHasSufficientPID 42m kubelet Node cubie0 status is now: NodeHasSufficientPID
Warning InvalidDiskCapacity 42m kubelet invalid capacity 0 on image filesystem
Normal NodeHasSufficientMemory 42m kubelet Node cubie0 status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 42m kubelet Node cubie0 status is now: NodeHasNoDiskPressure
Normal NodeAllocatableEnforced 42m kubelet Updated Node Allocatable limit across pods
Normal NodeHasSufficientMemory 40m kubelet Node cubie0 status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 40m kubelet Node cubie0 status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientPID 40m kubelet Node cubie0 status is now: NodeHasSufficientPID
Warning InvalidDiskCapacity 38m kubelet invalid capacity 0 on image filesystem
Normal NodeHasSufficientMemory 38m kubelet Node cubie0 status is now: NodeHasSufficientMemory
Normal NodeHasSufficientPID 38m kubelet Node cubie0 status is now: NodeHasSufficientPID
Normal NodeHasNoDiskPressure 38m kubelet Node cubie0 status is now: NodeHasNoDiskPressure
Normal NodeAllocatableEnforced 38m kubelet Updated Node Allocatable limit across pods
Normal Starting 38m kube-proxy Starting kube-proxy.
Normal NodeHasSufficientMemory 35m kubelet Node cubie0 status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 35m kubelet Node cubie0 status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientPID 35m kubelet Node cubie0 status is now: NodeHasSufficientPID
Normal NodeAllocatableEnforced 35m kubelet Updated Node Allocatable limit across pods
Normal NodeHasSufficientPID 32m kubelet Node cubie0 status is now: NodeHasSufficientPID
Normal NodeHasSufficientMemory 32m kubelet Node cubie0 status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 32m kubelet Node cubie0 status is now: NodeHasNoDiskPressure
Normal NodeAllocatableEnforced 32m kubelet Updated Node Allocatable limit across pods
Warning InvalidDiskCapacity 31m kubelet invalid capacity 0 on image filesystem
Normal NodeHasSufficientMemory 31m kubelet Node cubie0 status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 31m kubelet Node cubie0 status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientPID 31m kubelet Node cubie0 status is now: NodeHasSufficientPID
Normal NodeAllocatableEnforced 30m kubelet Updated Node Allocatable limit across pods
Normal NodeHasSufficientPID 29m kubelet Node cubie0 status is now: NodeHasSufficientPID
Normal NodeHasNoDiskPressure 29m kubelet Node cubie0 status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientMemory 29m kubelet Node cubie0 status is now: NodeHasSufficientMemory
Normal NodeAllocatableEnforced 28m kubelet Updated Node Allocatable limit across pods
Normal NodeHasSufficientMemory 27m kubelet Node cubie0 status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 27m kubelet Node cubie0 status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientPID 27m kubelet Node cubie0 status is now: NodeHasSufficientPID
Normal NodeAllocatableEnforced 26m kubelet Updated Node Allocatable limit across pods
Normal Starting 26m kube-proxy Starting kube-proxy.
Normal NodeHasSufficientPID 25m kubelet Node cubie0 status is now: NodeHasSufficientPID
Normal NodeHasSufficientMemory 25m kubelet Node cubie0 status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 25m kubelet Node cubie0 status is now: NodeHasNoDiskPressure
Normal NodeAllocatableEnforced 25m kubelet Updated Node Allocatable limit across pods
Normal Starting 24m kube-proxy Starting kube-proxy.
Normal NodeHasSufficientMemory 23m kubelet Node cubie0 status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 23m kubelet Node cubie0 status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientPID 23m kubelet Node cubie0 status is now: NodeHasSufficientPID
Normal NodeAllocatableEnforced 23m kubelet Updated Node Allocatable limit across pods
Normal Starting 22m kube-proxy Starting kube-proxy.
Normal NodeHasSufficientPID 21m kubelet Node cubie0 status is now: NodeHasSufficientPID
Warning InvalidDiskCapacity 21m kubelet invalid capacity 0 on image filesystem
Normal NodeHasNoDiskPressure 21m kubelet Node cubie0 status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientMemory 21m kubelet Node cubie0 status is now: NodeHasSufficientMemory
Normal Starting 20m kube-proxy Starting kube-proxy.
Normal NodeHasSufficientMemory 19m kubelet Node cubie0 status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 19m kubelet Node cubie0 status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientPID 19m kubelet Node cubie0 status is now: NodeHasSufficientPID
Normal Starting 19m kube-proxy Starting kube-proxy.
Normal NodeHasSufficientMemory 17m kubelet Node cubie0 status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 17m kubelet Node cubie0 status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientPID 17m kubelet Node cubie0 status is now: NodeHasSufficientPID
Normal NodeAllocatableEnforced 17m kubelet Updated Node Allocatable limit across pods
Normal Starting 17m kube-proxy Starting kube-proxy.
Warning InvalidDiskCapacity 16m kubelet invalid capacity 0 on image filesystem
Normal NodeHasNoDiskPressure 16m kubelet Node cubie0 status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientMemory 16m kubelet Node cubie0 status is now: NodeHasSufficientMemory
Normal NodeHasSufficientPID 16m kubelet Node cubie0 status is now: NodeHasSufficientPID
Normal NodeAllocatableEnforced 15m kubelet Updated Node Allocatable limit across pods
Normal Starting 15m kube-proxy Starting kube-proxy.
Normal NodeHasSufficientMemory 13m kubelet Node cubie0 status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 13m kubelet Node cubie0 status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientPID 13m kubelet Node cubie0 status is now: NodeHasSufficientPID
Warning InvalidDiskCapacity 13m kubelet invalid capacity 0 on image filesystem
Normal NodeAllocatableEnforced 13m kubelet Updated Node Allocatable limit across pods
Normal NodeHasNoDiskPressure 11m kubelet Node cubie0 status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientPID 11m kubelet Node cubie0 status is now: NodeHasSufficientPID
Normal NodeHasSufficientMemory 11m kubelet Node cubie0 status is now: NodeHasSufficientMemory
Normal NodeAllocatableEnforced 11m kubelet Updated Node Allocatable limit across pods
Normal Starting 11m kube-proxy Starting kube-proxy.
Warning InvalidDiskCapacity 10m kubelet invalid capacity 0 on image filesystem
Normal NodeHasSufficientMemory 10m kubelet Node cubie0 status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 10m kubelet Node cubie0 status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientPID 10m kubelet Node cubie0 status is now: NodeHasSufficientPID
Normal NodeAllocatableEnforced 9m56s kubelet Updated Node Allocatable limit across pods
Normal NodeHasSufficientMemory 8m11s kubelet Node cubie0 status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 8m11s kubelet Node cubie0 status is now: NodeHasNoDiskPressure
Normal Starting 7m37s kube-proxy Starting kube-proxy.
Normal NodeHasSufficientPID 6m27s kubelet Node cubie0 status is now: NodeHasSufficientPID
Warning InvalidDiskCapacity 6m27s kubelet invalid capacity 0 on image filesystem
Normal NodeHasSufficientMemory 6m27s kubelet Node cubie0 status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 6m27s kubelet Node cubie0 status is now: NodeHasNoDiskPressure
Normal NodeAllocatableEnforced 6m21s kubelet Updated Node Allocatable limit across pods
Normal Starting 5m51s kube-proxy Starting kube-proxy.
Warning InvalidDiskCapacity 4m32s kubelet invalid capacity 0 on image filesystem
Normal NodeHasSufficientMemory 4m32s kubelet Node cubie0 status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 4m32s kubelet Node cubie0 status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientPID 4m32s kubelet Node cubie0 status is now: NodeHasSufficientPID
Normal NodeAllocatableEnforced 4m27s kubelet Updated Node Allocatable limit across pods
Warning InvalidDiskCapacity 2m33s kubelet invalid capacity 0 on image filesystem
Normal NodeHasSufficientMemory 2m33s kubelet Node cubie0 status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 2m33s kubelet Node cubie0 status is now: NodeHasNoDiskPressure
@brandond I'm moving this back to "Working" in case there's anything we find that may help, and as time permits I'll see if there's anything else I can do to reproduce this. My thought is to find a way to set the healthz timeout short enough for these checks not to finish in time and therefore fail. If I had to guess, I'm going to try --shutdown-delay-duration as an apiserver arg and gradually increase it from 1 second just to see what happens.
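If we go that route, passing it through k3s's apiserver arg passthrough should look something like this (not verified yet):
k3s server --kube-apiserver-arg=shutdown-delay-duration=1s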
@rancher-max let's keep this in to-verify until we have a solid way to reproduce this that doesn't involve running it on excessively resource-constrained platforms that probably won't work anyway.
I seem to have solved this problem; the memory and CPU limits set on the kubelet were too small.
@1998729 Can you share a little bit more detail? I'm interested to test it at my end. Thanks!
I disabled the CPU and memory limits in the kubelet's systemd start parameters.
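(If those limits were coming from systemd, clearing them with a drop-in would look roughly like this; the unit name and directives here are assumptions, not necessarily the exact change made:)
mkdir -p /etc/systemd/system/k3s.service.d
cat <<'EOF' > /etc/systemd/system/k3s.service.d/override.conf
[Service]
CPUQuota=
MemoryMax=
EOF
systemctl daemon-reload
systemctl restart k3s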
I just wanted to report my case. I have a resource-constrained node (2GB RAM, 1.6GHz N2600) on which this problem would appear very often but would sometimes fix itself. After upgrading to 1.21 it works without problems.
After upgrading to 1.21 it works without problem.
That sounds great. Unfortunately I don't have the devices to test it anymore; my old boards... let's just say some of them don't have RAM chips on them anymore (as well as a few of the solder pads for those chips).
Can we just assume this problem is solved until a new issue arises?
Thanks!
Originally posted by @nirui in https://github.com/k3s-io/k3s/issues/2509#issuecomment-786573486
Not sure if my issue was related.
But I got the same
flannel exited: failed to acquire lease: nodes "<Node Name>" is forbidden: not yet ready to handle request
error after a reboot, and then the cluster kept crashing. I fixed the problem by disabling the builtin flannel and installing my own instead. Basically, I changed it in my k3s.service, and then, once the cluster started up, installed the CNI plugins from https://github.com/containernetworking/plugins/releases and ran
kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
I don't think this problem was related to the speed of the SD cards, because I used the same card and the problem just went away after I disabled the built-in flannel.
Maybe this will kick you smart & nice guys into looking into this issue a little more? blink blink I mean... maybe the built-in flannel should... you know, just keep retrying instead of just crashing? :D
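For anyone landing here later: running k3s without the bundled flannel typically looks roughly like the following. The flag and manifest are illustrative and not necessarily the exact change made in the comment above.
# disable the bundled flannel (older releases used --no-flannel, newer ones --flannel-backend=none):
#   ExecStart=/usr/local/bin/k3s server --flannel-backend=none
# then install the CNI plugins from https://github.com/containernetworking/plugins/releases
# and deploy flannel yourself:
kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml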