rancher / rancher

Unhealthy controller manager and scheduler after leaving it running overnight #14036

Closed · rubenmch closed this issue 6 years ago

rubenmch commented 6 years ago

rancher/rancher version: 2.0.2 (single node install)
Provider: Digital Ocean (1 etcd/control plane node, 3 worker nodes; each 1 GB RAM, 1 vCPU, 25 GB disk, Ubuntu 16.04 x64)

Steps to Reproduce:

superseb commented 6 years ago

The docs state that starting with a small cluster requires 2 vCPUs and 4 GB RAM; please provision nodes with these specifications and try again. Next time, please also include the logging generated by the cluster components so we can take a look at what is happening in your cluster.
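For reference, on an RKE-provisioned node the Kubernetes components run as plain Docker containers, so their logs can be collected directly with docker logs. A minimal sketch, assuming the default RKE container names used elsewhere in this thread:

```sh
# Run on the affected etcd/controlplane node; container names assume RKE defaults.
for c in etcd kube-apiserver kube-controller-manager kube-scheduler kubelet; do
  echo "===== $c ====="
  docker logs --tail 200 "$c" 2>&1
done
```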

rubenmch commented 6 years ago

You're right, I thought that requirement was for the Rancher server, but the documentation clearly states that it's for the nodes.

I upgraded to the 2 vCPU / 4 GB droplets and increased the etcd/control plane nodes to 3, and it hasn't happened again.

apricote commented 6 years ago

We experienced the same issue, albeit with the required 2 vCPUs and 4GB RAM.

The kube-controller-manager and kube-scheduler containers still exist on the machine and are included in the docker ps output, but if I try to list their processes with docker top kube-controller-manager I get the error rpc error: code = 2 desc = containerd: container not found.

This happened yesterday, 2018-06-28, at around 12:00+0200.
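A quick way to reproduce the mismatch described above (the container is still known to the Docker daemon but gone from containerd); the container name is the RKE default and is assumed here:

```sh
# Listed by the Docker daemon...
docker ps --filter name=kube-controller-manager
# ...but its processes cannot be queried because containerd has lost track of it.
docker top kube-controller-manager
docker inspect --format 'status={{.State.Status}} pid={{.State.Pid}}' kube-controller-manager
```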

kube-controller-manager logs (tailing):

I0619 09:49:15.628311       1 wrap.go:42] GET /healthz: (5.423625ms) 200 [[kube-probe/1.10] 127.0.0.1:32952]
I0619 09:49:30.628928       1 wrap.go:42] GET /healthz: (90.227µs) 200 [[kube-probe/1.10] 127.0.0.1:33158]
I0619 09:49:45.630424       1 wrap.go:42] GET /healthz: (23.772µs) 200 [[kube-probe/1.10] 127.0.0.1:33376]
I0619 09:50:00.636317       1 wrap.go:42] GET /healthz: (23.922µs) 200 [[kube-probe/1.10] 127.0.0.1:33584]
I0619 09:50:15.639215       1 wrap.go:42] GET /healthz: (28.943µs) 200 [[kube-probe/1.10] 127.0.0.1:33868]
I0619 09:50:30.639719       1 wrap.go:42] GET /healthz: (34.737µs) 200 [[kube-probe/1.10] 127.0.0.1:34074]

kube-scheduler logs (tailing):

E0628 11:34:33.225815       1 leaderelection.go:258] Failed to update lock: Put https://127.0.0.1:6443/api/v1/namespaces/kube-system/endpoints/kube-scheduler: stream error: stream ID 1444503; INTERNAL_ERROR
E0628 12:13:43.108500       1 leaderelection.go:224] error retrieving resource lock kube-system/kube-scheduler: the server was unable to return a response in the time allotted, but may still be processing the request (get endpoints kube-scheduler)
E0628 12:13:43.230833       1 event.go:260] Could not construct reference to: '&v1.Endpoints{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"", GenerateName:"", Namespace:"", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:""}, Subsets:[]v1.EndpointSubset(nil)}' due to: 'selfLink was empty, can't make reference'. Will not report event: 'Normal' 'LeaderElection' 'master1_2343ee96-7451-11e8-8e86-9600000b0cbf stopped leading'
I0628 12:13:43.379652       1 leaderelection.go:203] failed to renew lease kube-system/kube-scheduler: timed out waiting for the condition
E0628 12:13:43.462217       1 server.go:612] lost master
F0628 12:13:43.667364       1 helpers.go:119] error: lost lease

share-mnt logs (tailing):

Found container ID: 9a9cf9501bc2c5536676864e961dcd5b183bfebc3d38bd14317a9c9adcec7d78
Checking root: /host/run/runc
Checking file: 09bdbb652022175a7128c4ff60c2e50c600d0fc45f840e48789723e0c2e0e810
Checking file: 10044e43131371224d471aaa8d659875b4b6a53bf5c9167daca49b404e199917
Checking file: 1b635ec592989c4fadc2b44e87e6fa4dbb4de1a15010c8558a282e1a2fd5453f
Checking file: 1debdcd5f511cd21331b35453e6d4c33da3894418dbc860e2273712bfd19c908
Checking file: 24c2aa1f5c44ff5b2108a26dab82a24f7359ba2047e2eb73562f356f71e2843f
Checking file: 37e79fd1d99c122e847c0ba45b9ad7337e7d7e5eb48fb220ae32dbd1a5802534
Checking file: 39525e26714f58029fed1358fbb3571ff848950c1b114432ce5f3344627ab443
Checking file: 44de5e828bb13b4199b478424b03996de2ed0b1dcfa03548d53e089eb41e9b0d
Checking file: 49f7a0d136d4fb0ed552f5dc4ccc3f0bf5021846b1e2a4049711da1628159611
Checking file: 4e07f3125133a553a2319ecc3c87b46504c91572f716f931f31dc499b20ba649
Checking file: 5d08596c090d557ab53a0091bce036335a32b0c7d5c026221e3449a250891836
Checking file: 68084de558c13970431d85c97b5e541b57e3b886127bf173c23bf04893f16219
Checking file: 6c116dc30fa3e304245f0eb60265f6170bec73a999d37432bdd4d20735f4ba6a
Checking file: 71739ec3c4c8ab37d6ddd359f03df87cdef3823111d53571879859574daeb138
Checking file: 73ffb489186028777e5c8f2d1d97ec5774667f1faec57d2f4b1e6e13b9dc9b09
Checking file: 74b48b4cabb63e7ee048822d1fb95e20c0632c38ec2d285ddc01a894016d1e10
Checking file: 80793593c1522d1b6a88c54c1653c65c9e789021a3ed2e14e0506cb2320dc94f
Checking file: 85cbca796c208e7959d13894658d247a546a8b1e2bf0640e0245aa1ee82b4f0f
Checking file: 8b66663c4a82900bc562548fa5448d7652e6d108acec307846cbe2ac57b4f16d
Checking file: 8d5883768a90666dd4f8afe4fc82f354474b85767845c336cb7082815c06190e
Checking file: 8e4b5dc4368e5a17b9fda7ecd7517d753fc8c15edb42a66debf6c5e5cc75c95b
Checking file: 91dc594fed216455e881ddf4968899241ad7c7db8632fc2f709b42a8ad9a641a
Checking file: 9a5ffc543cb8863cda6a6c7cdb1bcd2830de85c44661574fd9956518f6dad703
Checking file: 9a9cf9501bc2c5536676864e961dcd5b183bfebc3d38bd14317a9c9adcec7d78
Found state.json: 9a9cf9501bc2c5536676864e961dcd5b183bfebc3d38bd14317a9c9adcec7d78
time="2018-06-28T11:19:33Z" level=info msg="Execing [/usr/bin/nsenter --mount=/proc/22207/ns/mnt -F -- /var/lib/docker/overlay/05350e372638dff6ddbe6fd8ed078180ecf766243bc57df8ce3e5b59208402bd/merged/usr/bin/share-mnt --stage2 /var/lib/kubelet /var/lib/rancher -- norun]" 
kubelet
docker: Error response from daemon: {"message":"grpc: the connection is unavailable"}.
kubelet
Failed to kill container (d062512bd07c1e4135a16f13b86cd752fd4911894803a7633f7ce0f60f164b74): Error response from daemon: {"message":"Cannot kill container d062512bd07c1e4135a16f13b86cd752fd4911894803a7633f7ce0f60f164b74: Cannot kill container d062512bd07c1e4135a16f13b86cd752fd4911894803a7633f7ce0f60f164b74: rpc error: code = 14 desc = grpc: the connection is unavailable"}
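The grpc: the connection is unavailable errors at the end of that log point toward the Docker daemon / containerd rather than Kubernetes itself. A few things worth checking on the affected node, assuming a systemd-managed Docker on Ubuntu 16.04:

```sh
# Is dockerd itself responsive?
docker info
systemctl status docker
# Daemon-side errors around the time the components went unhealthy.
journalctl -u docker --since "2018-06-28 10:00" --until "2018-06-28 13:00"
```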
rubenmch commented 6 years ago

@apricote You can try upgrading to Rancher 2.0.4. Rancher 2.0.2 scheduled workload pods onto control plane nodes, which could cause the etcd/control node to run out of memory. I'm currently using 3 all-purpose (etcd/control plane/worker) nodes and haven't had the issue again.
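A way to check whether workload pods are actually landing on the etcd/control plane node; a sketch that assumes kubectl access and a placeholder node name:

```sh
# Which pods are running on the control plane node?
kubectl get pods --all-namespaces -o wide | grep <control-plane-node-name>
# Does the node carry taints that should keep workloads off it?
kubectl describe node <control-plane-node-name> | grep -i taints
```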

apricote commented 6 years ago

Upgrade is scheduled for next week.

Edit: Restarting the node solved the issue. kube-controller-manager and kube-scheduler are healthy now.

hadifarnoud commented 5 years ago

@apricote I restarted all the worker nodes and it worked. No idea when this will happen again.

mrochcn commented 5 years ago

Same issue here.

valerioleo commented 5 years ago

[screenshot: Controller Manager and Scheduler shown as unhealthy in the Rancher UI]

I keep having the same issue, although it is not strictly an "overnight" one; it can happen without any apparent reason. Another problem is that I couldn't find a way to restart the unhealthy components (Controller Manager and Scheduler), so I always had to recreate a whole new cluster. Do you know if it is possible to restart only the unhealthy components?
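Not an official answer, but on RKE-provisioned clusters the control plane components are plain Docker containers, so they can usually be restarted individually on the affected controlplane node. A sketch, assuming the default RKE container names and the insecure healthz ports used by Kubernetes releases of this era:

```sh
# Restart only the unhealthy components instead of rebuilding the cluster.
docker restart kube-controller-manager kube-scheduler
# Confirm they report healthy again (ports may differ on customized setups).
curl -s http://127.0.0.1:10252/healthz   # kube-controller-manager
curl -s http://127.0.0.1:10251/healthz   # kube-scheduler
```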

Mapiac commented 5 years ago

Can this be re-opened? I am curious why it was closed with no resolution.

jwuman commented 5 years ago

I restarted the etcd container on each host that showed the unhealthy error, and that cleared the alert.

matfiz commented 5 years ago

I have faced the same issue, even though all minimum requirements are met. The only solution for me was to reboot each host. There is also no pattern: the Scheduler can become unhealthy overnight or in the middle of the day. Is there a way to restart it automatically when it becomes unhealthy?

[screenshot, 2019-09-12 9:33 AM: Scheduler reported as unhealthy]
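Regarding the automatic-restart question above: a crude watchdog along these lines could restart the component container whenever its health endpoint stops responding. A sketch only; the container names, ports, and log path are assumptions based on RKE defaults, and restarting this way treats the symptom rather than the underlying cause:

```sh
#!/bin/sh
# Hypothetical watchdog: restart a component if its healthz check fails.
# Run from cron on the controlplane node, e.g. every minute.
check_and_restart() {
  name="$1"; port="$2"
  if ! curl -sf "http://127.0.0.1:${port}/healthz" > /dev/null; then
    echo "$(date -u) ${name} unhealthy, restarting" >> /var/log/component-watchdog.log
    docker restart "${name}"
  fi
}
check_and_restart kube-controller-manager 10252
check_and_restart kube-scheduler 10251
```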

hadifarnoud commented 5 years ago

That would be a bad solution, @matfiz. I have the same issue: one node became unhealthy overnight on a cluster I don't even use.

superseb commented 5 years ago

This issue was closed because the node specs did not meet the requirements. The replies since then describe the scheduler failing, kube-controller-manager failing, both failing, or the node not responding. Without the info requested in the issue template there is no way to investigate what is going on, or whether it is possibly fixed in newer versions. Regarding the scheduler specifically: it does not exit automatically on failure, which will be fixed in k8s 1.16 (https://github.com/kubernetes/kubernetes/pull/81306). There is another issue with using componentstatuses in general, which does not always show the current state of the cluster when more than one controlplane node is present.

If you encounter a similar issue, please file a new issue with all the requested info so we can look into it and investigate. In the case of the screenshot above, your smallest node is running the most important components of the cluster; on paper that may seem enough, but depending on the amount of resources (deployments, events, etc.) it might not be, and without this info it's hard to determine. Providing full logs of all components involved usually helps, as it indicates when the issue started to appear; in the case of a resource problem, system logging and metrics will also help the investigation.
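For context on the componentstatuses caveat mentioned above: the health reported there only reflects the scheduler and controller-manager instances local to the apiserver that answers the request, so with multiple controlplane nodes the output can disagree with the actual state. A quick check, assuming kubectl access:

```sh
# Note: componentstatuses is a legacy API and may not reflect the real
# cluster state when more than one controlplane node is present.
kubectl get componentstatuses
```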