
RemoteMachine Setup: K0smotronControlPlane Etcd stuck restarting due to Rampant Reconciles #617

Open prannonpendragas opened 3 weeks ago

prannonpendragas commented 3 weeks ago

I'm running a single-node k0s cluster on a bare-metal machine in my home lab.

I've installed k0smotron on this node with the intent to use it as a control plane provider for RemoteMachines that I install via Metal3.

I've installed k0smotron on my system following the guide here: https://docs.k0smotron.io/v1.0.0/install/

I've tried to follow the guide at https://docs.k0smotron.io/v1.0.0/capi-remote/ to set up Cluster, K0smotronControlPlane, RemoteCluster, Machine, K0sWorkerConfig, and RemoteMachine objects (a rough sketch of the manifests is below) to:

  1. Set up a k0smotron control plane that lives in pods on my single-node k0s, and
  2. Connect the remote machine as a worker to this control plane.
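
To give a sense of the shapes involved, the Cluster and K0smotronControlPlane pair I'm applying looks roughly like this (the name, namespace, and k0s version here are illustrative rather than my exact manifests; the RemoteCluster, Machine, K0sWorkerConfig, and RemoteMachine objects follow the same guide):

apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: remote-test
  namespace: default
spec:
  controlPlaneRef:
    apiVersion: controlplane.cluster.x-k8s.io/v1beta1
    kind: K0smotronControlPlane
    name: remote-test
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: RemoteCluster
    name: remote-test
---
apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: K0smotronControlPlane
metadata:
  name: remote-test
  namespace: default
spec:
  version: v1.27.2-k0s.0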

When I do this, the control plane pod and etcd pod launch but never finish starting, because the various k0smotron controller managers keep restarting etcd in a reconciliation failure loop.

What I gather is that when the pods launch, the various controller managers try to talk either to the new control plane or to the k0s cluster API, and both paths fail:

  1. The k0s cluster has a self-signed certificate, so I get x509 errors, which triggers a reconciliation loop that constantly deletes and recreates etcd.
  2. Etcd never manages to start, so when k0smotron tries to talk to the K0smotronControlPlane API it gets connection refused, which triggers the same loop of deleting and recreating etcd.

I haven't been able to figure out how to deal with this, and I'm not sure what I'm doing wrong. I haven't found configuration options for disabling TLS verification, and it seems wrong that a failed reconcile would keep deleting and recreating the etcd pod, since that prevents the control plane from ever reaching a state that would end the loop.

I'm happy to provide logs, information on my setup, and so forth to figure this out.
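
In case it helps, this is roughly how I've been collecting logs and events while reproducing (namespace and deployment names are from my k0smotron install; adjust to wherever the child control plane pods land):

kubectl -n k0smotron logs deploy/k0smotron-controller-manager-control-plane --all-containers --tail=200
kubectl -n k0smotron get events --sort-by=.lastTimestamp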

prannonpendragas commented 3 weeks ago

I think I figured out the problem. I was installing v1.0.0 of k0smotron, but the clusterctl command was installing v0.9.6 of the control-plane, bootstrap, and infrastructure providers.

I rolled my install back to v0.9.6 of k0smotron and got things consistent, and now my control plane seems to be working as expected.
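
For anyone who hits the same mismatch: clusterctl can pin provider versions explicitly, so (assuming a matching release exists for each provider) something like this keeps the three providers on one version:

clusterctl init --bootstrap k0sproject-k0smotron:v0.9.6 \
                --control-plane k0sproject-k0smotron:v0.9.6 \
                --infrastructure k0sproject-k0smotron:v0.9.6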

I still welcome any comments in case there is anything of interest in my report.

jnummelin commented 3 weeks ago

The k0s cluster has a self-signed cert and I get x509 errors as a result of that

You mean the mgmt cluster (the single-node one) has a self-signed cert? Where did you see the x509 errors, in the k0smotron controller(s)? k0smotron, as it's running in pod(s), uses normal service account access, so it should get the CA to trust injected automatically. Dunno how that could result in x509 errors 🤔
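
If you want to double-check which CA gets injected for in-cluster clients, a throwaway pod shows the mounted service account CA (busybox here is just an example image):

kubectl run ca-check --rm -it --restart=Never --image=busybox -- \
  cat /var/run/secrets/kubernetes.io/serviceaccount/ca.crt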

I haven't found configuration options for disabling TLS verification

You should not need to do this for the "child" control planes. What happens is that k0smotron/CAPI creates all the needed certs and also the kubeconfig. The kubeconfig is generated with the needed CA in place, so clients should always trust it.
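
For example, you can pull the generated kubeconfig and inspect the embedded CA with either of these (the cluster name is illustrative):

clusterctl get kubeconfig remote-test > remote-test.kubeconfig
kubectl get secret remote-test-kubeconfig -o jsonpath='{.data.value}' | base64 -d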

I was installing v1.0.0 of k0smotron, but the clusterctl command was installing v0.9.6 of control-plane, bootstrap, and infrastructure modules

hmm, not sure I read this correctly but did you essentially have two k0smotron setups installed, one with kubectl apply ... and one with clusterctl init ...?

clusterctl command was installing v0.9.6...

oh, that reminds me: we need to figure out how to automate updating of the clusterctl metadata.yaml file, too easy to forget to update it for every release 🤦

prannonpendragas commented 2 weeks ago

hmm, not sure I read this correctly but did you essentially have two k0smotron setups installed, one with kubectl apply ... and one with clusterctl init ...?

I think yes, this is essentially what I did. I installed using the install.yaml, and I also installed additional components using clusterctl init. I end up with these four pods.

[22:41:35][root]@[worthy-aquarium-08330][~]$ kubectl -n k0smotron get pods
NAME                                                          READY   STATUS    RESTARTS   AGE
k0smotron-controller-manager-75bfdfdfdb-w6ss6                 2/2     Running   0          33h <---- created by install.yml
k0smotron-controller-manager-bootstrap-545bc97cfc-fp5p9       2/2     Running   0          33h <---- created by clusterctl bootstrap
k0smotron-controller-manager-control-plane-56669586df-2vk5c   2/2     Running   0          33h <---- created by clusterctl control-plane
k0smotron-controller-manager-infrastructure-8f8547c76-597hh   2/2     Running   0          33h <---- created by clusterctl infrastructure

I was following the guide at https://docs.k0smotron.io/stable/install/#full-installation.

I did this because when I did the "full" install via the install.yaml, I was missing CRDs that would allow me to properly set up RemoteMachines. I think I was missing the clusters.cluster.x-k8s.io/v1beta1 CRD in particular. I can't remember exactly, though; I'd need to rerun my whole setup.
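
For what it's worth, the check I use now to see which of the relevant CRDs are actually present is roughly:

kubectl get crds | grep -E 'cluster\.x-k8s\.io|k0smotron\.io'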

Ultimately, since I have a working config now, I think I might have been creating some sort of weird version conflict between the components installed with install.yaml and the other components installed with clusterctl. It's also entirely possible that I'm doing this completely wrong and misinterpreting the installation instructions.

Kinda leads me to a few vague questions:

  1. Why are CRDs "missing" from the "full" install.yaml? Is the "full" install actually full?
  2. Am I improperly duplicating my install with unnecessary stuff?

makhov commented 2 weeks ago

Hello!

  1. k0smotron (in both the full and the Cluster API installations) provides only its own CRDs. But when you run clusterctl init, clusterctl also installs the Cluster API core components and CAPI CRDs, including clusters.cluster.x-k8s.io/v1beta1.
  2. Potentially, it can cause some issues, since two controllers are watching the same resources. It's safe to just delete the k0smotron-controller-manager deployment (see the command below).
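
Something like this should remove the duplicate controller (namespace taken from your pod listing):

kubectl -n k0smotron delete deployment k0smotron-controller-manager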

Also, we've updated the metadata.yaml file, so you can upgrade the Cluster API components. I think re-running the init command should be enough:

clusterctl init --bootstrap k0sproject-k0smotron \
                --control-plane k0sproject-k0smotron \
                --infrastructure k0sproject-k0smotron