monogon-dev / monogon

The Monogon Monorepo. May contain traces of peanuts and a ✨pure Go Linux userland✨. Work in progress!
https://monogon.tech
Apache License 2.0

node: Adding KubernetesWorker to a node in New state blocks rollout of other nodes #234

Closed fionera closed 1 year ago

fionera commented 1 year ago

When setting up a new cluster, you can currently add the KubernetesWorker role to a node that is still in the New state. Doing so prevents other nodes from joining the k8s cluster. Approving the nodes that are still in New fixes it.

Solution: block adding roles to nodes in the New state.

q3k commented 1 year ago

The following was observed on production:

  1. N nodes registered into the cluster, but not all of them had been approved.
  2. All N nodes had the KubernetesWorker role added to them.
  3. No new nodes appeared in kubectl, including those that had KW applied and were approved.
  4. Once all of the nodes had been approved, the nodes we expected to see in step 3 finally appeared.

In other words:

What we expected to happen:

All nodes that have the KW role and are approved should have appeared in kubectl at step 3.

What we observed:

Nodes got added to kubectl only once there were no more nodes that were both KW and unapproved.

Interpretation:

Having nodes that are both KW and unapproved might block some logic from making progress in adding nodes to kubectl.

q3k commented 1 year ago

I'm not able to replicate this in a test.

This is the scenario I'm using:

  1. I create a cluster with three nodes. The first node, A, is the initial bootstrap node and also runs the Kubernetes control plane. Nodes B and C are not yet approved, i.e. they are in NEW.
  2. I add a KubernetesWorker role to node B. This succeeds. Node B is now NEW, KW.
  3. I approve node C. This succeeds. Node C is now UP.
  4. I add a KubernetesWorker role to node C. This succeeds. Node C is now UP, KW.
  5. Node C starts Kubernetes services and appears in kubectl. I expected it not to appear, as outlined above.
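
The scenario above can be replayed against a toy model to make the two competing rules explicit. This is only a sketch under stated assumptions: `node`, `expectedVisible` and `blockedVisible` are hypothetical names, nothing here is Monogon code, and the "blocking rule" merely encodes the symptom described in the report, not a confirmed cause. Node A is left out because the thread does not say whether it carries the KW role.

```go
package main

import (
	"fmt"
	"sort"
)

// node is a hypothetical model of a cluster node, not a Monogon type.
type node struct {
	approved bool // false means the node is still NEW
	kw       bool // has the KubernetesWorker role
}

// expectedVisible lists the nodes that are approved and have KW, which is
// what the report says should show up in kubectl.
func expectedVisible(c map[string]*node) []string {
	var out []string
	for id, n := range c {
		if n.approved && n.kw {
			out = append(out, id)
		}
	}
	sort.Strings(out)
	return out
}

// blockedVisible models the reported symptom: nothing shows up while any
// node is both KW and unapproved.
func blockedVisible(c map[string]*node) []string {
	for _, n := range c {
		if n.kw && !n.approved {
			return nil
		}
	}
	return expectedVisible(c)
}

func main() {
	// Step 1: B and C are registered but not approved.
	c := map[string]*node{"B": {}, "C": {}}
	c["B"].kw = true       // step 2: B is NEW, KW
	c["C"].approved = true // step 3: C is UP
	c["C"].kw = true       // step 4: C is UP, KW

	fmt.Println("expected rule:", expectedVisible(c)) // expected rule: [C]
	fmt.Println("blocking rule:", blockedVisible(c))  // blocking rule: []
}
```

In the test, node C did appear, which matches the expected rule rather than the blocking rule.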

@fionera Does this match what you've seen in prod, or am I misunderstanding the report?

q3k commented 1 year ago

I also tried swapping steps 2 and 3, and that works as well.

q3k commented 1 year ago

@fionera Have you observed this behaviour during the newest cluster re-deployments?

q3k commented 1 year ago

Closing this, as it could not be replicated and has not happened in production again.