khuedoan / homelab

Fully automated homelab from empty disk to running services with a single command.
https://homelab.khuedoan.com
GNU General Public License v3.0

Upon rebuilding, first node in inventory.yaml doesn't rejoin existing cluster. #119

Closed · llajas closed this issue 7 months ago

llajas commented 10 months ago

Describe the bug

When rebuilding the first node in the inventory, the node goes rogue, creates its own cluster away from the rest of the inventory, and causes routing conflicts when on the same VLAN/subnet.

To reproduce

Hey Khue. Thank you for all the work on this repo - I've been running it for quite some time and it's become my playground to explore new tech. Wanted to advise on an issue I've found myself in and seek some guidance, if possible.

If/when the first node in the inventory fails, a subsequent re-run of make will cause that node to create its own cluster away from the rest of the inventory that is already running. The same thing happens if you simply shut down the first node in the inventory and then run 'make': the node creates its own cluster on the same subnet, causing addressing conflicts.

Steps to reproduce the behavior:

  1. On an existing, up-and-running cluster built with this repo, shut down the first node in the inventory but leave the remaining inventory up and running.
  2. Start the tools container and run make.
  3. Bootstrapping occurs as expected: the node boots via Wake-on-LAN and obtains its Ignition config.
  4. After several minutes, provisioning fails, typically at the smoke tests.
  5. Running kubectl from the tools container shows a lone node in its own cluster; trying to run kubectl from the controller where I do my management results in multiple x509 certificate errors since the kube context is no longer valid.
  6. SSH'ing into any of the other nodes in the inventory and running k3s kubectl get nodes shows the first node as NotReady,SchedulingDisabled.

Expected behavior

First node would rejoin existing cluster and not create its own cluster.

While I understand that this is due to the Ansible steps below, having some sort of catch here would be nice.

Given that there would not be a kube context on a first run, detection may pose a problem, if that makes sense. If anything, I'm sure an adjustment of the Ansible to avoid the above 3 steps and instead copy the token from any other node would work, but I'm honestly not versed in Ansible enough to adjust things comfortably. Any ideas here? Thanks in advance for your time.
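To illustrate the idea, here is a rough sketch only, not this repo's actual tasks: it assumes the default k3s token path, an inventory group named masters, and arbitrarily picks groups['masters'][1] as a node that is still part of the existing cluster.

```yaml
# Sketch of "copy the token from any other node" - names and structure are
# illustrative assumptions, not the repo's real role.
- name: Read the cluster token from a master that is still in the cluster
  ansible.builtin.slurp:
    src: /var/lib/rancher/k3s/server/node-token
  register: existing_token
  delegate_to: "{{ groups['masters'][1] }}"
  run_once: true

- name: Point the rebuilt node at the existing cluster instead of a new one
  ansible.builtin.copy:
    dest: /etc/rancher/k3s/config.yaml
    content: |
      server: https://{{ groups['masters'][1] }}:6443
      token: {{ existing_token.content | b64decode | trim }}
```

The point is just that the rebuilt node would join with a token that already exists somewhere in the cluster, rather than minting a new one.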

Screenshots

N/A

Additional context

4 × Lenovo ThinkCentre M700, per node:

  - CPU: Intel Core i5-6600T @ 2.70GHz
  - RAM: 16GB DDR4
  - SSD: 512GB

OS is Fedora 38.

llajas commented 10 months ago

As an attempted workaround, I tried adjusting the inventory by moving the initial node to the bottom of the master list. Setup went through, but the result was unfortunately the same: the node appears to have created its own cluster.

It sounds as if the token generated at provisioning might only live on the first node? I'm not sure. I'll likely be rebuilding later this evening.

khuedoan commented 10 months ago

Moving the node to the bottom is the recommended method: https://homelab.khuedoan.com/how-to-guides/add-or-delete-nodes (to replace a master node, remove it first, then move it to the bottom and re-apply).
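For illustration, assuming an inventory shaped roughly like the sketch below (the hostnames and nesting are placeholders, not the actual inventory.yaml), the rebuilt master is moved to the bottom so a surviving node stays first and acts as the join target when make is re-run:

```yaml
# inventory.yaml - illustrative shape only; real hostnames and groups may differ.
metal:
  children:
    masters:
      hosts:
        metal1:   # now first; already part of the existing cluster
        metal2:
        metal0:   # the rebuilt node, removed from the cluster and moved to the bottom
```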

You can use a static cluster token and encrypt it with Ansible Vault; however, I decided to create the token on the first node so we don't have to specify one. We could check whether the token already exists on other nodes, but that would quickly increase the complexity.
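For reference, a minimal sketch of the static-token approach; the variable name k3s_token and the use of /etc/rancher/k3s/config.yaml here are assumptions for illustration, not how this repo is wired.

```yaml
# Sketch only. Generate the vaulted value once, e.g.:
#   ansible-vault encrypt_string 'some-long-random-string' --name 'k3s_token'
# and put the resulting !vault block in group_vars. Every server then renders
# the same token, so a rebuilt node joins the existing cluster instead of
# creating a new one.
- name: Write k3s config with the shared cluster token
  ansible.builtin.copy:
    dest: /etc/rancher/k3s/config.yaml
    content: |
      token: {{ k3s_token }}
```

The trade-off is having one more secret to generate and manage, versus the current zero-configuration first run.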

khuedoan commented 7 months ago

Closing as it is working as intended, but I'm happy to review a PR if someone wants to take a crack at improving it (without providing a token manually).