llajas closed this issue 7 months ago.
As an attempted workaround, I tried adjusting the inventory by moving the initial node to the bottom of the master list. Setup went through, but the result was unfortunately the same: the node appears to have created its own cluster.
It sounds as if the token generated at provisioning might be unique to the first node? I'm not sure. I'll likely be rebuilding later this evening.
Moving the node to the bottom is the recommended method: https://homelab.khuedoan.com/how-to-guides/add-or-delete-nodes (to replace a master node, remove it, then move it to the bottom and re-apply).
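For illustration, a minimal sketch of what that reordering might look like; the group name, hostnames, and addresses here are hypothetical, not the repo's actual inventory layout:

```yaml
# inventory/hosts.yml (hypothetical layout)
# metal0 was rebuilt, so it is removed from the top and re-added at the
# bottom; metal1, a healthy cluster member, is now first, so the rebuilt
# node joins the existing cluster instead of bootstrapping a new one.
metal:
  hosts:
    metal1:
      ansible_host: 192.168.1.12
    metal2:
      ansible_host: 192.168.1.13
    metal0:               # the rebuilt node, moved to the bottom
      ansible_host: 192.168.1.11
```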
You can use a static cluster token and encrypt it with Ansible Vault; however, I decided to create the token on the first node so we don't have to specify one. We could check whether the token exists on other nodes, but that would quickly increase the complexity.
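For anyone who wants the static-token route anyway, here is a minimal sketch, assuming a variable named `k3s_token` and the upstream k3s installer; neither the variable name, the paths, nor the install command reflect how this repo is actually wired up:

```yaml
# group_vars/metal.yml (hypothetical path and variable name).
# Generate the encrypted value with:
#   ansible-vault encrypt_string 'your-static-token' --name 'k3s_token'
k3s_token: !vault |
  $ANSIBLE_VAULT;1.1;AES256
  ...ciphertext...

# Example task consuming it. Because every node is given the same token,
# a rebuilt first node joins the existing cluster instead of seeding a
# new one.
- name: Install k3s server with a static cluster token
  ansible.builtin.shell: |
    curl -sfL https://get.k3s.io | K3S_TOKEN='{{ k3s_token }}' sh -s - server
  args:
    creates: /usr/local/bin/k3s
```

The trade-off is the one noted above: you now have a secret and a vault password to manage, in exchange for provisioning that behaves the same no matter which node comes first.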
Closing as it is working as intended, but I'm happy to review a PR if someone wants to take a crack at improving it (without providing a token manually).
Describe the bug
When rebuilding the first node in the inventory, the node will go rogue, creating its own cluster apart from the rest of the inventory and causing routing conflicts when on the same VLAN/subnet.
To reproduce
Hey Khue. Thank you for all the work on this repo - I've been running it for quite some time and it's become my playground to explore new tech. Wanted to advise on an issue I've found myself in and seek some guidance, if possible.
If/when the first node in the inventory fails, a subsequent re-run of `make` will cause the node to create its own cluster away from the rest of the inventory that is already running. This scenario would also occur if you were to simply shut down the first node in the inventory and then run `make`; the node creates its own cluster on the same subnet, causing conflicts with addressing.

Steps to reproduce the behavior: run `make`. `kubectl` from the tools container will show a lone node in its own cluster; trying to run `kubectl` from my controller, where I do my management, results in multiple x509 certificate errors since the kube-context is no longer valid. `k3s kubectl get nodes` will show the first node as `NotReady,SchedulingDisabled`.

Expected behavior
The first node would rejoin the existing cluster rather than create its own.
While I understand that this is due to the Ansible steps that generate the cluster token on the first node, having some sort of catch here would be nice/convenient.
Given that there would be no kube-context on a true first run, detection may pose a problem, if that makes sense. If anything, I imagine the Ansible could be adjusted to skip those three steps and instead copy the token from any other node (a rough sketch of that idea follows), but I'm honestly not versed enough in Ansible to adjust things comfortably. Any ideas here? Thanks in advance for your time.
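To make the idea concrete, here is a rough sketch of what copying the token from another node might look like. This is an assumption-laden illustration, not code from the repo: the group name `metal` is hypothetical, and the token path is k3s's default `/var/lib/rancher/k3s/server/token`. It also doesn't handle the true first-run case where no node has a token yet, which is exactly the added complexity mentioned above.

```yaml
# Hypothetical tasks: reuse an existing node's cluster token instead of
# generating a fresh one on whichever host happens to be first.
- name: Check whether this node already has a cluster token
  ansible.builtin.stat:
    path: /var/lib/rancher/k3s/server/token
  register: k3s_token_file

- name: Read the token on nodes that have one
  ansible.builtin.slurp:
    src: /var/lib/rancher/k3s/server/token
  register: k3s_token_raw
  when: k3s_token_file.stat.exists

- name: Use the first token found anywhere in the group
  ansible.builtin.set_fact:
    k3s_token: >-
      {{ groups['metal']
         | map('extract', hostvars, 'k3s_token_raw')
         | selectattr('content', 'defined')
         | map(attribute='content')
         | map('b64decode')
         | first
         | trim }}
```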
Screenshots
N/A
Additional context
4 × Lenovo ThinkCentre M700, per node:
- CPU: Intel Core i5-6600T @ 2.70GHz
- RAM: 16GB DDR4
- SSD: 512GB
OS is Fedora 38.