k3s-io / k3s-ansible

Apache License 2.0

Rebuilding first server in HA setup #269

Closed gmautner closed 10 months ago

gmautner commented 10 months ago

I tried to simulate a problem whereby the first server of an HA setup (the one installed with the --cluster-init flag) is lost and has to be rebuilt from scratch.

I ran k3s-uninstall.sh and deleted all k3s data, then re-applied the playbook.

The problem is that, in this scenario, the rebuilt server has to join the remaining servers instead of being reinstalled with --cluster-init. Because the server was completely lost, it has no local data telling it that it was a member of an existing cluster. So it bootstraps a new k3s server with its own etcd and everything else, disconnected from the remaining servers.

I understand that the Ansible playbook has no way of knowing this beforehand, so it cannot decide whether to install k3s with --cluster-init or by joining an existing cluster. Maybe we could add a cluster_init variable to indicate whether a pre-existing cluster has to be joined or a new one created.
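To make the proposal concrete, here is a minimal Python sketch of the decision such a variable would drive. It is not the playbook's code: `server_role` and its `cluster_init` parameter are hypothetical names used only for illustration, and the only behavior taken from this thread is that the first server normally starts with --cluster-init.

```python
def server_role(host, servers, cluster_init=True):
    """Decide how a server should be installed (illustrative only).

    host:         the server being provisioned
    servers:      server hosts in inventory order
    cluster_init: hypothetical flag; False means "a cluster already
                  exists, so every server must join it"
    """
    if host not in servers:
        raise ValueError(f"{host} is not in the server list")
    # Only the first listed server may bootstrap, and only when a new
    # cluster is actually wanted.
    if cluster_init and host == servers[0]:
        return "cluster-init"
    return "join"


servers = ["192.16.35.10", "192.16.35.11", "192.16.35.12"]
print(server_role("192.16.35.10", servers))                      # new cluster
print(server_role("192.16.35.10", servers, cluster_init=False))  # rebuild case
```

With `cluster_init=False`, the first host is treated like any other server and joins, which is the behavior wanted when rebuilding a lost first server.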

Can anyone think of a simple solution?

dereknola commented 10 months ago

This playbook assumes that the first node listed under server hosts is the init node.

So if you swap the first node with another server node and then use the ansible-playbook --limit flag, you can trick the playbook into provisioning the original first node again, but this time as a secondary node. You also need to change api_endpoint to point at one of the two servers that are still up.

Initial inventory:

k3s_cluster:
  children:
    server:
      hosts:
        192.16.35.10:
        192.16.35.11:
        192.16.35.12:

  vars:
    api_endpoint: 192.16.35.10

So you uninstall k3s on 192.16.35.10. Now make your inventory look like:

k3s_cluster:
  children:
    server:
      hosts:
        192.16.35.11:
        192.16.35.10:
        192.16.35.12:

  vars:
    api_endpoint: 192.16.35.11

Then run ansible-playbook --limit 192.16.35.10 ./playbook/site.yml. It should reprovision just 192.16.35.10 as a "joining" server, not the initial one.
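The inventory edit described in this comment can be expressed as a small helper. This is a rough Python sketch, not part of k3s-ansible: `demote_first_server` is a hypothetical name, and it only computes the new host order and api_endpoint, leaving the inventory file itself to be edited by hand.

```python
def demote_first_server(servers, lost):
    """Return (new_order, api_endpoint) so that `lost` is no longer first.

    servers: server hosts in inventory order (the first is the init node)
    lost:    the host that was wiped and has to rejoin the cluster
    """
    if lost not in servers:
        raise ValueError(f"{lost} is not in the server list")
    survivors = [h for h in servers if h != lost]
    if not survivors:
        raise ValueError("no surviving server left to join")
    # A surviving server goes first; the rebuilt host follows as a joiner.
    new_order = [survivors[0], lost] + survivors[1:]
    # Point api_endpoint at a server that is still up.
    return new_order, survivors[0]


servers = ["192.16.35.10", "192.16.35.11", "192.16.35.12"]
new_order, api_endpoint = demote_first_server(servers, "192.16.35.10")
print(new_order)     # ['192.16.35.11', '192.16.35.10', '192.16.35.12']
print(api_endpoint)  # 192.16.35.11
```

The result matches the second inventory in this comment: 192.16.35.11 becomes the first listed server and the api_endpoint, and 192.16.35.10 is the host to pass to --limit.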