syself / cluster-api-provider-hetzner

Cluster API Provider Hetzner šŸš€ Kubernetes Infrastructure as Software šŸ”§ Terraform/Kubespray/kOps alternative for running Kubernetes on Hetzner
https://caph.syself.com
Apache License 2.0

git based management with flux not possible #1136

Open danielr1996 opened 5 months ago

danielr1996 commented 5 months ago

/kind bug

What steps did you take and what happened: I tried managing my clusters with flux to fully embrace GitOps. This would have the advantage of being able to completely wipe the management cluster without any backup and restore it from git. However, when I perform the following steps:

1) define the ClusterAPI manifests (HetznerCluster, KubeadmControlplane, ...) and push them to a git repository

2a) create a bootstrap cluster on kind
2b) install flux and clusterapi on the bootstrap cluster
2c) connect the git repository to the bootstrap cluster
--> ClusterAPI definitions get correctly applied and the cluster starts up

3) delete the bootstrap cluster

4) repeat 2a), 2b), 2c)
--> ClusterAPI definitions get correctly applied and a second cluster starts up
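For reference, steps 2a) to 2c) roughly correspond to the following (a minimal sketch, assuming kind, clusterctl and flux are installed locally; the repository URL and path are placeholders, not the actual demo repo):

```sh
# 2a) bootstrap cluster on kind
kind create cluster --name bootstrap

# 2b) install Cluster API with the Hetzner provider, then the flux controllers
clusterctl init --infrastructure hetzner
flux install

# 2c) point flux at the git repository that holds the ClusterAPI manifests
flux create source git clusters \
  --url=https://github.com/<owner>/<repo> \
  --branch=main
flux create kustomization clusters \
  --source=GitRepository/clusters \
  --path="./clusters" \
  --prune=true \
  --interval=5m
```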

What did you expect to happen: Instead of starting a second cluster, I would expect the provider to recognize that the desired cluster already exists and simply do nothing in that case.

Anything else you would like to add: I can understand where this issue comes from: I only define the MachineDeployment, but the hetzner-controller then provisions new HCloudMachines that are not stored in git. Therefore, when I completely wipe the management cluster, that reference is lost and the hetzner-controller can't know that my desired cluster is still there.

However, I still think that this use case is essential, because after all ClusterAPI was designed to allow declarative cluster management.

A solution could be to use the labels set on the servers to check which nodes belong to the desired cluster and with that information restore the HCloudMachines that were wiped.
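For illustration, a check along these lines could list the servers that already belong to the cluster (just a sketch; the exact label key that CAPH sets on the servers is an assumption here, not verified):

```sh
# list the HCloud servers carrying the cluster label (label key is an assumption)
hcloud server list --selector "caph-cluster-my-cluster=owned"
```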

If needed I can provide a demo repository to recreate the problem.

Environment:

batistein commented 5 months ago

Hi @danielr1996, a demo repository would be great! In our Kubernetes offering, we heavily use a GitOps flow (ArgoCD) - as this uses CAPH under the hood, this should work for you as well. So it would be helpful to be able to reproduce it.

danielr1996 commented 5 months ago

Thanks for the fast reply, and great to hear that this should theoretically work. I'll prepare a demo repo.

danielr1996 commented 5 months ago

I've created a demo repo: https://github.com/kubecraft-k8s/cluster-api-provider-hetzner-1136. Just fill in the credentials and run the script; after a short time you will see that there are two clusters instead of one.

batistein commented 5 months ago

@danielr1996, I just had a look at the script to understand its functionality. I haven't executed it yet. From what I can tell, you perform the following steps:

  1. Use a kind cluster as a bootstrap.
  2. Install all operators.
  3. Install the secret containing the credentials.
  4. Apply the cluster-related manifests.
  5. Wait for the cluster to become operational.
  6. Retrieve the kubeconfig to access the workload cluster.
  7. Delete the bootstrap cluster.
  8. Create a new empty bootstrap cluster.
  9. Repeat from step 2.

However, I'm uncertain about the expected outcome due to the following reasons:

  1. You don't install any Cloud Controller Manager (CCM) or Container Network Interface (CNI) - simply checking the MachineDeployment does not guarantee that the cluster is ready and functional. Based on this script, it appears the cluster will never be fully initialized.
  2. Deleting the bootstrap cluster, which serves as the management cluster, leads to the deletion of all custom resources representing the workload cluster. Consequently, the workload cluster loses its "control plane." For a workload cluster to be self-sustaining, you need to deploy the operators on the workload cluster and then transfer the custom resources from the management/bootstrap cluster to the workload cluster (see the sketch after this list). This process is not evident in the script.
  3. If you then create a new kind cluster as a bootstrap/management cluster, which in turn starts a workload cluster with the same name, it merely results in a new cluster. This is akin to creating an empty database, populating it with some demo data, deleting the database, and then creating a new one with the same demo data. Although the content remains the same, these are two independent and isolated operations.
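For reference, the pivot described in point 2 roughly corresponds to the following (a sketch; the kubeconfig path is a placeholder):

```sh
# install the same operators on the workload cluster
clusterctl init --kubeconfig=./workload.kubeconfig --infrastructure hetzner

# move the Cluster API custom resources from the bootstrap cluster to the
# workload cluster, making it self-sustaining
clusterctl move --to-kubeconfig=./workload.kubeconfig
```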

If there's something I've misunderstood or if you have any questions, please feel free to ask or correct me.

danielr1996 commented 4 months ago

Hi @batistein ,

I didn't install the CCM and CNI because it doesn't really matter for the example.

Points 2 and 3 are exactly my point; I have two use cases where I would need this:

1) to restore a permanent management cluster from a backup
2) to spin up an ephemeral management cluster on kind to update the workload cluster

I got the backup part working with velero, but it feels a bit awkward because, to update the workload cluster, I need the following steps:

0) update the yaml for the workload cluster
1) spin up the kind cluster
2) restore from backup
3) apply the changes to the workload cluster
4) backup the changes
5) commit the changes
6) apply manifests for flux to sync a git repo

While it could theoretically be just (keeping the step numbers from the list above):

0) update the yaml for the workload cluster
5) commit the changes
1) spin up the kind cluster
6) apply manifests for flux to sync a git repo
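For reference, steps 1) to 4) of the current workflow roughly look like this (a sketch; backup names, file name and namespaces are assumptions, and velero plus the ClusterAPI operators have to be installed on the kind cluster first):

```sh
# 1) spin up the kind cluster (installing the operators and velero is omitted here)
kind create cluster --name bootstrap

# 2) restore the ClusterAPI resources from the last backup
velero restore create --from-backup mgmt-backup

# 3) apply the updated workload cluster manifests
kubectl apply -f workload-cluster.yaml

# 4) back up the changed state again
velero backup create mgmt-backup-2 --include-namespaces default
```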

It may not be a bug because it works as intended, but I still think it would be very useful to have this feature, because it heavily simplifies the operation of a cluster and eliminates the need for a separate backup solution.

You said that you also use a GitOps workflow, so do you really have the git repo as the single source of truth, or do you also have a separate backup solution like velero?

janiskemper commented 6 days ago

@danielr1996 sorry for not responding here! Just saw this old issue. I didn't 100% understand your use case. I believe that you want to backup and restore a management cluster, and somehow use flux for this.

In the CAPI community (e.g. on Slack), you'll find quite some people that use velero for this. Both backup and restore. Can you just confirm that this is a use case you have, and if so, why you don't want to use velero for both operations?

Considering GitOps: you mainly need to watch out that CAPI creates a lot of resources on its own, which will not be part of your manifests in the Git repo. If you keep that in mind, then GitOps is no problem at all in general. I just don't fully understand how you want to combine GitOps and backups.
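For illustration, the split usually looks roughly like this (kinds taken from this thread; a sketch, not an exhaustive list):

```sh
# in git:           Cluster, HetznerCluster, KubeadmControlPlane,
#                   MachineDeployment, HCloudMachineTemplate, ...
# created by CAPI:  Machines, HCloudMachines, generated secrets/kubeconfigs, ...

# the controller-created resources only exist in the management cluster:
kubectl get machines,hcloudmachines --all-namespaces
```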