
Changing how we use ConfigSync #210

Open johnbelamaric opened 1 year ago

johnbelamaric commented 1 year ago

Currently, we set up ConfigSync with a single "RootSync" that slurps the entire repository on the main branch into the cluster. This is not how it is typically used. It also leads to a problem due to how Porch manages package revisions: when you "delete" a package revision in Porch, it deletes the tag but leaves the underlying directory in the main branch, so the workload is not removed by ConfigSync.

To fix this, I propose we change how we use ConfigSync. This change will actually set us up better for a couple of things in the future:

  1. It separates "Publish" from "Deploy" and enables a single workload cluster to pull from multiple repositories. This means we can consume packages that do not vary across clusters from a single repo (not in R1, though). It can also enable progressive rollout (with some more work).
  2. It provides a way to integrate with ArgoCD as the GitSyncer (and maybe Flux - I am not sure exactly yet, though the Flux team already has that working anyway), with minimal changes.

So, what's the change? It's as follows:

  1. Rather than slurping the whole repository, we configure the workload cluster to slurp a single directory from the main branch, say "/deployed-packages".
  2. We will deploy to the repository a package containing RootSync objects that refer to the specific package revisions we want deployed on the cluster.

Our per-workload-cluster RootSync installed by the bootstrap controller would thus look like this (identical to what we used in the workshop, except for the addition of the directory):

apiVersion: configsync.gke.io/v1beta1
kind: RootSync
metadata:
  name: nephio-workload-cluster-sync
  namespace: config-management-system
spec:
  sourceFormat: unstructured
  git:
    repo: https://github.com/nephio-test/test-edge-01
    branch: main
    dir: /deployed-packages
    auth: ...some auth stuff not shown here...

So, how do we manage the contents of /deployed-packages? We can populate it by deploying a package of that name to that repository. This can be automated; it is not something the user has to do. That package would contain additional RootSync resources (and maybe, at some point, RepoSync resources, which are scoped to managing resources in a single namespace). Those would point back to the same repo, but specify a particular package, like:

apiVersion: configsync.gke.io/v1beta1
kind: RootSync
metadata:
  name: free5gc-upf # one RootSync per deployed package; must not collide with the bootstrap RootSync above
  namespace: config-management-system
spec:
  sourceFormat: unstructured
  git:
    repo: https://github.com/nephio-test/test-edge-01
    branch: main
    dir: /free5gc-upf
    revision: free5gc-upf/v6
    auth: ...some auth stuff not shown here...

Notice the revision field. This tells ConfigSync to pull from that tag and apply the resources in that directory - in other words, to apply that specific package revision. Thus, as we publish new revisions of the package, they do not roll out automatically; we must ALSO update this RootSync. To remove the package from the workload cluster, we delete the RootSync from the deployed-packages package; ConfigSync then removes the RootSync from the workload cluster, and removes the associated resources as well. The actual package still remains in the workload cluster repository; it's just not actually deployed.

Use of an R*Sync per workload is a more standard use of ConfigSync than what we are doing today.

The next question is how we can automate the creation and management of the per-cluster deployed-packages package. For that:

  1. Define a CRD to represent a Cluster for workloads. This would be similar to the Cluster resource created in the workshop; it would represent the consumption (by Nephio) aspect of the cluster, not the provisioning aspect (as needed by CAPI or other cluster provisioners). So, it really comes down to: 1) a K8s endpoint; 2) a credential/secret. Wim is already doing something like this for the bootstrap controller; we will need to tweak that just a little, I think. (Actually, we don't need anything for this except to know the cluster exists and what repo it is associated with; we could get away with using the workload cluster Repository for R1. We would want the cluster record for ArgoCD integration, not for ConfigSync integration.)
  2. Define a CRD to represent the deployment of a package in a cluster: a pair of Cluster and Upstream (repo/package/revision). We may need some association of Cluster and Repo too, with the name of the secret that's installed on the workload cluster or something. Details TBD; a rough sketch of both CRDs follows this list.
  3. Write a controller that consumes these resources and produces a package based on them. This is the deployed-packages package for that cluster.
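For illustration only, here is a rough sketch of what instances of those two CRDs might look like. Everything in it is hypothetical: the group (automation.nephio.org), the kinds (WorkloadCluster and PackageDeployment), and all field names are placeholders for details that are still TBD.

apiVersion: automation.nephio.org/v1alpha1   # hypothetical group/version
kind: WorkloadCluster                        # the consumption aspect of the cluster
metadata:
  name: edge-01
spec:
  endpoint: https://edge-01.example.com:6443 # 1) a K8s endpoint
  credentialSecret: edge-01-kubeconfig       # 2) a credential/secret
  repository: test-edge-01                   # the repo this cluster is associated with
---
apiVersion: automation.nephio.org/v1alpha1   # hypothetical group/version
kind: PackageDeployment                      # a pair of Cluster and Upstream
metadata:
  name: edge-01-free5gc-upf
spec:
  cluster: edge-01
  upstream:
    repo: free5gc-packages
    package: free5gc-upf
    revision: v6

The controller in item 3 would watch these resources and render the corresponding RootSync into the deployed-packages package for edge-01.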

To simulate "Publish == Deploy", the controller could auto-propose and auto-approve the deployed-packages package, if we want.

If you want to use Argo instead of Config Sync, you can have the controller emit Argo Application resources instead of a "package of RootSyncs".
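As a sketch of that alternative, the Argo CD counterpart of the per-package RootSync above might look like the following; the Application name, project, and destination namespace are assumptions, not something this proposal pins down.

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: free5gc-upf
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/nephio-test/test-edge-01
    path: free5gc-upf
    targetRevision: free5gc-upf/v6          # pin to the package revision tag, like revision above
  destination:
    server: https://kubernetes.default.svc  # assumes Argo CD runs in the workload cluster
    namespace: free5gc                      # assumed target namespace
  syncPolicy:
    automated:
      prune: true                           # remove applied resources when they leave the source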

So in the end, to implement this we need:

  1. A change to the RootSync we are already setting up in the workload clusters to include the dir.
  2. A couple new CRDs and an associated controller to reconcile and auto-approve the deployed-packages package for each cluster.
johnbelamaric commented 1 year ago

/cc @henderiw @tliron @s3wong

johnbelamaric commented 1 year ago

/sig automation
/area workload-cluster
/area package-management

tliron commented 1 year ago

I think you're exactly right, but there's a lot of detail to go over. I suggest a presentation during a meeting so that the community is on the same page. I would also suggest that it's a bit late in the release cycle, so maybe this can wait until soon after R1. But still, a strong yes: we are not using Config Sync "correctly".

iamvikaskumar commented 6 months ago

Hi @johnbelamaric, what I have observed is that ConfigSync removes the RootSync from the workload cluster after we remove the RootSync from deployed-packages, but it does not remove the associated resources from the cluster.

johnbelamaric commented 6 months ago

> Hi @johnbelamaric, what I have observed is that ConfigSync removes the RootSync from the workload cluster after we remove the RootSync from deployed-packages, but it does not remove the associated resources from the cluster.

Ooh. Interesting. That could be an issue. I thought it tagged everything to say it was managed by that root sync and would delete it.

iamvikaskumar commented 6 months ago

> Hi @johnbelamaric, what I have observed is that ConfigSync removes the RootSync from the workload cluster after we remove the RootSync from deployed-packages, but it does not remove the associated resources from the cluster.
>
> Ooh. Interesting. That could be an issue. I thought it tagged everything to say it was managed by that root sync and would delete it.

Pointing the RootSync to an empty repo deletes the associated resources.

johnbelamaric commented 6 months ago

> Pointing the RootSync to an empty repo deletes the associated resources.

Ok, great to know. So, we may need to do that first somehow. Maybe have a repo (or branch?) we can point at that represents a deletion, and then finally remove it once the resources are all cleaned up.
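A minimal sketch of that idea, assuming we reserve an empty branch (called empty here; the thread does not settle on a name) as the deletion marker:

apiVersion: configsync.gke.io/v1beta1
kind: RootSync
metadata:
  name: free5gc-upf                # the per-package RootSync being retired
  namespace: config-management-system
spec:
  sourceFormat: unstructured
  git:
    repo: https://github.com/nephio-test/test-edge-01
    branch: empty                  # assumed empty branch acting as the deletion marker
    dir: /
    auth: ...some auth stuff not shown here...

Once the sync reports clean (everything pruned), the RootSync itself can finally be deleted.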

iamvikaskumar commented 6 months ago

> Pointing the RootSync to an empty repo deletes the associated resources.
>
> Ok, great to know. So, we may need to do that first somehow. Maybe have a repo (or branch?) we can point at that represents a deletion, and then finally remove it once the resources are all cleaned up.

The challenge here is to monitor the resources created by the RootSync (for cleanup).

iamvikaskumar commented 3 months ago

> Pointing the RootSync to an empty repo deletes the associated resources.
>
> Ok, great to know. So, we may need to do that first somehow. Maybe have a repo (or branch?) we can point at that represents a deletion, and then finally remove it once the resources are all cleaned up.

Config Sync version 1.16.0 and later supports deletion propagation for all previously applied objects. We need to add the annotation configsync.gke.io/deletion-propagation-policy: Foreground to the RootSync.
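Applied to the per-package RootSync from this proposal, that would look something like this (only the annotation is new; the rest is elided):

apiVersion: configsync.gke.io/v1beta1
kind: RootSync
metadata:
  name: free5gc-upf
  namespace: config-management-system
  annotations:
    configsync.gke.io/deletion-propagation-policy: Foreground  # prune applied objects when this RootSync is deleted
spec:
  ...as in the example above...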