nephio-project / nephio

Nephio is a Kubernetes-based automation platform for deploying and managing highly distributed, interconnected workloads such as 5G Network Functions, and the underlying infrastructure on which those workloads depend.
Apache License 2.0

Orchestrate rollout of bulk changes #717

Open liamfallon opened 5 months ago

liamfallon commented 5 months ago

Original issue URL: https://github.com/kptdev/kpt/issues/3348
Original issue user: https://github.com/bgrant0607
Original issue created at: 2022-07-07T15:22:52Z
Original issue last updated at: 2022-11-16T03:10:19Z
Original issue body:

There are a number of types of bulk operations we want to be able to make:

and so on.

The existing edit, propose, approve workflow is sometimes appropriate and desirable, but gets tedious at scale. We'll want to automatically vet and bulk review (diff with previous), preview (diff with live / dry run), and approve changes, then orchestrate gradual rollout (or rollback) of the changes.

Generally I recommend pinning syncs to specific revisions / commits / digests, much as with container images; pulling from head is tantamount to pulling from :latest. Pinning has a number of benefits, such as providing an API-level signal that an update has been pushed, and in this case it also decouples authoring time, which has a human in the loop, from deployment time. We should be able to orchestrate updates of pinned revisions even for RootSync and RepoSync objects in storage, such as by following annotations back to the sources of truth.
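To make the pinned-revision rollout idea concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: `SyncTarget` is a hypothetical stand-in for a RootSync/RepoSync object pinned to a revision, and `rollout_in_waves` and the `is_healthy` callback are invented names, not part of kpt, porch, or Config Sync. It re-pins targets a wave at a time and restores the previous pins if any wave looks unhealthy.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SyncTarget:
    """Hypothetical stand-in for a RootSync/RepoSync pinned to one revision."""
    name: str
    revision: str  # a pinned commit, tag, or digest; never a moving branch head


def rollout_in_waves(
    targets: List[SyncTarget],
    new_revision: str,
    wave_size: int,
    is_healthy: Callable[[SyncTarget], bool],
) -> bool:
    """Re-pin targets wave by wave; roll back every updated target on failure."""
    if wave_size < 1:
        raise ValueError("wave_size must be at least 1")
    previous: Dict[str, str] = {}  # target name -> revision before this rollout
    for start in range(0, len(targets), wave_size):
        wave = targets[start:start + wave_size]
        for target in wave:
            previous[target.name] = target.revision
            target.revision = new_revision
        if not all(is_healthy(t) for t in wave):
            # Restore the old pin on everything touched so far.
            for target in targets:
                if target.name in previous:
                    target.revision = previous[target.name]
            return False
    return True
```

Because the revision is an explicit field rather than "whatever head is", the rollout and the rollback are both just writes of a known value, which is what lets deployment time be separated from authoring time.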

This is where ProdSpec and Annealing fit in: https://www.usenix.org/publications/loginonline/prodspec-and-annealing-intent-based-actuation-google-production

In Kubernetes, KRM solves the NxM problem. For instance, Config Connector, Crossplane, and other Operators orchestrate arbitrary external resources using custom resources (~Assets).

A ProdSpec Partition is kind of like a package or repo in kpt/porch. An Incarnation is approximately a revision. Our sources of data are (currently) git and OCI.

Instead of an out-of-place generation pipeline, with kpt and porch, we're generating eagerly and in place (though package generation blurs this line a bit). This reduces coupling of automation, such as security remediation, and enables interactive workflows, particularly GUIs.

Another model to consider in addition to rollout is coordination, similar to a semaphore or pod disruption budget: https://github.com/kinvolk/nebraska-update-agent
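The coordination model can be sketched as a budget that participants must acquire before updating. This is only an illustration of the semaphore analogy, using Python's standard `threading.BoundedSemaphore`; the `UpdateBudget` class and its method names are hypothetical and do not describe how the linked project or pod disruption budgets are implemented.

```python
import threading


class UpdateBudget:
    """Allows at most max_in_flight concurrent updates (hypothetical sketch)."""

    def __init__(self, max_in_flight: int) -> None:
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def try_acquire(self) -> bool:
        """Return True if an update may start now, False if the budget is spent."""
        return self._slots.acquire(blocking=False)

    def release(self) -> None:
        """Give the slot back once the update has finished or been abandoned."""
        self._slots.release()
```

Unlike a rollout controller that pushes changes in a chosen order, here each participant asks for permission and simply waits its turn when the budget is exhausted.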

As with any rollout mechanism, figuring out whether something went wrong will be a key part as well.

cc @mortent @justinsb @barney-s

Original issue comments:

Comment user: https://github.com/johnbelamaric
Comment created at: 2022-08-11T20:47:51Z
Comment last updated at: 2022-08-11T20:47:51Z
Comment body:

Is this intended to be bulk changes across packages within a cluster, or bulk changes of a package across clusters, or both? It may be helpful to separate those user journeys.

Comment user: https://github.com/johnbelamaric
Comment created at: 2022-09-02T23:40:57Z
Comment last updated at: 2022-09-02T23:40:57Z
Comment body:

Another approach, which at least in this conception does not use pinned commits but instead uses the package lifecycle states to control rollout: https://github.com/GoogleContainerTools/kpt/issues/3455#issuecomment-1212485912

I believe it is possible to decouple the actuation of the rollout (Publish vs pinned commit) from the rollout policy, selection, and evaluation mechanisms, though. That would enable either lifecycle state or commit pinning to be used.
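One way to picture that decoupling is a small interface between the policy side and the actuation side. The Python sketch below is purely illustrative: `Actuator`, `PinnedCommitActuator`, `LifecycleActuator`, and `roll_out` are hypothetical names, and the two actuators only record what they would do rather than calling any real kpt or porch API.

```python
from typing import Dict, Iterable, List, Protocol, Tuple


class Actuator(Protocol):
    """How a chosen revision is made live on a target."""

    def actuate(self, target: str, revision: str) -> None: ...


class PinnedCommitActuator:
    """Actuates by re-pinning a target's sync to a revision (hypothetical)."""

    def __init__(self) -> None:
        self.pins: Dict[str, str] = {}

    def actuate(self, target: str, revision: str) -> None:
        self.pins[target] = revision


class LifecycleActuator:
    """Actuates by marking a package revision as Published (hypothetical)."""

    def __init__(self) -> None:
        self.published: List[Tuple[str, str]] = []

    def actuate(self, target: str, revision: str) -> None:
        self.published.append((target, revision))


def roll_out(targets: Iterable[str], revision: str, actuator: Actuator) -> int:
    """Policy side: selects and orders targets, unaware of how actuation works."""
    count = 0
    for target in sorted(targets):
        actuator.actuate(target, revision)
        count += 1
    return count
```

The same `roll_out` call works with either actuator, so selection, ordering, and evaluation logic would not need to change if the mechanism switched between lifecycle state and commit pinning.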