Package Conditions: determining when a package is ready for deployment

Describe your problem

One issue that has come up for both the human and machine (#3452) consumption workflow is how to know when a package is ready for deployment. That is, knowing when all the required and/or expected customizations are complete; knowing when the package has moved from an abstract to a depoloyable package. On the human consumption side, this relates to providing a guided experience in the GUI or CLI (@droot did some experimentation with a CLI workflow). On the machine side, we want to automatically move a package through the life cycle. One of the goals is to enable humans and machines to do iterative, collaborative editing of the package configuration. If multiple actors are operating on the config, how do we know when we can propose the package for approval into the deployment repo? How do we know when a package is ready for the next step in the life cycle (#3422)?

Proposal

We face a similar issue in standard K8s controllers, and the solution has been Conditions. For example, we control whether a Pod is added to an EndpointSlice based upon its Ready condition. Can we allow packages to publish conditions in a similar way? They could be surfaced via the kpt CLI and via the PackageRevision CR in Porch. This can provide a "punch list" for what needs to be addressed in order for a package to be proposed for deployment.

Packages would need to be able to define and publish their own Conditions. For example, if we need to allocate a value from an external source, we can use a Condition to represent if that has been done or not. It could be done manually or by a machine; either way should automatically flip the condition to "True". We then need some conventions around determining "readiness" for different parts of the life cycle; how do we aggregate a set of potentially package-specific Conditions into the one Condition that indicates readiness for the next phase? One solution would be a convention for a condition for each step in the life cycle (with flexibility to alter that life cycle), along with something like readiness gates.

Using this for rollouts

In #3348 there is a short discussion about using pinned syncs to control rollout. IIUC, that means controlling rollout at the ConfigSync step in the life cycle. This is too late in my opinion; at least it is when we have a 1:1 mapping between deployment repo and cluster. It means I can no longer just inspect the contents of the deployment repository to know what's being delivered in my cluster, but instead I have to look at the various syncs installed in the api server. Those may map to different tags and even different repos all over the place. This degrades the simplicity of the model.

Another approach would be to use status conditions on PackageRevisions. In this model, we can still have a simple 1:1 mapping between a deployment repository and a given cluster's API server. In the case of Porch, an additional condition controlled by external data provides a simple way to do rollouts that is decoupled from the config editing process. To do this, a controller can look at all "proposed" packages to a deployment repo matching some particular label selector, and can twiddle a status Condition. A different controller can then approve packages with the appropriate conditions, and that package merges. This decouples:

The determination of "package ready for deployment" from the config point of view (this maps to Proposal)
The decision of whether to deploy the package (this maps to twiddling the ReadyForApproval Condition)
The actual approval/deployment of the package (maps to Approval and merge to main)

The pinned syncs may be the right approach if the cardinality of repo:cluster is 1:N rather than 1:1, but I think a 1:1 cardinality is simpler to understand and lends itself to this approach instead.

cc @droot @yuwenma @bgrant0607 @justinsb @mortent

Here is a more drawn out description of how Conditions enable collaborative editing of the config, as well as decouple it from roll out:

We have the concept of Proposed and Published packages against deployment repositories; this is the distinction between "ready for rollout" and "the intended state". With the concept of "Conditions" and "Condition Gates" for PackageRevisions, we can have multiple actors working on a package to converge on a "ready for rollout" state. The roll out controller ignores all packages prior to that.

For example:

A fan-out CR identifies "ensure this package is running on these clusters" (or "these packages", potentially). The fan-out controller uses that, and combines the data with cluster-specific context (and potentially performs other mutations via functions), and pushes the package as a Draft, setting a Condition saying that this first piece is done.
Some of the Draft packages have unresolved dependencies, echo of which trigger a particular Condition to be False. Other controllers are running, and they each understand how to resolve certain Conditions. Once a controller sees the first Condition is True, but that there is a Condition it knows how to resolve that is False, it engages with the package to resolve the dependency (e.g., allocates from IPAM), and updates the Draft. The Condition resolves.
Some packages have another Condition that can only be resolved by a human. A controller sees this and raises a ticket. A human responds the ticket, either directly by editing the config in the GUI, or indirectly by adding something to the ticket that the controller can inject into the config. In either case, the Condition resolves.
Each package can define a "ReadyToPropose" gate, which lists all conditions that must be true to be able to propose the package. Once those are satisfied, the ProposalController proposes those packages.
Finally we are to roll out. We have a Approval controller that looks at all Proposed packages, and combines that with roll out policy and metrics (e.g., roll out the first 1%, watch for errors for X time, then roll out the next 5%, etc.). It can then Approve a subset of the Proposed to follow that policy.

Alternatively to the "Proposed" state, we can use another Condition ReadyToApprove. But the Proposed state plays nicely with the human based workflow.

kpt API

We want to expose the criteria for when a package is ready to be used as part of the package. This is useful for packages in general, so we expose this through the Kptfile rather than implementing this on the porch level.

The Kptfile already contains an Info object with information about the package, so this seems like a logical place to add the new field. We introduce a new field .Info.ReadinessGates that allows users to specify the Conditions that must have the value True for the package to be considered ready:

type PackageInfo struct {
    …
    ReadinessGates []ReadinessGate `yaml:”readinessGates,omitempty” json:”readinessGates,omitempty”`
    …
}

type ReadinessGate struct {
    ConditionType string `yaml:”conditionType,omitempty” json:”conditionType,omitempty”’
}

The structure is chosen to align with the PodReadinessGate type already used in Kubernetes.

Conditions for packages are handled just like regular conditions in Kubernetes, although with its inclusion in the Kptfile it is possible that it will be set by users manually through an editor rather than by controllers. But the conditions being set by controllers through the porch API is probably the most likely use-case here too.

Both the ReadinessGates and the Conditions can be updated by actors (human or controller) working to make the package ready. For example:

As part of trying to fulfill a requirement expressed by an entry in the .Info.ReadinessGates, a controller decides an additional dependency needs to be fulfilled before the package can be considered ready. So it adds an additional ConditionType to .Info.ReadinessGates which will then be noticed by another controller which will satisfy the dependency and then set the corresponding condition to True.

Porch API

To make it easy for controllers to interact with packages for automating updates to packages, we will expose the readiness gate through the porch packagerevision API. The readiness gate will be exposed as a new field in the Spec of the PackageRevision resource:

type PackageRevisionSpec struct {
    …
    ReadinessGates []ReadinessGate `yaml:”readinessGates,omitempty” json:”readinessGates,omitempty”`
    …
}

type ReadinessGate struct {
    ConditionType string `yaml:”conditionType,omitempty” json:”conditionType,omitempty”’
}

This mirrors the field in the Kptfile.

Conditions will be added to the status object of the PackageRevisions resource using the standard Condition type from Kubernetes. The conditions will be stored in the status object of the Kptfile. It feels a bit unusual to have Conditions on a client-side CR like the Kptfile, so I would like to hear what others think about it.

Storing the requirement gate and the conditions in the Kptfile also has the benefit that the git history will leave an audit log.

Controller

We will provide an operator as part of porch that will watch packagerevisions and automatically use the porch API to propose them when the conditions listed in the readiness gates become True.

The absence of any conditions listed in the readiness gates means the package is ready to be published, so the controller will immediately propose the revision. This is not desirable, at least initially while we figure out the best workflow. Therefore, at least in the initial iteration of the package, users will need to use the config.porch.kpt.dev/auto-lifecycle-mgmt: “true” annotation to enable automatic propose for packagerevisions.

Usage

The idea is that the readiness gates defines the conditions that must be met for a package to be deployed. If we consider that a package might be going through multiple levels of cloning and customization before being deployed, it will be common for abstract packages to be published with all or a subset of the conditions in the readiness gate still not fixed. It is also useful for cloned packages to add additional conditions as a result of either special requirements or customizations that leads to additional (or fewer) required conditions.

I like the concepts here. I don't think we have certainty on all of this yet. We can take some first steps though: We have a well-established notion of validation functions; let's surface their results on the PackageRevision status. I like the idea of using Conditions, and if we can duck-type to metav1.Condition then this should feel very natural to users.

We'll likely also want to surface more information - e.g. validation functions can surface structured error/info results, hence duck-typing rather than reuse of metav1.Conditions.

I think from there we can add in human-validations and address the broader use-cases in this issue. I don't yet know how we should store those, I think it might depend on whether we end up moving PackageRevision to CRDs.

I think from there we can add in human-validations and address the broader use-cases in this issue. I don't yet know how we should store those, I think it might depend on whether we end up moving PackageRevision to CRDs.

We do need something beyond validation function status being surfaced sooner rather than later. I would like to be able to demonstrate in the Nephio workflow the use of package conditions to gate a package's publishing, where that condition is resolved by a controller. In particular, I am thinking an IPAM controller which looks at the type of workload, the region, and they type of interface and uses that to allocate an IP and insert it in the configuration. This is an important integration point into existing systems. This requires some way for controllers to set the status (there are ways we could use a function to manage the condition true/false status based on resource content in the package, but I am not sure that will always be possible).

Another possible use is coordinating cross-cluster dependencies, though am not sure this is the right mechanism for that. An example is like the one @sandeepAarna did in the 9/21 Nephio SIG Automation call: one package provisions a cluster, another package provisions a workload on that cluster. The second package needs to wait for the cluster to achieve Ready state. Actually, as I write this down I do NOT think package conditions are the right place for that particular coordination. I'll have to think about how we can do that (see also #3448).

Anyway, I think storing them in the status of a CR in the cluster works for porch, but storing them in a CR stored in the package works for both the CLI-only workflow and the Porch (UI or CLI) workflow. Do we think we need to support the CLI-only workflow? It does cause additional complexity.

Context

In an in-person discussion around package conditions, the question came up about how package conditions should be handled when a new package revision is created from a previous one. Should all/any of the package conditions be removed, or set to Unknown/False?

There are two properties involved here, the .spec.readinessGates and the .status.conditions.

Since the new packagerevision starts out as identical to the existing one (The revision is different of course, but I don't think that should matter here), it seems like we should keep both the readinessGates and the conditions intact for the next revision (if all those conditions were true for the previous revision, they should also be true for this one since it is identical). It feels like the problem is we need some way to give controllers and users time to make changes to a newly created packagerevision (regardless of whether it is created through init, clone or copy), and avoid them being automatically proposed as soon as they are created. Presumably a new packagerevision is created because it needs to be different than the original one. The proposal introduced the config.porch.kpt.dev/auto-lifecycle-mgmt annotation to temporarily manage this, but I think we need a better solution.

Alternatives

We could introduce some kind of delay between a packagerevision is created before it is proposed by the automation, but it is unclear what a reasonable delay should be. Overall, this approach seems a bit fragile.

We could also automatically add an additional readinessGate for new packagerevisions (it would then be cleared on clone/copy), which would prevent a newly created packagerevision from being proposed until other actors have had a chance to update the packagerevision (for example by adding additional readinessGates). But this feels like we have only moved the problem, as we then have to decide who/what is responsible for clearing that readinessGate, and how would that controller/user know when it should happen?

Next steps

I think we need to dig a bit deeper into the Nephio use-case here to see how this should work.

kptdev / kpt