Flesh out the input data model and patterns

bgrant0607 commented 2 years ago

Topic that needs more work.

We've figured out some aspects and requirements of package / function inputs:

Packages are not encapsulated and don't have monolithic package-specific interfaces: https://kpt.dev/guides/rationale
Setters are another form of manually specified parameters, and are not recommended: #3131. We also ran into problems with composition of multiple packages when using setters.
Individual attributes should be able to just be edited in place, since we store the rendered output and update it in place, and functions should "patch" resources in general, rather than blow them away #2528.
KRM functions can take their inputs from a specified "function config", which is either a ConfigMap or a client-side KRM type, such as ApplyReplacements, which we should be able to automatically map to functions #3339.
We've noticed with value transformers #3155, such as set-namespace and set-labels, that input values often need to be copied from their sources to the input structures expected by the functions.
The variant-constructor pattern can use the package name to provide distinct identities for variants.
We've identified the need for a deployment target "context", such as cluster targets #3387, along the lines of kubeconfig, gcloud config, terraform provider config, etc.
Standardization of input types and APIs across packages increases the opportunity for plug-and-play-style automation.
There may be other properties associated with the variant "context" that we need to discover or users need to provide. If the latter, we may need to be able to identify those attributes automatically, so that we can prompt the user. Multiple sources of context are common, such as environment and application.
We need to be able to find sets of input contexts in order to automatically generate corresponding sets of deployment packages. #3347
We need to be able to identify downstream inputs in order to implement a "replay" approach to package upgrades. #3329
We want to support loosely coupled external dependencies, such as an application package like ghost requiring a namespace or a SQL database
Fully dynamic data, like autoscaled replica counts, may not belong in config storage. Another common example is allocated IP addresses, for which service discovery systems and DNS are common. But there may be some that we want to "snapshot" and write to storage. GitOps image updaters are an example.
Some inputs may be reasonably self-contained, such as application config for ConfigMap generation #3119.
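
As a rough illustration of the "patch in place" and value-copying points in the list above, the sketch below treats resources as plain dicts and copies a namespace from a context object into each resource without regenerating it. The context ConfigMap shape, its name `deploy-context`, and the function `set_namespace_from_context` are assumptions made up for this example, not kpt APIs.

```python
import copy


def set_namespace_from_context(resources, context):
    """Copy data.namespace from a context object into each resource.

    Only metadata.namespace is patched; every other field, including any
    manual in-place edits, is carried over unchanged.
    """
    namespace = context["data"]["namespace"]
    patched = []
    for resource in resources:
        updated = copy.deepcopy(resource)  # leave the caller's objects untouched
        updated.setdefault("metadata", {})["namespace"] = namespace
        patched.append(updated)
    return patched


# Hypothetical deployment-target context, expressed as a ConfigMap.
context = {
    "apiVersion": "v1",
    "kind": "ConfigMap",
    "metadata": {"name": "deploy-context"},
    "data": {"namespace": "ghost-prod"},
}

# A resource that a user has already edited in place (replicas: 3).
deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "ghost"},
    "spec": {"replicas": 3},
}

patched = set_namespace_from_context([deployment], context)
```

The point of the design is that the function is a patch: it never rebuilds the Deployment from a template, so the hand-edited `replicas` value survives.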

But we don't have a fully fleshed out model or recommended patterns yet.

kpt isn't the first config tool to encounter these issues. We should look at data-oriented, non-package-parameter-based models for inspiration.

Some examples:

Puppet's facter and hiera. Core facts are kind of like standardized context.
Ansible inventory. Ansible was described as infrastructure as data. Example of separating out input data.
Terraform data sources. Example
Kapitan inventory. Video. Generator reference.
Various runtime parameter stores: Consul, Pulumi config, etc.

Additional thoughts or findings should be posted back here.

cc @justinsb @johnbelamaric @droot @yuwenma

bgrant0607 commented 2 years ago

Related: When gathering inputs, we may need to allow network access: #2450. And probably a way to provide credentials.

bgrant0607 commented 2 years ago

It's also worth mentioning kustomize components: https://github.com/kubernetes-sigs/kustomize/blob/master/examples/components.md https://github.com/kubernetes/enhancements/blob/master/keps/sig-cli/1802-kustomize-components/README.md

johnbelamaric commented 2 years ago

> Related: When gathering inputs, we may need to allow network access: #2450. And probably a way to provide credentials.

Do we need to solve this in the CLI case / with kpt functions? That is, could more complex cases like this instead be handled only in the Porch incarnation of CaD, where we can build controllers that interact with other systems in any way we want? If an interactive CLI-based session has to reach out over the network, it can fail more easily, for example. Also, there are interactions we will never be able to handle that way. For example, imagine that getting an input requires filing a ticket, which a human then responds to. In the controller case, we can handle this sort of arbitrary time delay without any trouble. But it won't work at all in the interactive kpt fn render case.

bgrant0607 commented 2 years ago

@johnbelamaric I don't expect inputs to be generated during the kpt fn render pipeline, in general. The pipeline may consume the inputs. Input generation / gathering likely needs to be decoupled. Interactive forms or prompts are one such example.

Your ticket example is a good one, thanks. If you think of others, post them here.

johnbelamaric commented 2 years ago

A few quick thoughts, all slight variations on "fetch from external system":

Read from a CMDB or other external database
Allocate from IPAM or other external system
Read from another cluster (e.g., get the LB IP of an LB-backed K8s Service that has no DNS entry)
Read from the config of a package on which this depends

johnbelamaric commented 2 years ago

Read from the cloud provider API. For example, if an app is dependent on a DB application, that may be another subpackage, or it may be a cloud provider DB instance (which maybe is provisioned by a separate package, or maybe not).

Not all of these are necessarily only "function inputs". They could simply be ways of setting field values. For the example in the IPAM case, I can imagine a couple different approaches (this applies to others too, probably).

The resources that have an IP address field of course just accept an IP value; they do not have a concept of sourcing that value from anywhere. But we could use a placeholder value and a marker comment. The marker comment could indicate the inputs to the IPAM system. A controller (running in the Porch cluster) could see an unresolved placeholder (or the marker comment could indicate this, to avoid a conflict with that sentinel value), and use the data from the comment (which would be arbitrary from the package point of view, for example: "region", "cluster-name", "package-name") to call out to the IPAM, and get back an allocated IP. This would have to be an idempotent operation.
Another approach would be to use an intermediate resource, which could effectively define an API. So, you have some CR that represents an IPAM request. The controller (or arguably a function) processes that request and stores back the allocated value in a status field. This can then be referenced by the function input by whatever mechanism we come up with for field references. Or, if we support references in field values, it could be placed directly in there.
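
The first approach above (a placeholder value plus a marker comment) could look roughly like the sketch below. The marker syntax `# ipam: key=value,...`, the placeholder `0.0.0.0`, and the in-memory `FakeIpam` are all assumptions invented for illustration; no such convention exists in kpt or Porch today.

```python
import re

# Hypothetical marker: "<key>: <value>  # ipam: region=...,cluster-name=..."
MARKER = re.compile(
    r"^(?P<indent>\s*)(?P<key>[\w-]+):\s*(?P<value>\S+)\s*#\s*ipam:\s*(?P<args>.+)$"
)
PLACEHOLDER = "0.0.0.0"  # assumed sentinel for "not yet allocated"


class FakeIpam:
    """Stand-in for an external IPAM: the same inputs always yield the same IP."""

    def __init__(self):
        self._allocated = {}

    def allocate(self, args):
        key = tuple(sorted(args.items()))
        if key not in self._allocated:
            self._allocated[key] = f"10.0.0.{len(self._allocated) + 10}"
        return self._allocated[key]


def resolve_placeholders(text, allocate):
    """Replace unresolved placeholders, keeping the marker comment intact."""
    out = []
    for line in text.splitlines():
        match = MARKER.match(line)
        if match and match.group("value") == PLACEHOLDER:
            raw_args = match.group("args").strip()
            args = dict(pair.split("=", 1) for pair in raw_args.split(","))
            ip = allocate(args)
            indent, key = match.group("indent"), match.group("key")
            line = f"{indent}{key}: {ip}  # ipam: {raw_args}"
        out.append(line)
    return "\n".join(out)


manifest = (
    "spec:\n"
    "  loadBalancerIP: 0.0.0.0 # ipam: region=us-east1,cluster-name=prod-1"
)
ipam = FakeIpam()
once = resolve_placeholders(manifest, ipam.allocate)
twice = resolve_placeholders(once, ipam.allocate)
```

Running the resolver a second time leaves the text unchanged, which is the idempotency property the approach depends on.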

Reading that over, the second approach is probably more maintainable and flexible.
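
A minimal sketch of the intermediate-resource approach, assuming a made-up `IPAddressRequest` kind in a made-up `example.dev` group. The reconcile step allocates only when the status is empty, and a separate step copies the status value into the consuming resource.

```python
def reconcile(request, allocate):
    """Fill status.address once; later calls are no-ops."""
    status = request.setdefault("status", {})
    if "address" not in status:
        status["address"] = allocate(request["spec"])
    return request


def apply_reference(service, request):
    """Copy the allocated address into the field that needs it."""
    service.setdefault("spec", {})["loadBalancerIP"] = request["status"]["address"]
    return service


# Hypothetical request resource; kind, group, and spec fields are illustrative.
request = {
    "apiVersion": "example.dev/v1alpha1",
    "kind": "IPAddressRequest",
    "metadata": {"name": "ghost-lb"},
    "spec": {"region": "us-east1", "clusterName": "prod-1"},
}
service = {
    "apiVersion": "v1",
    "kind": "Service",
    "metadata": {"name": "ghost"},
    "spec": {"type": "LoadBalancer"},
}

calls = []


def fake_allocate(spec):
    # Stand-in for the call to an external IPAM system.
    calls.append(spec)
    return "10.0.0.10"


reconcile(request, fake_allocate)
reconcile(request, fake_allocate)  # second pass does not allocate again
apply_reference(service, request)
```

Because the request and its result live in a resource with its own schema, the allocation inputs are no longer hidden in a comment and can be validated like any other KRM object.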

bgrant0607 commented 2 years ago

CMDB is an example use case for dynamic inventory in ansible, such as via inventory plugins and inventory scripts.

In addition to querying inputs dynamically, adapting input data locations / schemas to expected function input locations / schemas (or, in the case of IaC, to parameters of off-the-shelf packages) appears to be one of the other core / common issues.
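
To make the schema-adaptation problem concrete, here is a small sketch that copies values from wherever they live in a source record to the locations a function expects. The dotted-path mapping format and the record shape are assumptions for illustration only.

```python
def get_path(obj, path):
    """Read a nested value using a dotted path such as 'env.name'."""
    for key in path.split("."):
        obj = obj[key]
    return obj


def set_path(obj, path, value):
    """Write a nested value, creating intermediate dicts as needed."""
    keys = path.split(".")
    for key in keys[:-1]:
        obj = obj.setdefault(key, {})
    obj[keys[-1]] = value


def adapt(source, mapping):
    """Build a function input from a source record and a path mapping."""
    target = {}
    for source_path, target_path in mapping.items():
        set_path(target, target_path, get_path(source, source_path))
    return target


# Hypothetical inventory record, e.g. as returned by a CMDB query.
inventory = {"env": {"name": "prod"}, "app": {"owner": "blog-team"}}

# Source location -> location expected by a label-setting function's
# ConfigMap-style config (the target shape is assumed here).
mapping = {"env.name": "data.environment", "app.owner": "data.team"}

function_config = adapt(inventory, mapping)
```

Keeping the mapping as data rather than code means it can be stored and reviewed alongside the package, but that is a design choice of this sketch, not something the thread has settled.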

bgrant0607 commented 2 years ago

Example from slack: https://kubernetes.slack.com/archives/C0155NSPJSZ/p1658760504705309

How to provide information to packages automatically.

bgrant0607 commented 2 years ago

The idea of "decorations" was discussed in the app config issue: https://github.com/GoogleContainerTools/kpt/issues/3351#issuecomment-1190399974 https://github.com/GoogleContainerTools/kpt/issues/3351#issuecomment-1192052502

kubectl expose and autoscale are examples of this.

Resource creation might be imperative, but this does raise the issue of using information from resources themselves as function inputs.

In the ghost package, we're experimenting with that approach as a way to propagate the host name: https://github.com/GoogleContainerTools/kpt/pull/3403/files

We could also use the approach to read resource requests and set application resource-dependent settings accordingly: https://github.com/GoogleContainerTools/kpt/issues/3210#issuecomment-1194882090

In order to be understandable there probably needs to be an intuitive source of truth. A potential advantage of the approach is that the source of truth could be well known, as opposed to an input to an arbitrary function. However, if multiple locations disagreed and the source of truth were ambiguous, then the user would need to be asked to resolve the inconsistency, as when providing multiple values in an undiscriminated union.
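
The disagreement case can be sketched as follows: gather the same logical value from several locations, accept it if every populated location agrees, and otherwise surface the conflict so the user can choose. The location names and the `AmbiguousValue` exception are hypothetical.

```python
class AmbiguousValue(Exception):
    """Raised when populated locations disagree about a value."""

    def __init__(self, candidates):
        super().__init__(f"conflicting values: {candidates}")
        self.candidates = candidates


def resolve(candidates):
    """Return the single agreed value, None if unset, or raise on conflict.

    `candidates` maps a location label to the value found there (or None).
    """
    present = {loc: value for loc, value in candidates.items() if value is not None}
    distinct = set(present.values())
    if not distinct:
        return None
    if len(distinct) == 1:
        return distinct.pop()
    raise AmbiguousValue(present)


# Host name found in two places that agree, and unset in a third.
agreed = resolve({
    "Ingress spec.rules[0].host": "blog.example.com",
    "ConfigMap data.url": "blog.example.com",
    "Deployment env HOST": None,
})

# Two places that disagree: the caller would prompt the user here.
try:
    resolve({
        "Ingress spec.rules[0].host": "blog.example.com",
        "ConfigMap data.url": "staging.example.com",
    })
    conflict = None
except AmbiguousValue as err:
    conflict = err.candidates
```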

This approach could have implications for update strategies.

yuwenma commented 2 years ago

A Tekton example from Slack: https://github.com/marniks7/chaos-catalog

Slack message: https://kubernetes.slack.com/archives/C0155NSPJSZ/p1661457969525029?thread_ts=1661311193.053569&cid=C0155NSPJSZ

Another example, for non-KRM files: https://github.com/GoogleContainerTools/kpt/issues/2350#issuecomment-1228000792

bgrant0607 commented 6 months ago

Example from another domain: https://support.microsoft.com/en-us/office/use-mail-merge-for-bulk-email-letters-labels-and-envelopes-f488ed5b-b849-4c11-9cff-932c49474705

kptdev / kpt

Flesh out the input data model and patterns #3396