bgrant0607 commented 2 years ago

This is related to #3119, but deserves its own issue.

In application-related resources, application configuration often constitutes a large proportion of the overall configuration size.

Application configuration is special in multiple ways:

Attributes can't be derived from well known KRM resource types
Many different formats, which are not KRM and sadly not as standardized as they could be
No explicit schema that kpt has access to

Command-line flags are evil, so I'll punt on them for now, other than using env var substitution to define their values.

Env vars are about the best case. Kustomize has support for generating ConfigMaps from env files, and Kubernetes can inject them as envvars. And, if represented natively in a ConfigMap or in a pod template, then they are KRM and could be edited as such. There's still no native schema though (https://github.com/kubernetes/kubernetes/issues/4210). A command for editing env vars would also be nice.
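As a rough illustration of the env-var case, here is a minimal sketch of turning an env file into a ConfigMap and editing one variable as structured data. The parser is deliberately simplified and is not kustomize's implementation; the function names (`parse_env_file`, `env_to_configmap`, `set_env_var`) and the sample values are assumptions made up for this example.

```python
import json


def parse_env_file(text):
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    data = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"not a KEY=VALUE line: {raw!r}")
        data[key.strip()] = value.strip()
    return data


def env_to_configmap(name, text):
    """Wrap the parsed variables in a ConfigMap-shaped dict."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": name},
        "data": parse_env_file(text),
    }


def set_env_var(configmap, key, value):
    """Return a copy of the ConfigMap with one variable set (hypothetical helper)."""
    updated = dict(configmap)
    updated["data"] = {**configmap["data"], key: str(value)}
    return updated


ENV_TEXT = """\
# database settings
DB_HOST=mariadb
DB_PORT=3306
"""

cm = env_to_configmap("app-env", ENV_TEXT)
cm2 = set_env_var(cm, "DB_PORT", 3307)
print(json.dumps(cm2, indent=2))
```

Because each variable is its own key, an "edit env var" command reduces to a single-key update on the ConfigMap.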

I haven't looked for any kind of data, but presumably there are some relatively common file formats, such as INI, TOML, Spring Boot properties, etc.

A common, rational instinct is to normalize such formats into a universal, simpler structured form, generally a simple map or nested map. The most common approach is templating and template parameters, with all the consequences that implies. It's less terrible than other uses of templating if one views config files of unknown formats as just unstructured text, but does feel suboptimal. For instance, anyone familiar with how an application is configured would then need to learn the new representation and how it maps to the application-native one, since often syntax, capitalization, etc. are different. It also frequently requires insertion of conditional logic to handle the presence or absence of properties. With some formats, such as JSON, it is particularly challenging to ensure that the output is valid.
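To make the JSON validity problem concrete, here is a small sketch contrasting text templating with structured serialization. The template string and the field names (`name`, `replicas`) are invented for illustration and do not come from any particular chart.

```python
import json
from string import Template

# Text templating: the template knows nothing about JSON escaping rules.
TEMPLATE = Template('{"name": "$name", "replicas": $replicas}')


def render_with_template(name, replicas):
    return TEMPLATE.substitute(name=name, replicas=replicas)


def render_structured(name, replicas):
    """Build the data structure first and let the serializer handle escaping."""
    return json.dumps({"name": name, "replicas": replicas})


def is_valid_json(text):
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


tricky = 'my "quoted" app'
print(is_valid_json(render_with_template(tricky, 3)))
print(is_valid_json(render_structured(tricky, 3)))
```

The templated output only stays valid as long as every parameter value happens to need no escaping, which is the kind of hidden constraint that makes templating fragile.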

For a variety of reasons, we rejected several proposals to support templating in Kubernetes itself (e.g., https://github.com/kubernetes/kubernetes/issues/30716, https://github.com/kubernetes/kubernetes/issues/89738, https://github.com/kubernetes/kubernetes/issues/96346).

We investigated this issue some when we were designing ConfigMap (https://github.com/kubernetes/kubernetes/issues/1553, https://github.com/kubernetes/kubernetes/issues/2068).

I wonder if we could do something with http://augeas.net/index.html "Augeas is a configuration editing tool. It parses configuration files in their native formats and transforms them into a tree. Configuration changes are made by manipulating this tree and saving it back into native config files."

We would like to provide a similar WYSIWYG transformation and editing experience for application configuration as for KRM resources, at least for a subset of common formats. We could even recommend an automation-friendly format for people writing their own applications.

This affects ~all the functionality of kpt: update merging, diffs, source and sink, function SDKs, the UI.

For example, we also need to be able to do granular merging during updates, in the original non-KRM config file, and then ensure that any ConfigMaps they are embedded into are updated (#3119).
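One way to picture granular merging is a key-level three-way merge over the parsed config, rather than a line-level text merge. The sketch below assumes the file has already been parsed into a flat key-value map; the conflict policy (local edits win, and the key is reported) is an assumption for illustration, not kpt's actual update behavior.

```python
def three_way_merge(base, local, upstream):
    """Merge flat key-value maps key by key.

    base:     the config as originally fetched
    local:    the config with the user's edits
    upstream: the new upstream version
    Returns (merged, conflicts). Missing keys are treated as None.
    """
    merged = {}
    conflicts = []
    for key in sorted(set(base) | set(local) | set(upstream)):
        b, l, u = base.get(key), local.get(key), upstream.get(key)
        if l == u:
            value = l            # both sides agree
        elif l == b:
            value = u            # only upstream changed (or deleted) the key
        elif u == b:
            value = l            # only local changed (or deleted) the key
        else:
            conflicts.append(key)
            value = l            # assumed policy: keep the local edit
        if value is not None:
            merged[key] = value
    return merged, conflicts


base = {"port": "3306", "cache": "64M"}
local = {"port": "3307", "cache": "64M"}
upstream = {"port": "3306", "cache": "128M", "threads": "8"}

merged, conflicts = three_way_merge(base, local, upstream)
print(merged, conflicts)
```

Here the local port change and the upstream cache and threads changes all survive, which a whole-file overwrite would not manage.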

justinsb commented 2 years ago

One thing that Craig Box got me noodling about ... does yaml matter to kpt / to kubernetes? It clearly doesn't really matter; it's just a representation that we've decided upon.

Craig (jokingly?) suggested INI files as an alternative to yaml, and perhaps that is the path here. When we write configuration in INI or toml, we are actually setting values in a configuration object. That configuration object doesn't allow all keys, and has various restrictions on the values of those keys. In other words, even though we're writing in a different "expression" language, we could imagine writing an OpenAPI spec to describe the schema of the configuration.

This suggests we could think about writing a set of transformation functions from instances of CRDs to the various common configuration file formats. By doing so, we bring legacy configuration into the better-structured world of kubernetes and KRM.

We could do so either as a client-side object or as a true CRD with an operator.

This doesn't obviously solve #3119, so I'd imagine we would start client-side.

bgrant0607 commented 2 years ago

An example of an application with lots of configuration is kafka: https://github.com/bitnami/charts/blob/master/bitnami/kafka/values.yaml#L93 https://github.com/mesosphere/dcos-kafka-service/blob/master/frameworks/kafka/universe/config.json

Similar to the overall approach to WYSIWYG configuration, I wouldn't want to abstract the application configuration. For instance, as a user or developer I'd expect it to match what I saw in the code or development environment or documentation: https://kafka.apache.org/documentation/#configuration

So, yes, some apps would express configuration in INI or TOML.

This is where something like Augeas is interesting. "Augeas is a configuration editing tool. It parses configuration files in their native formats and transforms them into a tree." Looking at http://augeas.net/docs/augeas.pdf, the idea sounds very close to what we would want. Like a pluggable source/sink for specific non-KRM file types.

bgrant0607 commented 2 years ago

With #3118, we wouldn't technically need a custom source/sink. We'd still need custom parsing, marshaling, and visualization, though.

bgrant0607 commented 2 years ago

As a concrete example that would address a segment of applications, we looked at Spring Boot config (application.properties) in the early days of the kpt project, but it looks like the demo video recordings don't exist any more. This post discusses it: https://www.springboottutorial.com/spring-boot-application-configuration

bgrant0607 commented 2 years ago

One specific category of application configuration is resource-dependent configuration: VM heap size, thread pool sizes, simultaneous connections, cache sizes, etc. Network- and disk-intensive applications often have a number of these tunable settings.

A number of legacy applications and even language runtimes are not container-aware. As an example, before Java was container-aware, additional automation was necessary that is no longer needed with more recent versions of the JDK.

Ideally these settings would be derived from container resource limits, either at run time, such as using an init container, or an application-specific function, which would be lighter weight than either an Operator or admission controller.
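As a sketch of what such an application-specific function might compute, the following derives a JVM heap flag from a container's memory limit. The 75% fraction, the function names, and the subset of quantity suffixes handled are assumptions for illustration; a real function would need the full Kubernetes quantity grammar.

```python
# Only a subset of Kubernetes quantity suffixes is handled in this sketch.
BINARY = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}
DECIMAL = {"k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}


def parse_memory(quantity):
    """Convert a memory quantity such as '512Mi' or '1G' to bytes."""
    for suffix, factor in {**BINARY, **DECIMAL}.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(quantity)  # plain byte count


def jvm_heap_flag(container, fraction=0.75):
    """Derive -Xmx from the container memory limit (fraction is an assumption)."""
    limit = container["resources"]["limits"]["memory"]
    heap_mib = int(parse_memory(limit) * fraction) // 2**20
    return f"-Xmx{heap_mib}m"


container = {
    "name": "app",
    "resources": {"limits": {"memory": "1Gi"}},
}

print(jvm_heap_flag(container))
```

The same pattern would apply to thread pool or cache sizes: read the limit from the pod template, apply an app-specific rule, and write the result into the app's config or environment.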

cc @johnbelamaric

johnbelamaric commented 2 years ago

I like an app-specific function, especially if it is written in something like Starlark that does not require building and maintaining and coordinating versioning for a separate container image. An init container or custom Go function would require that.

bgrant0607 commented 2 years ago

This video discusses an in-pod templating approach using init containers, which is a variation on the entrypoint.sh script approach: https://youtu.be/eJmNSYvelSw?t=1087

I'm liking the Augeas idea, though. If we could convert lots of config formats to a canonical form in kpt fn source, we could manipulate the canonical form and write it back using kpt fn sink.
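A minimal sketch of that source, manipulate, sink round trip for one format (INI), using only Python's standard `configparser`. The nested-dict "tree" here is a stand-in for whatever canonical form is eventually chosen, and the function names are hypothetical. Note that `configparser` drops comments, so this does not preserve the file byte-for-byte.

```python
import configparser
import io


def _new_parser():
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # preserve key capitalization as written
    return parser


def ini_to_tree(text):
    """'Source' step: parse native INI text into a nested dict."""
    parser = _new_parser()
    parser.read_string(text)
    return {section: dict(parser[section]) for section in parser.sections()}


def tree_to_ini(tree):
    """'Sink' step: write the nested dict back out as INI text."""
    parser = _new_parser()
    parser.read_dict(tree)
    out = io.StringIO()
    parser.write(out, space_around_delimiters=False)
    return out.getvalue()


NATIVE = "[mysqld]\nport=3306\nmax_connections=100\n"

tree = ini_to_tree(NATIVE)
tree["mysqld"]["port"] = "3307"  # edit the tree, not the text
print(tree_to_ini(tree))
```

The key names in the tree are exactly the ones in the native file, so someone reading the application's documentation does not have to learn a second vocabulary.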

bgrant0607 commented 2 years ago

https://osquery.io/ apparently integrates with Augeas. https://www.uptycs.com/blog/using-augeas-with-osquery-how-to-access-configuration-files-from-hundreds-of-applications

That's read-only, for queries.

Puppet integrates it also, for setting values: https://puppet.com/docs/puppet/5.5/resources_augeas.html

And there's Go integration: https://dev.to/raphink/configuration-surgery-with-go-structure-tags-12a4

bgrant0607 commented 2 years ago

More examples:
https://ghost.org/docs/config/
https://dev.mysql.com/doc/refman/8.0/en/server-configuration-defaults.html
https://www.postgresql.org/docs/current/config-setting.html#CONFIG-SETTING-CONFIGURATION-FILE
https://www.rabbitmq.com/configure.html
https://redis.io/docs/manual/config/
https://www.nginx.com/resources/wiki/start/topics/examples/full/
https://prometheus.io/docs/prometheus/latest/configuration/configuration/
https://etcd.io/docs/v3.4/op-guide/configuration/
https://www.vaultproject.io/docs/configuration
https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html
https://wpmudev.com/blog/wordpress-wp-config-file-guide/ (php may be too hard)
https://www.drupal.org/docs/configuration-management/managing-your-sites-configuration

We can look through charts for more examples: https://github.com/bitnami/charts/tree/master/bitnami

selfmanagingresource commented 2 years ago

Some discussion in our kpt office hours

bgrant0607 commented 2 years ago

List of formats it looks like we need support for:

INI: https://github.com/go-ini/ini
Properties: https://github.com/magiconair/properties
JSON: https://pkg.go.dev/encoding/json
YAML: https://github.com/go-yaml/yaml
XML: https://pkg.go.dev/encoding/xml
TOML: https://github.com/pelletier/go-toml
env: https://github.com/caarlos0/env (?)
Line-delimited text, as a fallback for non-data-format cases like php, sql, lisp, etc.

This is not a lot of formats. They all have Go implementations with permissive open-source licenses, though they may not preserve comments and whitespace.

I like what Augeas has done, but most of the 300 formats it supports are for system files, which we don't need, so it would probably be easiest for us to develop our own implementation and canonical representation. We would want the mechanism to be similarly pluggable.

We will want to be able to infer the format, such as from file extension and/or trying to parse the file, with a fallback for the user to be able to specify the format.
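A sketch of that inference order: explicit user override first, then file extension, then trial parsing, then the line-delimited text fallback. The extension table and the format names are assumptions, and only JSON and INI are probed here because they have parsers in the Python standard library.

```python
import configparser
import json
import os

# Hypothetical extension table; a real one would be pluggable.
EXTENSIONS = {
    ".ini": "ini",
    ".cnf": "ini",
    ".json": "json",
    ".properties": "properties",
    ".toml": "toml",
    ".yaml": "yaml",
    ".yml": "yaml",
    ".xml": "xml",
}


def _parses_as_json(text):
    try:
        return isinstance(json.loads(text), (dict, list))
    except json.JSONDecodeError:
        return False


def _parses_as_ini(text):
    parser = configparser.ConfigParser(interpolation=None)
    try:
        parser.read_string(text)
    except configparser.Error:
        return False
    return bool(parser.sections())


def infer_format(filename, text, override=None):
    """Return a format name, preferring the user's explicit choice."""
    if override:
        return override
    ext = os.path.splitext(filename)[1].lower()
    if ext in EXTENSIONS:
        return EXTENSIONS[ext]
    if _parses_as_json(text):
        return "json"
    if _parses_as_ini(text):
        return "ini"
    return "text"  # line-delimited fallback


print(infer_format("my.cnf", "[mysqld]\nport=3306\n"))
print(infer_format("config", '{"port": 3306}'))
print(infer_format("setup.php", "<?php echo 1;"))
```

Trial parsing is inherently ambiguous (a file can be valid in more than one format), which is why the explicit override stays at the top of the order.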

bgrant0607 commented 2 years ago

Because https://github.com/kubernetes/kubernetes/issues/831 was never done, the configuration needs to be in a ConfigMap in order for it to be injected into the application in a straightforward manner.

Options we discussed today for how to represent app config:

1. Only represented in the native format (INI, etc.) in the package. This would require apply-time conversion to a ConfigMap, which is kind of similar to what is sometimes done for Secrets. It would also require translation to/from a canonical internal representation for update, diff, KRM function format (kpt fn source and sink), KRM function SDKs, the UI, and anything else manipulating configuration (e.g., a command like jx gitops yset). This experience would be most similar to the current kustomize experience, for kustomize users that don't commit the kustomize build output.
2. Only represented in KRM, analogous to the internal format mentioned above, in the package. This requires a migration tool to convert from/to the native format and a way to translate it into the native format for consumption by the application, such as apply-time translation and wrapping in a ConfigMap, or apply-time wrapping in a ConfigMap and runtime translation in an init container or controller.
3. Represented in both the native format and KRM in the package, with one of them as the source of truth and translation performed eagerly.

The advantage of the application's native format as the source of truth (option 1 or 3) is easier compatibility with the existing application ecosystem(s), without frequent format migrations: reference documentation, tutorials, samples, generators, editors, IDE plugins, Augeas plugins, etc. For instance, here's a mariadb config I could copy/paste: https://www.ibm.com/docs/en/ztpf/1.1.0.15?topic=collection-mariadb-configuration-file-example

I personally don't have a problem with option 3, but it would be useful to get feedback from actual users.

For all the options, our tooling would manipulate our canonical representation.

The problem of a lack of a schema exists for all the options. We'd design the schema to match our canonical format regardless of which option we picked.

Option 1 requires more conversions back and forth by kpt. Option 2 requires more conversions back and forth by the user. Option 3 is the simplest and most flexible, but possibly harder to understand.

bgrant0607 commented 2 years ago

An example of toml embedded in helm chart values: https://github.com/influxdata/helm-charts/blob/master/charts/telegraf/templates/configmap.yaml https://github.com/influxdata/telegraf/tree/master/plugins/ and one opinion on that experience: https://youtu.be/LBCmMTofNxw?t=1937

johnbelamaric commented 2 years ago

I suspect all three will be needed, but in order of preference, I find Option 3 more aligned with the vision, for a couple of reasons:

I think apply-time transformations should be avoided when possible, to ensure the integrity of the storage vs live state comparison.
Option 2 is painful, as evidenced by the opinion expressed above. But for very simple cases, it could be useful.

For Option 3, we can make it more palatable with a convention to identify the generated ConfigMaps. We have also discussed management of historical ConfigMaps so this fits in pretty well with that concept. For example, a particular annotation or even storing them in a special directory. A couple other considerations, that perhaps should be discussed on #3119 are: 1) how to combine multiple non-KRM files into a single ConfigMap; 2) how to name, annotate, label, etc. the ConfigMap. I am imagining a "stub" ConfigMap such that functions take in that CM, the raw file resource, and a key name.
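To illustrate the "stub" idea, here is a sketch of a function that takes the stub ConfigMap, a raw file, and a key name, and returns the generated ConfigMap. The annotation key `example.dev/generated-from` and the function name `embed_file` are hypothetical; the actual convention would be decided on #3119.

```python
import copy

# Hypothetical annotation used to mark which files a ConfigMap was generated from.
GENERATED_ANNOTATION = "example.dev/generated-from"


def embed_file(stub, filename, content, key=None):
    """Return a copy of the stub ConfigMap with the file embedded under `key`."""
    cm = copy.deepcopy(stub)
    cm.setdefault("data", {})[key or filename] = content
    annotations = cm.setdefault("metadata", {}).setdefault("annotations", {})
    sources = [s for s in annotations.get(GENERATED_ANNOTATION, "").split(",") if s]
    if filename not in sources:
        sources.append(filename)
    annotations[GENERATED_ANNOTATION] = ",".join(sources)
    return cm


stub = {
    "apiVersion": "v1",
    "kind": "ConfigMap",
    "metadata": {"name": "mariadb-config", "labels": {"app": "mariadb"}},
}

cm = embed_file(stub, "my.cnf", "[mysqld]\nport=3306\n")
cm = embed_file(cm, "extra.cnf", "[client]\nport=3306\n", key="extra")
print(cm["metadata"]["annotations"])
```

Naming, labels, and other metadata come from the stub, so they can be edited like any other KRM resource, while calling the function once per file covers the multiple-files-in-one-ConfigMap case.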

bgrant0607 commented 2 years ago

I think what @yuwenma demoed was essentially option 2: represent the configuration in a canonical KRM format in the package. But instead of adapting the format in the apply step, it used a ConfigMap with granular key-value pairs as the canonical format and added an init container to convert that to INI for the application.

johnbelamaric commented 2 years ago

Agreed. What I missed in your description of 3 above was that we would store in the canonical format - I was reading it as representing the native format and the generated ConfigMap(s), treating the canonical format as an intermediate in-memory representation. So we actually have three different formats: native, canonical, and generated ConfigMap. Which expands the options a bit, as to storing which subset of these three.

johnbelamaric commented 2 years ago

The other point we need to consider is the source of truth. Clearly the generated ConfigMaps are not it. So it leaves the native and canonical formats. If we store the canonical format, then we will have some confusion as to which is SoT.

Another way to think about SoT is to make it an opinionated pipeline of overrides. The native format - the one most easily edited by humans - is the input to the pipeline, which then may override values in that input to produce the final ConfigMap. This works pretty well for the simple case of an independent file and is straightforward: I edit the native file, but my fn render pipeline may tweak it further and rewrite the file. If we store the canonical format too, I think it muddies these waters.
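A sketch of the "input with pipeline overrides" model, using a simplified properties-style file. The parser handles only plain `key=value` lines (real properties files also allow other separators, escapes, and line continuations), and `set_value` and `run_pipeline` are hypothetical names rather than kpt functions.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and '#' or '!' comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def render_properties(props):
    return "".join(f"{key}={value}\n" for key, value in props.items())


def set_value(key, value):
    """Build a pipeline step that overrides one key."""
    def step(props):
        return {**props, key: value}
    return step


def run_pipeline(native_text, steps):
    """Native file in, overrides applied in order, native file out."""
    props = parse_properties(native_text)
    for step in steps:
        props = step(props)
    return render_properties(props)


NATIVE = "server.port=8080\nlogging.level.root=INFO\n"
rendered = run_pipeline(NATIVE, [set_value("server.port", "9090")])
print(rendered)
```

The human-edited file remains the only stored input; the parsed form exists just long enough for the pipeline steps to run.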

This method doesn't preclude us being smart about the updates to the native files by internally parsing them to the canonical format, nor does it preclude us using that canonical format to present edits in the UI. Those updates and UI-based edits are subject to being overridden by the pipeline, of course.

It gets tricky when we have inputs that are interrelated between the config file and other resources, though. For example, if we change the port in the native file, does that propagate through to the Service port? Or vice-versa? While the "input with pipeline overrides" doesn't solve this problem, I think that's OK. This is actually the same problem we have for any other resources wrt SoT; the input just happens to be in a different format.

bgrant0607 commented 2 years ago

Ooh, I like the idea of storing generated objects in a subdirectory. That might be a useful pattern for generators more generally, especially in the case that post-generation edits aren't feasible: #2528.

Something I proposed in slack: Any applications that can specify config via environment variables should probably do so for now. The ConfigMap with granular key-value pairs could serve as the canonical format. Though it's not quite the native env file format that could be sourced by the shell (added to list above), it should be familiar to Kubernetes users.

bgrant0607 commented 2 years ago

Regarding 3 formats: fair point.

johnbelamaric commented 2 years ago

#3422 is relevant to this discussion

bgrant0607 commented 2 years ago

This PR has an example possible canonical format using granular, flattened key-value pairs (similar to Augeas's internal format) in a ConfigMap: https://github.com/GoogleContainerTools/kpt-samples/pull/11/files

The python program that converts the corresponding env vars to the app's native INI format, which runs as an init container, is in that PR also. Presumably there's also a program that converts INI to the canonical format. Here there are only 2 formats because the canonical format is fed directly to the init container as opposed to generating a ConfigMap with an embedded INI file.
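For readers who have not opened the PR, here is an independent sketch of what a flattened key-value canonical form and its converters could look like. It is not the code from that PR: the `section.key` naming and the function names are assumptions. A dot is used as the separator because ConfigMap keys are limited to alphanumerics, `-`, `_`, and `.`.

```python
import configparser
import io

SEP = "."  # hypothetical separator; breaks if a section name itself contains it


def ini_to_flat(text):
    """Flatten INI text into {'section.key': value} pairs."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep key capitalization
    parser.read_string(text)
    return {
        f"{section}{SEP}{key}": value
        for section in parser.sections()
        for key, value in parser[section].items()
    }


def flat_to_ini(flat):
    """Rebuild INI text from flattened pairs (what an init container might run)."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str
    for path, value in flat.items():
        section, _, key = path.partition(SEP)
        if not parser.has_section(section):
            parser.add_section(section)
        parser.set(section, key, value)
    out = io.StringIO()
    parser.write(out, space_around_delimiters=False)
    return out.getvalue()


INI = "[mysqld]\nport=3306\nbind-address=0.0.0.0\n[client]\nport=3306\n"

flat = ini_to_flat(INI)
configmap = {
    "apiVersion": "v1",
    "kind": "ConfigMap",
    "metadata": {"name": "mariadb-config"},
    "data": flat,
}
print(flat_to_ini(configmap["data"]))
```

With this shape, each setting is an individually addressable ConfigMap entry, at the cost of no longer looking like the application's own file.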

yuwenma commented 2 years ago

One big advantage of option 3: once users accept the idea of using a canonical format to represent their non-KRM app config, they can build logic between the non-KRM files and their k8s resources directly, which gives them more flexibility to mutate and validate the package as a whole.

For example, by writing a simple KRM validator function, the platform developer can guarantee the MariaDB port number in INI file is the same as the Ghost deployment database port number. Right now, the most feasible way to do this is to use multi-line setters (not sure if it still works or not), which is the opposite of what we want.
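A sketch of the kind of cross-resource check such a validator could perform. This is plain Python rather than a function built with a KRM function SDK, and the env var name `DB_PORT`, the `mysqld` section, and the function names are hypothetical placeholders for whatever the package actually uses.

```python
import configparser


def ini_port(ini_text, section="mysqld", key="port"):
    """Read the database port from the native INI file."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    return parser.get(section, key)


def deployment_env(deployment, name):
    """Return the first value of env var `name` across all containers, or None."""
    containers = deployment["spec"]["template"]["spec"]["containers"]
    for container in containers:
        for env in container.get("env", []):
            if env["name"] == name:
                return env.get("value")
    return None


def validate_ports(ini_text, deployment, env_name="DB_PORT"):
    """Return a list of error messages; an empty list means the check passed."""
    db_port = ini_port(ini_text)
    app_port = deployment_env(deployment, env_name)
    if app_port is None:
        return [f"{env_name} is not set on the Deployment"]
    if app_port != db_port:
        return [f"port mismatch: INI has {db_port}, {env_name} is {app_port}"]
    return []


MARIADB_INI = "[mysqld]\nport=3306\n"
ghost = {
    "kind": "Deployment",
    "spec": {"template": {"spec": {"containers": [
        {"name": "ghost", "env": [{"name": "DB_PORT", "value": "3307"}]},
    ]}}},
}

print(validate_ports(MARIADB_INI, ghost))
```

The check only works because both sides are available as structured data; against an opaque multi-line string the validator would have to re-parse the file itself.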

I really like the summary that "Option 1 requires more conversions back and forth by kpt. Option 2 requires more conversions back and forth by the user. Option 3 is the simplest and most flexible, but possibly harder to understand." For now, I'm leaning towards Option 1 because it gives the best user experience to get started. Only once they do can we "get feedback from actual users."

bgrant0607 commented 7 months ago

There's also still the configmap rollout issue. https://github.com/kubernetes/kubernetes/issues/22368

kptdev / kpt

Develop a way to handle application configuration #3210
