Disable '"additionalProperties": false' in `values.yaml` schema for cluster apps.

piontec commented 5 months ago

As we learned, some of our apps (so far we've found only some cluster apps) have in their values.schema.json file the property "additionalProperties": false, (example. This means, that if at any time someone will add a new global value to one of our catalog config maps or to the config repo and this value will be pushed to such an app, it won't install, as helm's validation (which can't be disabled if the schema file is present) will fail.

It seems that currently the recommended way is to allow additionalProperties and validate (and show a warning) before passing the config to helm.

CC @marians

erkanerol commented 5 months ago

In cluster-vsphere and cluster-cloud-director, we don't have additionalProperties:false in the root level but it seems we have it for some section of the schema such as internal.

Should we remove it completely from all levels? I am not fully sure.

AndiDog commented 5 months ago

This means we do no more checking of fields and customers + our engineers can easily make typos and mistakes. The schema becomes partially useless. People don't normally look at warnings, and CIs don't look at warnings at all. We also shouldn't add new, special validations for this case because then we're not using standard Helm commands anymore which may hinder our migration to pure Helm in the future.

Hence, here are some alternative suggestions:

In the cluster-xyz chart schema, reference a "Giant Swarm shared" JSON schema (e.g. by "$ref": "https://raw.githubusercontent.com/giantswarm/our-new-shared-schema/main/cluster-chart.schema.json"). We already previously had this idea of sharing schemas (in #wg-cluster-app-schema, I guess). Make affected teams CODEOWNER in that new repo so that breaking changes are avoided by having more reviewers (e.g. Honeybadger + Turtles + Phoenix).
Worse than the above idea: add a global.shared object which has additionalProperties=true so anything could go in there. Any new shared fields can be added in there. This still provides no validation at all, so I don't like this idea very much.

erkanerol commented 5 months ago

When we set additionalProperties=false, we will not be able to catch typos or wrong keys if they are not marked as required.

When we keep additionalProperties=true, we have to upgrade all cluster-<provider> apps to the latest before releasing a change in the GitOps or an operator.

I believe the latter is worse than the former one. For example, when you need to add a new parameter to the configmap created by cluster-apps-operator, you don't want to have to upgrade all clusters in advance. That would slow us down a lot.

nprokopic commented 4 months ago

TL;DR We should make a clear distinction between configuration of "regular" apps (that deploy controllers/operators) and cluster-\<provider> (and default-apps-\<provider>) apps. They are used for different things and "global" configs (changed "globally") have quite different impacts. Universal config for "regular" apps (that deploy controllers/operators) should IMO not be universal for and shared with cluster-\<provider> (and default-apps-\<provider>) apps as well.

any time someone will add a new global value to one of our catalog config maps or to the config repo

When it comes to config being added to catalog config maps and to config repo, we should really make a clear distinction between "regular" apps (that deploy controllers/operators) and cluster-\<provider> apps that contain CAPI+CAPx+HelmRelease+App resources (and default-apps-\<provider> apps that go hand in hand with cluster-\<provider> apps and will be soon merged into cluster-\<provider> apps).

"Regular" apps, those that deploy controllers/operators:

They come from the control-plane, giantswarm, or default catalog.
- There is a great benefit to have a single place to push config for all apps from these catalogs, because here we're talking about many dozens of apps at least (or 100+ maybe?).
They get configured from the config repo.
Updating a config of some app will just trigger an update of that app. One app might (or not) restart and that's it.
- This is perfectly fine on the MC.
- It's a bit less fine on the WC (depending on which app we're talking about and if the app got restarted), but at least the impact is limited to a single app (or multiple apps, depending on how many apps use that config).

On the other hand, cluster- apps (default-apps-\<provider> apps will get merged into cluster-):

They come from the cluster catalog.
- Only cluster-\<provider> apps and default-apps-\<provider> apps come from this catalog.
- We have just 4 providers, so just 4 pairs of cluster-\<provider> + default-apps-\<provider> (soon just 4 cluster-\<provider> apps), so it's quick and easy to update these when needed, so IMO here we don't waste doing doing needed changes.
- The "regular" apps that we deploy from these by applying HelmRelease/App on the MC (e.g. cilium, core-dns, external-dns, cert-manager, etc) still come from "regular" app catalogs mentioned above (control-plane, giantswarm, or default catalog), so catalog config maps for those catalogs still work regularly for the "regular" apps that we deploy here from cluster-\<provider> apps and default-apps-\<provider> apps.
They don't get configured from the config repo.
Updating config of a cluster-\<provider> app can easily trigger a rolling update of a workload cluster.
- e.g. registry config for k8s components is in KubeadmControlPlane (resource in cluster-\<provider> app), so changing that will trigger a rolling update of a workload cluster.
- I don't think that we want to enable that global config changes (for all apps) trigger rolling update for almost all of our customer's workload clusters.

Now when it comes to cluster-\<provider> apps schema:

This means we do no more checking of fields and customers + our engineers can easily make typos and mistakes. The schema becomes partially useless. People don't normally look at warnings, and CIs don't look at warnings at all. We also shouldn't add new, special validations for this case because then we're not using standard Helm commands anymore which may hinder our migration to pure Helm in the future.

💯 what Andreas wrote. JSON schema is natively integrated into Helm and this is very helpful. It gives us a strong validation and it's our 1st line of defence from incorrectly configured workload clusters. With JSON schema we can write simple Helm templates, because we can rely on the JSON schema validation and assume that Helm values that we are using in the template is just correct. Templates become easier to understand and maintain that way.

AFAIK cluster-\<provider> apps (and default-apps-\<provider>) get values injected from these places only:

cluster catalog config map. The values we have here are:
- provider: not used anywhere and useless, because cluster-\<provider> apps (and default-apps-\<provider>) know which provider they are.
- baseDomain: currently we use the same baseDomain for all workload clusters in a single MC, but in CAPI AFAIK we want to move away from this, and enable customers to have a separate base domain for every WC (if they want). So base domain can be specified on cluster creation, but even if it stays in cluster catalog config map it's an immutable config that cannot change.
- managementCluster: as far as I could find, just used in Teleport config. Also this is immutable config.
<cluster name>-cluster-values config map that is created in the cluster-apps-operator.
- Once we merge default-apps-\<provider> into cluster-\<provider>, a lot (most) of the config that is currently set in <cluster name>-cluster-values will be easy to set directly in the cluster-\<provider> apps (so an operator will not have to set those values).
- Most importantly, it's the "regular" apps from cluster-\<provider> apps and default-apps-\<provider> that need <cluster name>-cluster-values, cluster-\<provider> apps and default-apps-\<provider> themselves do not need <cluster name>-cluster-values.

So the only really useful/used config that gets injected into cluster-\<provider> apps and default-apps-\<provider> are base domain and management cluster name that get injected from cluster catalog config map, and both are immutable. So even this is more about defaulting some immutable value (for the sake of convenience I guess), and not about making something configurable.

When we keep additionalProperties=true, we have to upgrade all cluster- apps to the latest before releasing a change in the GitOps or an operator.

@erkanerol In these cases, which IMO should be very rare, AFAIK in cluster-apps-operator you can always check for cluster-\<provider> and default-apps-\<provider> version, and set the required values based on those versions, which IMO we should be even doing now, so we don't mutate existing workload clusters config with an MC component change. This problem existing here is IMO an architecture smell.

T-Kukawka commented 4 months ago

i am having hard to to conclude what shall we do here?

nprokopic commented 2 months ago

i am having hard to to conclude what shall we do here?

When it comes to cluster-\<provider> apps, IMO we should not do anything :)

To quote my TL;DR from above:

TL;DR We should make a clear distinction between configuration of "regular" apps (that deploy controllers/operators) and cluster-\<provider> (and default-apps-\<provider>) apps. They are used for different things and "global" configs (changed "globally") have quite different impacts. Universal config for "regular" apps (that deploy controllers/operators) should IMO not be universal for and shared with cluster-\<provider> (and default-apps-\<provider>) apps as well.

giantswarm / roadmap

Disable '"additionalProperties": false' in `values.yaml` schema for cluster apps. #3187