k0sproject / k0s

k0s - The Zero Friction Kubernetes
https://docs.k0sproject.io

Better handling of unsupported combos in dynamic configuration #4725

Open twz123 opened 2 months ago

twz123 commented 2 months ago

Is your feature request related to a problem? Please describe.

Some k0s configuration options are mutually exclusive, and others cannot be changed after cluster creation. Currently, there is minimal sanity checking for dynamic configuration. As a result, unsupported configuration combinations can break a cluster completely. Debugging such problems is hard, because helpful error messages only appear during cluster creation. See #4721 for an example.

Describe the solution you would like

  1. Prevent invalid configurations from being stored in the cluster: Add CEL (Common Expression Language) validation rule markers to the various ClusterConfig structs. The API server can then perform the validation itself and reject invalid configurations before they are stored in the cluster.

  2. Graceful handling of unsupported configuration values. Depending on the effectiveness of point 1, there are several options:

    1. If the configuration validation fails, k0s doesn't reconcile the configuration at all, and the components remain on the last valid configuration. This is easy and straightforward to implement. While tempting, this could undermine the reconciliation of other valid and safe configuration parts, and is therefore only a good choice if point 1 is effective, and invalid configurations stored within a cluster can be considered a pathological edge case.

    2. Try to get as close to the desired configuration as possible without breaking the cluster. This might involve comparing an invalid desired configuration to the last valid one and "resolving" it into a "fixed" target configuration that passes validation and can be safely reconciled. This is a more elaborate and error-prone approach, and may not be necessary if point 1 proves effective.
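
To make option 2.1 more concrete, here is a minimal sketch of a reconciler that stays on the last valid configuration when validation fails. Everything in it is an assumption made for illustration: the reduced `ClusterConfig` struct, the `validate` function, and the `Reconciler` type are hypothetical and do not reflect the actual k0s code.

```go
package main

import (
	"errors"
	"fmt"
)

// ClusterConfig is a hypothetical, heavily reduced stand-in for the k0s
// cluster configuration. The two fields exist for illustration only.
type ClusterConfig struct {
	NetworkProvider string
	PodCIDR         string
}

// validate is a hypothetical check that compares the desired configuration
// against the last configuration that was reconciled successfully.
func validate(last *ClusterConfig, desired ClusterConfig) error {
	if desired.NetworkProvider == "" {
		return errors.New("network provider must be set")
	}
	if last != nil && last.NetworkProvider != desired.NetworkProvider {
		return fmt.Errorf("network provider cannot be changed from %q to %q",
			last.NetworkProvider, desired.NetworkProvider)
	}
	return nil
}

// Reconciler remembers the last configuration that passed validation.
type Reconciler struct {
	lastValid *ClusterConfig
}

// Reconcile returns the configuration the components should run with.
// If the desired configuration is invalid, the last valid configuration is
// returned unchanged together with the validation error, so nothing is
// reconciled from the invalid input.
func (r *Reconciler) Reconcile(desired ClusterConfig) (ClusterConfig, error) {
	if err := validate(r.lastValid, desired); err != nil {
		if r.lastValid == nil {
			// No valid configuration has been seen yet.
			return ClusterConfig{}, err
		}
		return *r.lastValid, err
	}
	r.lastValid = &desired
	return desired, nil
}

func main() {
	r := &Reconciler{}
	steps := []ClusterConfig{
		{NetworkProvider: "kuberouter", PodCIDR: "10.244.0.0/16"},
		{NetworkProvider: "calico", PodCIDR: "10.244.0.0/16"}, // unsupported change
	}
	for _, desired := range steps {
		effective, err := r.Reconcile(desired)
		if err != nil {
			fmt.Println("rejected:", err)
		}
		fmt.Printf("effective: %+v\n", effective)
	}
}
```

Note the trade-off described in option 2.1: in this sketch a single invalid field blocks the whole update, including any parts of the desired configuration that would have been safe to apply.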

Describe alternatives you've considered

A validating webhook is a more powerful approach to preventing invalid configurations from being stored in the cluster. At the same time, it's much heavier and more complex. It does not currently add significant value over the simpler CEL approach.
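
For comparison, this is roughly what the CEL approach could look like on a Go API type, assuming the CRDs are generated with controller-gen, which turns `+kubebuilder:validation:XValidation` markers into `x-kubernetes-validations` entries in the CRD schema. The struct below is a hypothetical, simplified example and not the actual k0s `ClusterConfig`; the field names and rules are assumptions chosen to show one mutual-exclusion rule and one immutability rule.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// NetworkSpec is a hypothetical, simplified configuration struct.
// The type-level rule rejects objects that set both provider sections.
//
// +kubebuilder:validation:XValidation:rule="!(has(self.calico) && has(self.kuberouter))",message="calico and kuberouter are mutually exclusive"
type NetworkSpec struct {
	// Provider must not change after creation. Rules that reference
	// oldSelf are transition rules and are only evaluated on updates.
	//
	// +kubebuilder:validation:XValidation:rule="self == oldSelf",message="provider is immutable"
	Provider string `json:"provider"`

	// +optional
	Calico *CalicoSpec `json:"calico,omitempty"`

	// +optional
	KubeRouter *KubeRouterSpec `json:"kuberouter,omitempty"`
}

// CalicoSpec and KubeRouterSpec are hypothetical placeholder types.
type CalicoSpec struct {
	Mode string `json:"mode,omitempty"`
}

type KubeRouterSpec struct {
	MTU int `json:"mtu,omitempty"`
}

// render serializes a spec to JSON. The CEL rules above refer to these
// serialized field names, not to the Go field names.
func render(n NetworkSpec) string {
	b, err := json.Marshal(n)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	fmt.Println(render(NetworkSpec{
		Provider: "calico",
		Calico:   &CalicoSpec{Mode: "vxlan"},
	}))
}
```

The markers are plain comments, so they have no effect at runtime. The validation only happens in the API server once the generated CRD has been applied, which is why this is so much lighter than running a webhook.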

"Let it crash": Terminate the process and wait for a restart. This is not suitable, as it would bring down the local API server, making it difficult to fix invalid configurations, and could also harm the entire etcd cluster in an HA scenario.

Additional context

A particular challenge is the stack applier, which is used by almost all k0s components to manage resources in the cluster. Currently, there is no good way to suspend a stack to prevent it from being applied. Such a mechanism may become necessary to keep an invalid configuration from being applied to the cluster. Suspending stack reconciliation could be achieved by suspending leader election globally, but this might prevent partial reconciliation as discussed above.
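
One conceivable alternative to suspending leader election globally would be a per-stack gate, so that only the affected stacks are held back while others keep reconciling. The following is a rough sketch of that idea only. The `StackGate` type and its methods are hypothetical and are not part of the k0s applier.

```go
package main

import (
	"fmt"
	"sync"
)

// StackGate is a hypothetical helper that tracks which stacks are
// currently suspended and why.
type StackGate struct {
	mu        sync.Mutex
	suspended map[string]string // stack name -> reason
}

func NewStackGate() *StackGate {
	return &StackGate{suspended: make(map[string]string)}
}

// Suspend marks a stack as suspended until Resume is called.
func (g *StackGate) Suspend(stack, reason string) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.suspended[stack] = reason
}

// Resume lifts a previous suspension. It is a no-op for active stacks.
func (g *StackGate) Resume(stack string) {
	g.mu.Lock()
	defer g.mu.Unlock()
	delete(g.suspended, stack)
}

// Apply runs the given apply function unless the stack is suspended.
// The boolean result reports whether the function was actually called.
func (g *StackGate) Apply(stack string, apply func() error) (bool, error) {
	g.mu.Lock()
	_, isSuspended := g.suspended[stack]
	g.mu.Unlock()
	if isSuspended {
		return false, nil
	}
	return true, apply()
}

func main() {
	gate := NewStackGate()
	gate.Suspend("network", "desired configuration failed validation")

	for _, stack := range []string{"network", "coredns"} {
		applied, err := gate.Apply(stack, func() error { return nil })
		fmt.Printf("stack=%s applied=%v err=%v\n", stack, applied, err)
	}
}
```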

twz123 commented 2 months ago

#4674 already addresses some non-CEL validation parts for the CRDs.