Revisit Autopilot w.r.t. target discovery in the Plan API

Autopilot's Plan CRD distinguishes controllers and workers in its target discovery. Referencing parts of the example in the documentation:

apiVersion: autopilot.k0sproject.io/v1beta2
kind: Plan
metadata:
  name: autopilot
spec:
  commands:
  - k0supdate:
      targets:
        controllers:
          discovery:
            static:
              nodes:
              - ip-172-31-44-131
              - ip-172-31-42-134
              - ip-172-31-39-65
        workers:
          discovery:
            selector:
              labels: environment=staging
              fields: metadata.name=worker2

From a user perspective, where should controller nodes that also run workloads (i.e. controller nodes started with --enable-worker) be listed? The current API leaves that unspecified. From k0s's perspective, it seems logical to list them in the controllers section, since those nodes were ultimately started via the controller subcommand. From a Kubernetes perspective, however, those nodes also appear as regular Node objects. In fact, Autopilot can update them when they are listed as controllers. But there are no safeguards in Autopilot that verify the discovered nodes in any way; it just assumes the node discovery is correct.

This ambiguity can lead to misconfigurations. Nodes may be listed in the wrong category, or in both. Such mistakes can be difficult to detect, especially with selector-based discovery, where it's not immediately clear which nodes are being selected. K0s doesn't currently verify the discovered nodes in any way; it simply tries to proceed with the plan as is. The impact of such misconfigurations, whether they lead to nodes being processed multiple times or even cause update or cluster failures, needs further investigation. Regardless of the impact, this is something k0s should be more helpful with.

As a first step, the documentation should clearly state that nodes should be categorized based on the subcommand used to start them (controller or worker), regardless of additional flags.

A possible next step could be some verification during plan processing. K0s could check that each node is listed in the appropriate section and that no node is listed more than once. It could then reject the plan with an error message stating the problem. Users could correct their configuration based on that message, and there would be fewer surprises.
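The verification idea above could be sketched roughly as follows. This is only an illustration of the checks, not Autopilot code: the function name `find_target_problems` and the `started_as` lookup (node name to the subcommand the node was started with) are assumptions made for this example, and how Autopilot would obtain that information is left open.

```python
def find_target_problems(controllers, workers, started_as):
    """Return human-readable problems found in a plan's discovered targets.

    controllers / workers: node names as resolved by discovery, per section.
    started_as: maps node name -> "controller" or "worker" (hypothetical
    lookup; not an existing Autopilot API).
    """
    problems = []
    sections = (("controllers", controllers, "controller"),
                ("workers", workers, "worker"))
    for section, names, expected in sections:
        seen = set()
        for name in names:
            # A node selected twice within the same section.
            if name in seen:
                problems.append(f"{name}: listed more than once in {section}")
                continue
            seen.add(name)
            actual = started_as.get(name)
            # Nodes without a known subcommand cannot be checked and are skipped.
            if actual is not None and actual != expected:
                problems.append(
                    f"{name}: listed in {section} but started as {actual}")
    # A node selected by both sections.
    for name in sorted(set(controllers) & set(workers)):
        problems.append(f"{name}: listed in both controllers and workers")
    return problems


if __name__ == "__main__":
    started_as = {"c1": "controller", "w1": "worker"}
    # "c1" is a controller with --enable-worker that a worker selector also matched.
    for problem in find_target_problems(["c1"], ["c1", "w1"], started_as):
        print(problem)
```

A plan would be rejected whenever the returned list is non-empty, with the collected messages forming the error shown to the user.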

Another possibility would be to silently auto-fix the discovered nodes. K0s could deduplicate them and automatically put them into the correct category. I'm not in favor of this approach: it requires more "magic" that is not obvious to users, and the actual course of action would differ from what was planned, which could surprise users.

Ultimately, looking at the auto-fix approach, I'd say that the distinction between controllers and workers in the discovery API is inherently unnecessary. Autopilot could simply inspect the node role and figure out the right thing to do on its own. A long-term solution might be to revise the API and remove the distinction. This would make things clearer for both users and implementers. Unfortunately, k0s is not (yet) prepared to handle changes to its Kubernetes API, so there are some challenges here. Nevertheless, k0s would benefit from being able to support multiple API versions, as there are other APIs that would also benefit from some changes, especially the ClusterConfig CRD.
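To illustrate what removing the distinction could mean in practice, here is a minimal sketch in which a plan carries a single target list and the split is derived from each node's role. The function name `partition_targets` and the `started_as` mapping are hypothetical and only stand in for whatever role inspection Autopilot would actually perform.

```python
def partition_targets(targets, started_as):
    """Split one flat target list into (controllers, workers) by node role.

    targets: node names from a single, role-agnostic discovery section.
    started_as: maps node name -> "controller" or "worker" (hypothetical
    stand-in for inspecting the node role).
    Raises KeyError for a node whose role is unknown.
    """
    controllers, workers = [], []
    # dict.fromkeys drops duplicates while preserving discovery order.
    for name in dict.fromkeys(targets):
        if started_as[name] == "controller":
            controllers.append(name)
        else:
            workers.append(name)
    return controllers, workers


if __name__ == "__main__":
    roles = {"c1": "controller", "c2": "controller", "w1": "worker"}
    controllers, workers = partition_targets(["w1", "c1", "w1", "c2"], roles)
    print("controllers:", controllers)
    print("workers:", workers)
```

With such a design, a controller started with --enable-worker can only ever land in one group, so the "wrong section" and "listed twice" cases cannot be expressed in the first place.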

See: #3739

k0sproject/k0s: Revisit Autopilot w.r.t. target discovery in the Plan API #3750