Disassemble the Component interface

Is your feature request related to a problem? Please describe.

The component.Component interface is the backbone of k0s's internal structure:

https://github.com/k0sproject/k0s/blob/71a2c707b7dd64a622160c27c0b72e1bc574a775/pkg/component/component.go#L25-L30

It serves as a lifecycle abstraction for all the different things in k0s. However, its contract is not very well defined, which makes it hard to judge what to do when implementing it.

What is the anticipated order in which the lifecycle methods are being called and what are the expectations about concurrency safety? There has been some effort in documenting the lifecycle in k0sproject/k0s#1657, but it's still not clear enough. Are components expected to be concurrency safe, or is it the responsibility of its users to ensure that access to the lifecycle methods is synchronized?
Many components don't need Init, Stop or Healthy implementations. Some even don't need a Run method, since they only act on Reconcile. Especially the Healthy method is only used in the Manager when starting components.
Context handling and its interaction with Init/Run/Stop behavior. Currently, both the Init and the Run method take a context parameter. It is not entirely clear how those contexts should be used, given that there's also the Stop method. Are components expected to stop by themselves if any of those contexts is cancelled (a.k.a the "Merge" problem)? If yes, why does the Stop method exist?

Describe the solution you would like

The key purpose of the Component interface is to drive the lifecycle of multiple components via the Manager. It is responsible to call the "main" lifecycle methods Init, Run, Stop and Healthy. It does it in a sequential manner w.r.t. each individual component. For those methods, it seems reasonable to document that they don't need to be concurrency safe. Then there's the Reconcile method. It's special in a way that it's not really a lifecycle method, but a special method to be called "out of band" on each registered component. Some more complex components have even more public methods, all of them may be called at different times, potentially concurrently. Let's clearly document that: as long as a component only exposes the fundamental lifecycle methods, it doesn't need to care about concurrency. But as soon as it's doing more than that, it should.
Instead of combining all of the methods into one interface, let's split them up into individual interfaces. There's already the specialized ReconcilerComponent interface for those components that require/support reconciliation. The Manager simply uses type assertions to figure out if a component implements it or not. How about having an Initializable and Stoppable interface for those components which support those lifecycle phases, and let the Manager use type assertions to figure it out. This would remove a lot of empty function implementations from the codebase, and would also obliterate the question of whether a component should actually fail if Stop was called before Run when it cannot be stopped anyway (same for Run before Init, when Init is a no-op). This approach is also extensible, when there are more lifecycle phases to be added in the future, there could be just a new interface for those. Not all existing components would need to implement it. Concerning the Healthy method: Given that only the Manager uses it when starting components, I would argue that we can remove it completely. Instead, for those components that have a reasonable Healthy implementation, it should be embedded into the Run method. Run can then block until the component becomes healthy. The timeout loop implemented in the Manager's waitForHealthy method could be transformed into a context that gets cancelled if Run doesn't return in time.
```
type Initializable interface {
    Init(context.Context) error
}

type Component interface {
    Run(context.Context) error
}

type Stoppable interface {
    Stop() error
}

type Reconcilable interface {
    Reconcile(context.Context, *v1beta1.ClusterConfig) error
}
```
Let's define the context passed to Init as a short-lived one that may get cancelled after Init returns. The Run method would better be called Start, since it's actually starting something, not running it (i.e. it's not blocking until stopped). The stop method is currently being used to block the Manager until a component is actually stopped. It combines both the cancellation request and the blocking until cancelled into one method. How about using the context passed to Start as the cancellation signal, and returning an error channel that may be used to block until the component has actually been stopped? This would require us to make the Manager implementation a bit more elaborate, but It would make implementing Components easier. I consider this a plus, since there's only one Manager, but many components.
```
type Initializable interface {
    Init(context.Context) error
}

type Component interface {
    Start(context.Context) (<-chan error, error)
}

type Reconcilable interface {
    Reconcile(context.Context, *v1beta1.ClusterConfig) error
}
```

Describe alternatives you've considered

No response

Additional context

No response

Sounds good to me in general. More fine grained interfaces will make component implementations much tolerable and removes the guess work if something needs to be implemented or not. 👍

The only part that I'm not really sure is the using of error channel as return value on Start(context.Context) (<-chan error, error). I get that it can be used to block the Manager until the component is really stopped but feels bit wrong to use error channel for that purpose. I mean does the manager really care/want to see errors emitted from a component? If we use it only to notify manager that the component is really stopped, why not use some bool/empty struct channel for this purpose? Or are there real use cases where Manager would really like to receive errors from running components?

I mean does the manager really care/want to see errors emitted from a component?

Good question. The current implementation for Stop looks like this:

https://github.com/k0sproject/k0s/blob/63f4bd1c739fa915eade5470fd5be740bfa657d4/pkg/component/manager.go#L100-L121

So it stops the components sequentially in reverse order, logging any errors. The "done" channel returned from Start is basically a 1:1 asynchronous replacement of the synchronous Stop method on the Component interface. The <-chan error type stems directly from the error return type of Stop. Sure, one could argue why this isn't Stop() bool or simply Stop(). The component manager itself doesn't really do something with the returned error, besides logging it, and returning an error on its own if any of the managed components failed to stop. The returned error of the manager is again just logged and discarded afterwards.

We could completely remove the error (i.e. return <-chan struct{}), remove the error handling and move the error logging into for each component's Stop implementation.

I'd rather not do this. I think that "error passing" is a solid, future-proof strategy for functions that aren't infallible. "You don't know what to do with this failure? Let your caller handle it!". Whenever we actually need some special handling in the future, we'd need to refactor the Component interface again, touching all of its implementors. It might already be useful in unit tests, to check if the component stops correctly or not.

I'd change the manager's implementation to not log anything, but return some composite error, as we do for errors during reconciliation. Then we can still log it when stopping the manager (and I'd consider to terminate the k0s command with a non-zero exit code in case it failed to shutdown cleanly).

Well, I would rather think about this from the lifecycle manager perspective. It has to do a few things:

initialize a component if needed;
start it;
and stop it;

Init is optional, since it can be done at the beginning of the starting process, but it could be convenient to have everything prepared in advance.

Renaming Run to Start sounds good to me since it's much closer to what it actually does.

And we can't get rid of the Stop function, since at some cases we might need to preserve the reversed order (eg. the metrics server should be stopped after the API server). Also, we have the following phrase in the comments:

The given context is not intended to replace a call to Stop when canceled. It's merely used to cancel the component's startup.

Reconciling is a completely separate process and it's not coupled with the Component interface at all.

So, we can move the Init method to the separate interface and make it optional as it has been done with the Health function, but then we will have some inconsistency and two different interfaces, which won't make things easier in my opinion.

https://github.com/k0sproject/k0s/pull/2059

And we can't get rid of the Stop function, since at some cases we might need to preserve the reversed order (eg. the metrics server should be stopped after the API server).

The manager needs to have control over the stopping behavior, true. The proposal changes the way it's done from two distinct Start and Stop methods to a single Start method with returns some sort of "done" channel. Stopping can still be done in reverse order. The manager can have one context per component, and cancel each of those sequentially (cancel component context, await done channel, go on to next component).

I see the following advantages of the single start method approach over the double start/stop method approach:

There's some cognitive dissonance between the context that's passed to Start and the Stop method. What is the difference for a component between the cancellation of the context and calling the Stop method? I'd argue there's none. Implementing a component that behaves this way requires quite some boilerplate. It has to keep its own context as a struct filed, or spawn some goroutine that calls Stop whenever the context gets cancelled. The manager retains the possibility to wait for a component to be shutdown via the done channel.
There's no way to call Stop from a wrong lifecycle state. A component can only be stopped by cancelling the context that has been passed to Start.
Many components are one-shot anyways, they don't need to implement Stop at all.

The proposal changes the way it's done from two distinct Start and Stop methods to a single Start method with returns some sort of "done" channel. Stopping can still be done in reverse order. The manager can have one context per component, and cancel each of those sequentially (cancel component context, await done channel, go on to next component).

Yes, it will have the same behavior as we have now, the only difference is that we'll need to close the "done" channel instead of explicitly calling the Stop function. But we also have to store all these channels in some intermediate array/slice and have to have a "stop" goroutine in the component. I personally prefer explicit API like this and make the manager take control over the order of calling the Stop function:

type Component interface {
   Start(context.Context) error
   Stop(context.Context) error
}

over some kind of this:

type Component interface {
   Start(context.Context) (done chan<- bool, err error)
}
// --------------
…
// --------------
func (c *MyComponent) Start(context.Context) (chan<- bool, error) {
    …
    done := make(chan<- bool)
    go func() {
         <- done
        c.Stop()
    }
    return done, nil
}

There's some cognitive dissonance between the context that's passed to Start and the Stop method. What is the difference for a component between the cancellation of the context and calling the Stop method? I'd argue there's none. Implementing a component that behaves this way requires quite some boilerplate. It has to keep its own context as a struct filed, or spawn some goroutine that calls Stop whenever the context gets cancelled. The manager retains the possibility to wait for a component to be shutdown via the done channel.

Context has nothing to do with stopping a component. And shouldn't, since it's just a context. Even context cancellation is just an agreement, not a contract. Components don't have to respect the context at all (and most of them don't), but they have to respect the contract, ie interface. The difference between the cancellation of the context and calling the Stop method is that Stop function we can call synchronously and wait for it to complete.

Many components are one-shot anyways, they don't need to implement Stop at all.

I don't think it's a problem. It's quite ok if some components don't need some function or another. More importantly, we have a simple and clear contract, which is easy to use and implement. The same I can say about the Init function. I would prefer an explicit way of saying "I don't want to do anything here" over having php-like "magic" functions and a more complex lifecycle manager.

Context has nothing to do with stopping a component. And shouldn't, since it's just a context.

That blog post covers widely perceived weaknesses about Go's Context as it is right now. It combines several unrelated use-cases into one interface. In fact, it points out an issue with Context that k0s's Component interface tries to solve: Having an asynchronous cancellation request and awaiting its outcome. There should be an easy-to-implement way for components to support that. So I'd be perfectly okay with not using Context, but something else, as long as it's not a burden to implement in the components. (We have a lot of them, and many of those are not compliant to the requirements of the interface right now. See below.) When looking at all the different component implementations in k0s, it felt best to use channels for this purpose: One for sending the cancellation request, one for communicating the outcome when that request was processed. That's why I chose Start(ctx context.Context) (done <-chan error, err error), which could also be Start(stop <-chan struct{}) (done <-chan error, err error), of course. Although, given that the vast majority of the current Go APIs require a Context for async cancellation, it feels moot not to use a Context, since components would then need to construct their own Context again, just to call other APIs.

The Start/Stop and Start-only approaches are convertible into each other, while I think that the Start-only approach caters to more of k0s's components as the other one. There's quite some state that can be removed from the component structs, invalid states and concurrency issues are less likely to occur.

Whichever approach we choose, there should be some auxiliary structs/functions added to the codebase which help in implementing the interface, i.e. to convert between the synchronous style and the asynchronous style, depending on what the interface prescribes and the component needs. Come to think of it, there could even be a wrapper struct, so components could choose to implement the sync or async version, and component.Manager could then just wrap the non-supported interface into the supported one.

Even context cancellation is just an agreement, not a contract. Components don't have to respect the context at all (and most of them don't), but they have to respect the contract, ie interface. The difference between the cancellation of the context and calling the Stop method is that Stop function we can call synchronously and wait for it to complete.

There are quite some components in the k0s codebase which actually use the context in the proposed way already, having a no-op Stop, and quite another bunch that have extra code to convert the synchronous call to Stop into a an asynchronous cancellation request, either by wrapping the context, storing a CancelFunc and call that to stop themselves, or by storing a self-made stop channel and close that on stop. Those components don't even block until they've shutdown. So they already ignore the contract, even with an explicit Stop method. Those components would profit most from the proposal.

Many components are one-shot anyways, they don't need to implement Stop at all.

I don't think it's a problem. It's quite ok if some components don't need some function or another. More importantly, we have a simple and clear contract, which is easy to use and implement. The same I can say about the Init function. I would prefer an explicit way of saying "I don't want to do anything here" over having php-like "magic" functions and a more complex lifecycle manager.

Explicitness is okay, but nevertheless it didn't help the components to have correct implementations for Stop. Some of them have no-op implementations when they shouldn't, and others don't block properly until they actually stopped. Moreover, I wouldn't compare the optional usage of a method argument to magic functions, as it's clearly visible in the method signature, and not hidden in a context variable, thread local or similar.

Concerning the complexity of the manager: I'd happily move complexity out of the components into the manager, since we have only one manager, but dozens of components. So we just need to get it right once, not dozens of times.

To conclude: I'm okay with both approaches, having a preference for the Start-only one. Whichever the consensus is, we should adapt the component implementations to better adhere to the Component interface:

Here's a summary:

applier.Manager already uses the context from Init in the way that's suggested for the Run method, and has a no-op implementation for Stop, which violates the Component contract in many ways.
The APIEndpointReconciler wouldn't need a stop channel.
Autopilot Controller and Autopilot Worker use the context passed to Run in the suggested way already, and have a no-op Stop method.
The ClusterConfigReconciler states the proposed behavior in its no-op Stop method.
K0sControllersLeaseCounter mimics the proposed behavior by storing the CancelFunc and calling it in its Stop implementation.
CoreDNS wraps the Context passed to Run and stores a CancelFunc so that it can be called in Stop.
CSRApprover wraps the Context passed to Run and stores a CancelFunc so that it can be called in Stop.
K0sCloudProvider closes a self-made stop channel in its Stop method, which is no different from using ctx.Done().
Konnectivity wraps the Context passed to Run and stores a CancelFunc so that it can be called in Stop, and also calls supervisor.Stop().
Metrics wraps the Context passed to Run and stores a CancelFunc so that it can be called in Stop.
MetricServer wraps the Context passed to Run and stores a CancelFunc so that it can be called in Stop.
NodeRole uses the context passed to Run in the suggested way already, and has a no-op Stop method.
TunneledEndpointReconciler uses the context passed to Run in the suggested way already, and has a no-op Stop method.
telemetry.Component closes a self-made stop channel in its Stop method, which is no different from using ctx.Done().

Components that are using Supervisor and favor the synchronous way:

Some more elaborate components not fitting directly into one category:

All the other components don't need to be stopped, i.e. they have a no-op Stop method and either don't use the Context at all or only during the execution of Run itself.

I think it is outdated now?

k0sproject / k0s