Evaluate an higher level abstraction on top of the Reconcile Loop

andreaTP commented 2 years ago

The Reconcile Loop provides a basic building block that allows writing simple operators.

When an operator starts to perform some "more advanced" operations the logic can easily become messy, here I would like to explore what the SDK can offer to help structuring the code of more complex workflows.

To start the discussion I can put on the table an early experiment I ran to model the Reconciliation loop as a Finite State Machine: https://github.com/andreaTP/poc-operator-mutable-jar/blob/34d65d9626b7a1202ba1a8197366f49e2f2c0aa6/src/main/java/org/keycloak/ControllerFSM.java#L16-L21

This implementation is based on an opinionated decision:

use (at least part of) the Status field of the CR to store the status

Running the example you can already see in action a few of the "benefits" of this approach:

The workflow description is extremely evident and clear
Easier to read (at least for me since I have done a ton of Actor programming 😛 )
The steps of the workflow are isolated and they can be run/be tested in isolation
Easy debugging: when something fails you need to debug only the current state and not the entire reconcile loop
A primitive but working control over state transitions

There are various additional possibilities if we decide to build on top of this foundation, such as:
The retry mechanism of the SDK can be configurable on each state
The compensation action caused by a failing state can be defined specifically for each state
Timeouts bound to specific states
Configurable automatic logging of the states and the transitions
...

csviri commented 2 years ago

What I claim that we need is very similar to the constructs of terraform (analogy is not that nice since terraform has sync execution). So easily and as declaratively as possible support an directed acyclic graph (DAG) of resources, which migh depend on each other. And based on that the reconciliation can be executed easily by a scheduler in the background. But basically yes a state machine in the background. But the point is, its a declarative DAG of resources.

Please take a look on a similar issues which is now @metacosm working on (and related PR: https://github.com/java-operator-sdk/java-operator-sdk/pull/785 )

There were discussion how to approach this higher level abstraction. I started to prototype also a solution ~ 10 months ago, then abandoned because of time constraints. But was different from the one which now @metacosm is working on, in sense that is was just an abstract reconciler on top of current abstraction layer.

But taking a step back, we had a discussion where to put such logic. If it should go directly to the core components, it should be a special Reconciler implementation, thus a different module that does not interfere with the basics "building block" or the current layers. Maybe it's worth to open also this discussion again, and / or have a design meeting together. Maybe have the core now just extendable to support such abstractions. And implement one. (Might be a good topic for Thrusday meeting community @jmrodri )

Note that we intentionally did not push this into v2 since this is a quite big topic and needs probably more discussion. But yes I think we all agree on that part that, this screams for a higher level abstraction :)

csviri commented 2 years ago

One think maybe to add, regarding the state, it does not needs to be stored (it actually should not be), we should reconcile all the resource we manage all the time. And build up the state machine in memory. Maybe just store the current result after a reconciliation into the status, but not use it as an input.

andreaTP commented 2 years ago

The idea of a DAG is super intriguing, especially if the operator needs to reach a stable "desired state" without executing "workflows" which, unfortunately, was not my POC primary use case.

I agree that this is a big change and we should start experimenting, possibly, in end user codebases to properly evaluate pros/cons before integrating on the SDK.

Pretty happy to have a discussion about the subject!

I understand that not saving the state is a safe point, I'm arguing that, with enough instruments(timeouts/ retries etc.) to handle it the end user logic can be significantly simplified, kind of following the example of Event Sourced systems.

metacosm commented 2 years ago

Please take a look at the dependent resource PR which adds another layer on top of the reconcile loop. Status handling is next in the list of features to go through.

lburgazzoli commented 2 years ago

I understand that not saving the state is a safe point, I'm arguing that, with enough instruments(timeouts/ retries etc.) to handle it the end user logic can be significantly simplified, kind of following the example of Event Sourced systems.

As I have implemented a state machine in camel-k that does exactly what is described here (well, not with a nice framework but the concept is the same), I'd say that there there's no single answer to know where/when/how keep track of the status:

if you rely only on k8s resources, then you may be able to derive the status from the platform but in some cases, especially for technical/transition states that may cause excessive chattering with the api server. Of course also saving the state machine state in the status may have the same effect
if the state machine relies on external systems, you may need to keep track of the status (i.e. the external resource is not idempotent so you can't send the same request twice)

IMHO, how the state machine state is computed, is something you should be able to customize.

csviri commented 2 years ago

if the state machine relies on external systems, you may need to keep track of the status (i.e. the external resource is not idempotent so you can't send the same request twice)

Just a little comment on this, we had discussions in the past with multiple people about where to store such state. (From OperatorSDK developers and Kubernetes Operator Slack), the conclusion was that an alternative and generally preferred pattern is to put such state in a ConfigMap, Secret or a dedicated CustomResource (for validation). So the status is only a kinda derived value from the last execution or output of the reconciliation.

(I personally have mixes feelings on this, just mentioning, will try to find the issue)

andreaTP commented 2 years ago

we had discussions in the past with multiple people about where to store such state.

Providing alternative options for storing the State seems to be highly feasible, almost a technical detail, e.g. a StoreProvider might have several implementations built on the use-cases:

no store (ephemeral)
ConfigMap
Secret
State
...

lburgazzoli commented 2 years ago

if the state machine relies on external systems, you may need to keep track of the status (i.e. the external resource is not idempotent so you can't send the same request twice)

Just a little comment on this, we had discussions in the past with multiple people about where to store such state. (From OperatorSDK developers and Kubernetes Operator Slack), the conclusion was that an alternative and generally preferred pattern is to put such state in a ConfigMap, Secret or a dedicated CustomResource (for validation). So the status is only a kinda derived value from the last execution or output of the reconciliation.

I recall that :) and I have the same feeling as you for a number of reasons.

Anyway it does not change much the logic as in fact the status is persisted somewhere in a resource so, it is just a matter to give to the "state" enough freedom.

scrocquesel commented 2 years ago

Does State mean also the observed spec ? Actually for external dependency, it is not always possible to set a reference to the CR. Thus, if the CR spec change while the operator is down, you can lose track of a previously created dependency.

What i'm doing is to push a copy of spec in .status.observedSpec, and when polling per resource, I will fetch the dependent resource that match first the observedSpec if present or spec. The reconciliation will run because observedGeneration is older than the current one and reconciler will be able to see what changed by comparing dependent resource version with the current spec.

csviri commented 2 years ago

Does State mean also the observed spec ?

As you describe observedSpec, it makes sense to solve this in general. But you can set also just specific information that is needed to be able to reconcile the external resources. But as mentioned there are different views on this, that if the state actually should be part of status or rather an other resource.

What i'm doing is to push a copy of spec in .status.observedSpec, and when polling per resource, I will fetch the dependent resource that match first the observedSpec if present or spec. The reconciliation will run because observedGeneration is older than the current one and reconciler will be able to see what changed by comparing dependent resource version with the current spec.

Generally this works, I think. Not sure if it's a generally desired pattern in all cases when dealing with external resources. But I have no strong opinion on that. Again there are opinions that is should be rather a Secret where the current state stored.

scrocquesel commented 2 years ago

Generally this works, I think. Not sure if it's a generally desired pattern in all cases when dealing with external resources. But I have no strong opinion on that. Again there are opinions that is should be rather a Secret where the current state stored.

It's not only about external resources. It's also the case when the target resource is not the representation of the CR spec. For e.g., I have a oauthclientpolicy resource used to edit an existing oauthclient resource to configure part of it (say an array of allowed scope). Many oauthclientpolicy resources can contribute to one oauthclient resource.

When the oauthclientpolicy change, we need to know which scopes were added. The spec, is not enough, because if an update fails the spec become out of sync. We must persist some kind of an undo operation state (which is also used for cleanup) Maybe this can be generalised in some way.

csviri commented 2 years ago

@scrocquesel This makea sense, I'm not sure how we would generalise that, but would discuss that under a separate issues and discuss there the use cases. Like making some case study. Could create like "automatic state management" issue ? Thx

github-actions[bot] commented 2 years ago

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days.

csviri commented 2 years ago

Will close this since this is now solved since this should be now covered by dependent resources.

operator-framework / java-operator-sdk

Evaluate an higher level abstraction on top of the Reconcile Loop #843