Separation of module lifecycle and business logic statuses

janmedrek commented 8 months ago

Created on 2024-03-06 by Jan Medrek (@janmedrek).

Decision log

Name	Description
Title	Separation of module lifecycle and business logic statuses.
Due date	2024-03-31
Status	Accepted on 2024-03-31
Decision type	Binary
Affected decisions	None

Context

In the current design, the "state" of modules is quite blurry - it is a mix of both deployment health coming from the Kyma Lifecycle Manager and business logic status that is monitored by the module operator. This causes the need to synchronize the value of this field and take into consideration multiple factors that do not fall under the same category. The overall situation is that it is not clear what is the current state of both deployment health and the business logic of the module.

Decision

A clear separation should be introduced between two state types for each module:

Manifest State - standing for the overall health of K8S resources that the Module consists of. This state is managed solely by the Kyma Lifecycle Manager, it contains information on whether the K8S manifest was applied correctly and deployment is running or not.
Module State - representing the state of "business logic" of the module. This field is under the complete control of the Module Operator and can be used to communicate the current condition of the module to the User.

Then the following values should be used in the State fields:

Manifest State
- Ready - Manifest was applied correctly and the workload is running
- Processing - Lifecycle Manager waits for the pods to get up/K8S to apply changes
- Deleting - The manifest is under deletion, most likely waiting for some blocking resource removal
- Error - There was an issue that is not able to be resolved automatically and a grace period/number of retries was used
Module State
- Error - An error or problem has occurred. The message contains actions you need to perform to resolve the issue. Example: You configured an expired certificate.
- Warning - A statement of fact that alerts you to a potential mistake. You may proceed with your action despite the warning. Example: You configured a certificate which is about to expire
- Ready - No errors or warnings
- Deleting/In Progress - Module Operator is performing actions

Consequences

Module state that is reflected in the Manifest CR's status.state field will now be used to represent the Manifest Status. This means that it will only contain information on the success/failure of the k8s resource application in the cluster and whether the deployment remains in a healthy state or not.

The Kyma CR will be also affected in the following ways:

every module listed in the status.module section will contain the state field managed by the KLM and representing the Manifest Status
every module listed in the status.module section will contain the reference to the specific Module CR that corresponds to that module
status.state will be an aggregation of the Manifest Statuses of all modules enabled within that Kyma

Then for the "business" status of the modules - the operator should manage the corresponding Module CR and update the state field there to indicate the Module Status. This is a field that can be presented to the user to notify them whether the module logic is running smoothly or if some kind of manual action is needed.

There should be a clear recommendation/contract on the values used for the Module Status, including:

As an example, here is the Warning status of Eventing CR, which indicates the backend was not specified and the module is not ready to handle events:

status:
  activeBackend: ''
  conditions:
    - lastTransitionTime: '2024-03-26T07:01:17Z'
      message: Backend config is not provided. Please specify a backend.
      reason: BackendNotSpecified
      status: 'False'
      type: BackendAvailable
  specHash: 0
  state: Error

And a status section from the corresponding Kyma CR, which includes Ready for both eventing and overall Kyma state as all the modules were deployed successfully and the workload specified in the Module Manifest runs properly:

  modules:
    - channel: fast
      [...]
      name: eventing
      resource:
        apiVersion: operator.kyma-project.io/v1alpha1
        kind: Eventing
        metadata:
          name: eventing
          namespace: kyma-system
      state: Ready
      [...]
  state: Ready

k15r commented 8 months ago

Do i understand this correctly:

the module state from KLMs point of view is only maintained in the manifest cr which is not visible to the user
the modules businesslogic state is maintained by only by the operator and is visible to the user

if so: i totally agree with the decision to have the operator maintain the business logics state and expose it to the user. This way misconfigurations can be clearly communicated to the user. But hiding the modules state in the manifest which is not visible to the user will lead to confusion. A user enables a module, but it cannot be brought to a working state (KLMs view). The user needs to be able to see this condition (eg in the kyma cr in their cluster)

janmedrek commented 8 months ago

@k15r yes, you got it right.

Regarding your concerns - I would not "hide" the Manifest Status from the user, we can display it along with the Module Status in the Busola. We just need to figure out a transparent way so it is clear "which is which".

All the Manifest Statuses will be available in Kyma CR (and aggregated to the overall Kyma State), thus the separation will be:

In Module CR there is the "business status"
In Kyma CR there is the "manifest status"

So all of those will be clearly visible to the user.

pbochynski commented 8 months ago

Can you add a simple example of status field of Kyma CR and corresponding module CR? I think it would be easier to grasp the idea seeing the yaml content :)

a-thaler commented 8 months ago

About the module status, I think here it is important to drill it down to the 3 colors of a traffic light, that the user is used to interpret. As the error status will be available now as a business state, I suggest to have:

Error - An error or problem has occurred. The message contains actions you need to perform to resolve the issue. Example: You configured an expired certificate.
Warning - A statement of fact that alerts you to a potential mistake. You may proceed with your action despite the warning. Example: You configured a certificate which is about to expire
Running/Healthy - No errors or warnings

An explicit "Action required" might not really needed as the error state is actionable enough.

janmedrek commented 8 months ago

@pbochynski I've added an example. I am a bit worried it may not be as transparent if both resources use state to indicate essentially two different things.

Perhaps it would be better to have something like status.moduleState in the ModuleCR?

@a-thaler thanks for the suggestion, I've added it to the proposal.

valentinvieriu commented 8 months ago

We are actually trying to unify the status across resources. Have a look how we plan to deal with them in UI https://www.figma.com/file/AkzF9I1INdwc5iUtm5fKNU/Redesign-%26-Modules?type=design&node-id=7548-81549&mode=design&t=0pDZj0xsjmYAwSj9-0

valentinvieriu commented 8 months ago

@janmedrek can you add https://github.com/kyma-project/busola/issues/2624 the details of the outcome? maybe some usecases

janmedrek commented 8 months ago

Regarding the naming clash, we could switch the KLM-related state to Availability or AvailabilityState instead of plain State. There would be also an argument behind using something more sophisticated than the plain State for the user-facing part. I'd like to discuss that with the team as well.

As for values, it would make sense to support Ready/Available, Error/Unavailable, and Processing/Pending.

NHingerl commented 8 months ago

About the module status, I think here it is important to drill it down to the 3 colors of a traffic light, that the user is used to interpret. As the error status will be available now as a business state, I suggest to have:

Error - An error or problem has occurred. The message contains actions you need to perform to resolve the issue. Example: You configured an expired certificate.

Warning - A statement of fact that alerts you to a potential mistake. You may proceed with your action despite the warning. Example: You configured a certificate which is about to expire

Running/Healthy - No errors or warnings

An explicit "Action required" might not really needed as the error state is actionable enough.

This proposal matches the UA/UX guidelines for BTP; I fully support this.

kyma-project / community