kyma-project / lifecycle-manager

Controller that manages the lifecycle of Kyma Modules in your cluster.
http://kyma-project.io
Apache License 2.0
10 stars 30 forks source link

If module under deletion LifecycleManager is not updating Manager version #833

Closed PK85 closed 9 months ago

PK85 commented 1 year ago

Description

For example, the user decided to uninstall btp-manager. BtpOperator CR blocks that operation and returns information to a user to delete all ServiceInstance and ServiceBindings first. After two months, btp-manager itself had of date version cause LM is not bumping versions.

Expected result

BTP Manager is always up to date.

Actual result

BTP Manager is on version from the date when user disabled module.

Steps to reproduce

Create SKR, enable btp-operator, create fake Service Instance, disable btp-operator.

Troubleshooting

ruanxin commented 1 year ago

This actually leads to a more general issue - how we treat resources under deleting.

After following the step to reproduce, the root cause of LM preventing all resources (not only manager deployment) from being updated is because it is still pending at the step to delete module CR (btpoperator), which can be observed in manifest events.

Events:
  Type     Reason     Age                   From                                Message
  ----     ------     ----                  ----                                -------
  Warning  PreDelete  43m (x145 over 112m)  declarative.kyma-project.io/events  deletion of custom resource was triggered and is now waiting to be completed
  Warning  PreDelete  50s (x409 over 40m)   declarative.kyma-project.io/events  deletion of custom resource was triggered and is now waiting to be completed

The question is if the user disabled one module and the module CR is stuck under deleting (deletionTimestamp set), then user enables this module again, how do we deal with that module CR?

The old module CR should be deleted first, but who should do this?

I need some opinion to make a decision here. @PK85 @pbochynski

PK85 commented 1 year ago

@ruanxin I agree with you that the situation you described must be discussed, but still when module is under deletion LM should reconcile BTP manager to the newest version. Do you agree?

PK85 commented 1 year ago

@ruanxin about your questions, I think that LM should never (maybe only SKR deletion could be the exception)remove finalizers in such scenarios.

Only the User can decide what to do in such a situation. That means deciding to remove a reason (for btp operator, remove SI/SB) or if want to keep the module working remove the finalizer and enable the module again.

That means if LM sees a situation that module is under deletion, and on the other side user again added it to the spec, then LM waits till the module is deleted, and then installs that again.

ruanxin commented 1 year ago

In my opinion, we should not allow these inconsistent resources between versions (new operator handle old module CR), it may bring incompatible issue if the module CR has introduced breaking changes in the future. It may never happen for BTP manager, but for other operators, we don't know, introducing it as a general process in LM is a bit dangerous. @PK85

PK85 commented 1 year ago

Only keeping Managers up to date will provide a way to keep module CR up to date. You gave an example that Managers handle old module CR, but this will never happen, cause every module version should take care about any breaking changes introduced to CR etc, adjust it and make it valid. @ruanxin

PK85 commented 1 year ago

Another issues that are present right now:

pbochynski commented 1 year ago

My answers:

ruanxin commented 1 year ago

Can I summarize the current decision like below? @PK85 @pbochynski

For module CR stuck under deleting:

  1. If module template is still in the existing version, LM prevents users from enabling this module again until they manually resolve that stuck situation.

  2. If new module template version exists, LM forces upgrade module operator and other related resources (like RBAC) but does not install module CR ( I think it's also impossible to install the module CR because it's under deleting).

Update (ignore above statement)

For module CR stuck under deleting: LM does not prevent the user from enabling module, but the module state is the same as before. This also means, if a new module template version exists, LM will upgrade the module operator and other related resources (like RBAC, module CRD).

Regarding the module CR, after creation, should not be updated by LM, which is the current implementation.

ruanxin commented 1 year ago

Update

@PK85, after further concern, I think for your scenario, it really makes sense let your module operator create your module CR, then the lifecycle of the module CR is completely owned by your module, LM doesn't care if the module CR is stuck under deleting when user disables module, it will delete other module resources and ignored your module CR.

The moduleCR is an optional field default-cr when you create module with Kyma CLI, in the new generated module template, you will see there will be no spec.data exists, compares to the existing one.

PK85 commented 12 months ago

@ruanxin BTP Manager deployment should not create or delete own CR. It should be created by someone else e.g. User or LM as it's now. I don't understand why you do not want to keep it as we discussed. That means LM creates CR once and then always make updates to the newest available module template regardless of the CR status.

ruanxin commented 12 months ago

@PK85 , The thing is if module CR is managed by LM, then the generic logic to deal with module disable is like below:

  1. delete module CR, wait until it's fully removed
  2. delete other resources, including module Operator

So the logic here is very clear. In most cases, module operator should deal with the lifecycle of module CR, which means set/remove the finalizer of module CR.

Even if a user re-enables the module during deletion, as long as the module CR can eventually be deleted by operator, LM can proceed with step 2 to remove old resources entirely and create new ones, which means after one module is enabled, its resources are always up-to-date.

However, the BTP Manager presents a special scenario. It only sets the finalizer and does not control the removal of the finalizer. Instead, the removal of the finalizer is indirectly controlled by users who must manually remove dependent resources before the BTP Manager can remove the finalizer.

So the deleting logic is blocked by the user in the first step I mentioned above, then it makes sense to me, ignoring the module CR entirely in this context. It's important to emphasize that this isn't a workaround, it's a valid scenario that is supported by the current LM implementation, and for your case, it's actually a perfect fit, which also means LM doesn't require any additional special logic to handle this situation.

But now I really need you to tell me a bit more why BTP Manager should not create CR (I understand not delete CR, it should be controlled by the user). From a coding perspective, it's a very simple logic in your operator, detect if the module CR is there, if not, create it, otherwise, continue.

However, if the argument against the BTP Manager creating the CR is really convincible, here's an alternative solution: we can introduce some policy in ModuleTemplate to tell LM that the default CR can be ignored during deletion, this will also work. CustomResourcePolicyIgnore = "Ignore" works here also, but it's configured in Kyma spec, which configured by user, it's not a solution here unfortunately.

I hope it's also answered your question, it's not about revoking our discussion, the proposal I'm giving exactly addresses the situation that LM only care about your module CR during creation (or can be ignored) and does not care about deletion, so that operator still possible align with the lifecycle of module, if disabled, operator is removed, if enabled, operator created.

To emphasize one thing, we must avoid a situation where the module is in the process of being deleted while still allowing updates to the operator and other related resources. Introducing such logic would create significant complications in the lifecycle management of our manifest CR.

If your module CR has a finalizer configured, it will not be deleted even if operator is gone, I hope this is clear. So your argument that the module is in the process of deletion but the operator should continue to exist and stay up-to-date is not valid. If we can decouple the deletion process between the operator and the module CR in your specific case, this problem would not exist. Because your operator is completely gone, leaving only the module CR in the process of deletion. When a user re-enables the module, LM can deploy the latest operator without any conflicting issues.

pbochynski commented 12 months ago

Guys, let me quickly summarize the discussion in the concise way. What should happen for create, delete and update scenarios. Assumption: CustomResourcePolicy=CreateAndDelete

Module enabled in Kyma spec

  1. LM deploys module manifest (CRD, deployment, roles, etc)
  2. LM creates default CR for module

Module template updated (module is already enabled)

  1. LM deploys module manifest (CRD, deployment, roles, etc) Note: module CR is never updated by LM, once created it is under control of user

Module deleted from Kyma spec

  1. LM deletes CR and waits for deletion
  2. LM deletes module manifest

Module enabled in Kyma spec while module CR in the deleting state

  1. LM deploys module manifest (deployment, roles, CRD) (it should be already there, so it is more an update operation)
  2. LM creates default CR for module (but it can be done only when deleted, so it waits until CR is deleted)

Module template updated (while module is during deletion)

  1. LM deploys module manifest (CRD, deployment, roles, etc) Note: the same behavior like for regular update

@ruanxin If there is still something unclear please call me :)

janmedrek commented 11 months ago

Then my understanding of the deletion flow would be as follows:

  1. LM deletes CR (deletionTimestamp is created)
  2. LM waits until Module CR is gone
  3. LM drops all the Module-related resources from the cluster

In the meantime, all the resources listed in the Module Manifest are kept up-to-date with the current template.

That would mean update logic should not be tied to the deletion process, and even if the Module status is set as "Deleting" the updates should continue (with the exclusion of the Module CR).

The problematic part would be - what will happen if there are any updates to the CRD? I guess in such cases (like a version change) it would need a manual action.

kyma-bot commented 9 months ago

This issue or PR has been automatically marked as stale due to the lack of recent activity. Thank you for your contributions.

This bot triages issues and PRs according to the following rules:

You can:

If you think that I work incorrectly, kindly raise an issue with the problem.

/lifecycle stale

kyma-bot commented 9 months ago

This issue or PR has been automatically closed due to the lack of activity. Thank you for your contributions.

This bot triages issues and PRs according to the following rules:

You can:

If you think that I work incorrectly, kindly raise an issue with the problem.

/close

kyma-bot commented 9 months ago

@kyma-bot: Closing this issue.

In response to [this](https://github.com/kyma-project/lifecycle-manager/issues/833#issuecomment-1820732643): >This issue or PR has been automatically closed due to the lack of activity. >Thank you for your contributions. > >This bot triages issues and PRs according to the following rules: >- After 60d of inactivity, `lifecycle/stale` is applied >- After 7d of inactivity since `lifecycle/stale` was applied, the issue is closed > >You can: >- Reopen this issue or PR with `/reopen` >- Mark this issue or PR as fresh with `/remove-lifecycle stale` > >If you think that I work incorrectly, kindly [raise an issue](https://github.com/kyma-project/test-infra/issues/new/choose) with the problem. > >/close Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.