opendatahub-io / opendatahub-operator

Open Data Hub operator to manage ODH component integrations
https://opendatahub.io
Apache License 2.0
59 stars 127 forks source link

RHOAIENG-2974: Restrict DSCI deletion before DSC #1094

Closed Sara4994 closed 2 months ago

Sara4994 commented 2 months ago

Description

This PR fixes a race condition that happens while uninstalling the odh-operator. The issue occurs while uninstalling, at times DSCI instance gets removed first and DSC instance hangs over with error status expecting DSCI instance to be created. Manual deletion also does not remove the hanging DSC instance. This PR address the issue by blocking the deletion of DSCI before DSC through webhooks.

JIRA issue: https://issues.redhat.com/browse/RHOAIENG-2974

How Has This Been Tested?

Issue was reported as a race condition. To reproduce the race condition manually, followed the below steps

Screenshot or short clip

Screenshot 2024-07-03 at 1 31 21 AM

Merge criteria

Sara4994 commented 2 months ago

/retest

zdtsw commented 2 months ago

there are some failure in "linter" @Sara4994 i will approve once that is fixed.

Sara4994 commented 2 months ago

there are some failure in "linter" @Sara4994 i will approve once that is fixed.

@zdtsw fixed now :)

openshift-ci[bot] commented 2 months ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: zdtsw

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files: - ~~[OWNERS](https://github.com/opendatahub-io/opendatahub-operator/blob/incubation/OWNERS)~~ [zdtsw] Approvers can indicate their approval by writing `/approve` in a comment Approvers can cancel approval by writing `/approve cancel` in a comment
bartoszmajsak commented 2 months ago

I might be missing something fundamental, but what I instantly thought when seeing this PR is - why not cascade deletion using owner ref and finalizers? DSCI would be an owner of DSC so deleting the former (DSCI) would result in always deleting the latter (DSC). With the approach brought by this PR, we force the user to trigger those two deletions in order manually.

zdtsw commented 2 months ago

I think one reason to do this is: if user mistakenly deleted DSCI (as origin issue described) this cascading would automatically delete DSC as well which in turn all components' resource etc. the effect of this ^ would similar to uninstall operator. do we want this? if so, we need someone document this change to the user. A missing/broken DSCI can somehow mean operator is nonfunctional, but still enabled components can run as-is.

adelton commented 1 month ago

Conceptually, what is the reason for DSC and DSCI being separate entities?

If deleting the DSCI CR produces a broken cluster, aren't the two actually linked so tightly together that it makes sense to treat them as very tightly connected? If the user/customer has the Operator nonfunctional but the components (seem to) work, is that a supported configuration?

And yes, I understand that implications of this would need to be figured out / investigated. So the webhook approach might be a reasonable stop gap.

zdtsw commented 1 month ago

Conceptually, what is the reason for DSC and DSCI being separate entities?

If deleting the DSCI CR produces a broken cluster, aren't the two actually linked so tightly together that it makes sense to treat them as very tightly connected? If the user/customer has the Operator nonfunctional but the components (seem to) work, is that a supported configuration?

And yes, I understand that implications of this would need to be figured out / investigated. So the webhook approach might be a reasonable stop gap.

The very old idea (i was not involved) is to have one DSCI for one cluster, but we could have multiple DSC:s in one cluster (I can only guess it is prep for the multi-tenant case) . But later on, this had been changed which makes it only supports one DSC and one DSCI in one cluster. I do not know if any plan to continue with multiple DSC in the (near) future or we continue with 1:1:1 map, but yes, use webhook is only for "DSC deletion not hanging if no DSCI exist" case , till we re-define the relation between DSC and DSCI, e.g using Ownerreference or not. cc @VaishnaviHire / @etirelli