Closed Sara4994 closed 2 months ago
/retest
there are some failure in "linter" @Sara4994 i will approve once that is fixed.
there are some failure in "linter" @Sara4994 i will approve once that is fixed.
@zdtsw fixed now :)
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: zdtsw
The full list of commands accepted by this bot can be found here.
The pull request process is described here
I might be missing something fundamental, but what I instantly thought when seeing this PR is - why not cascade deletion using owner ref and finalizers? DSCI would be an owner of DSC so deleting the former (DSCI) would result in always deleting the latter (DSC). With the approach brought by this PR, we force the user to trigger those two deletions in order manually.
I think one reason to do this is: if user mistakenly deleted DSCI (as origin issue described) this cascading would automatically delete DSC as well which in turn all components' resource etc. the effect of this ^ would similar to uninstall operator. do we want this? if so, we need someone document this change to the user. A missing/broken DSCI can somehow mean operator is nonfunctional, but still enabled components can run as-is.
Conceptually, what is the reason for DSC and DSCI being separate entities?
If deleting the DSCI CR produces a broken cluster, aren't the two actually linked so tightly together that it makes sense to treat them as very tightly connected? If the user/customer has the Operator nonfunctional but the components (seem to) work, is that a supported configuration?
And yes, I understand that implications of this would need to be figured out / investigated. So the webhook approach might be a reasonable stop gap.
Conceptually, what is the reason for DSC and DSCI being separate entities?
If deleting the DSCI CR produces a broken cluster, aren't the two actually linked so tightly together that it makes sense to treat them as very tightly connected? If the user/customer has the Operator nonfunctional but the components (seem to) work, is that a supported configuration?
And yes, I understand that implications of this would need to be figured out / investigated. So the webhook approach might be a reasonable stop gap.
The very old idea (i was not involved) is to have one DSCI for one cluster, but we could have multiple DSC:s in one cluster (I can only guess it is prep for the multi-tenant case) . But later on, this had been changed which makes it only supports one DSC and one DSCI in one cluster. I do not know if any plan to continue with multiple DSC in the (near) future or we continue with 1:1:1 map, but yes, use webhook is only for "DSC deletion not hanging if no DSCI exist" case , till we re-define the relation between DSC and DSCI, e.g using Ownerreference or not. cc @VaishnaviHire / @etirelli
Description
This PR fixes a race condition that happens while uninstalling the odh-operator. The issue occurs while uninstalling, at times DSCI instance gets removed first and DSC instance hangs over with error status expecting DSCI instance to be created. Manual deletion also does not remove the hanging DSC instance. This PR address the issue by blocking the deletion of DSCI before DSC through webhooks.
JIRA issue: https://issues.redhat.com/browse/RHOAIENG-2974
How Has This Been Tested?
Issue was reported as a race condition. To reproduce the race condition manually, followed the below steps
Screenshot or short clip
Merge criteria