operator-framework / enhancements

Apache License 2.0
9 stars 40 forks source link

OLM deletes operands enhancement #46

Closed hasbro17 closed 3 years ago

hasbro17 commented 4 years ago

Enhancement proposal for OLM handling cleanup of operands on operator uninstall.

hasbro17 commented 4 years ago

@dmesser Going to update the proposal with a section to indicate the cleanup progress in the Subscription status (https://github.com/operator-framework/enhancements/pull/46#discussion_r483582732).

Need some more clarification on the UX of the timeout option (https://github.com/operator-framework/enhancements/pull/46#discussion_r483997369) and introducing a potential cleanup option in the CSV as well (https://github.com/operator-framework/enhancements/pull/46#discussion_r483999064).

hasbro17 commented 3 years ago

Per the above feedback I'm going to update the proposal to not make the default behavior be operator removal after a timeout on cleanup. The requirement for timeouts was coming from a dubious acceptance criteria (thanks for pointing that out @mhrivnak). That should help address the issues with timeouts.

And also make the cleanup option a knob in the CSV so it works similar to how subscription config lets you override default CSV options.

hasbro17 commented 3 years ago

Rewrote a fair bit of the proposal to make the following changes per the previous comments.

hasbro17 commented 3 years ago

Added the following updates:

hasbro17 commented 3 years ago

Making updates for the latest set of reviews but currently blocked on a resolution for https://github.com/operator-framework/enhancements/pull/46#discussion_r521080770

hasbro17 commented 3 years ago

Since we have consensus on ensuring the operand cleanup behavior is always off by default (https://github.com/operator-framework/enhancements/pull/46#discussion_r521080770), I'll be adding another section to mention this requirement and outline some ways in which this might be done (overwrite the CSV config when writing to the InstallPlan, overwrite when CSV is in initial phase).

hasbro17 commented 3 years ago

Updates for the latest round of reviews.

UX changes:

Wording/content changes for clarifications:

kevinrizza commented 3 years ago

/approve

hasbro17 commented 3 years ago

/hold

Based on concerns raised by @dmesser there is still some disagreement over the latest change that adds an extra CSV field (spec.cleanup.supported) to make the opt-in process a 2 phase step. The concern being that this makes the cleanup feature harder to use and hence less likely to be utilized for existing operators that don't really need or support finalizers.

Waiting until we can resolve this or get consensus on being okay with that trade off. See: https://github.com/operator-framework/enhancements/pull/46#discussion_r525248931

hasbro17 commented 3 years ago

Updated the proposal with another subsection that outlines how a user can override and force cleanup to happen when the operator does not explicitly support it. The idea being that the API should enforce a secondary confirmation step from the user in the event of an override.

This follows up from the discussion on not having OLM automatically cleanup if the operator doesn't support it. See: https://github.com/operator-framework/enhancements/pull/46/#discussion_r542953772

If the workflow seems fine, I'm open to suggestions on the renaming the API spec.cleanup.confirmationKey.

PTAL @kevinrizza @bparees @ecordell

hasbro17 commented 3 years ago

/unhold

hasbro17 commented 3 years ago

Will be updating to make the confirmation workflow the default and not only just applicable when the user is trying to override cleanup.

bparees commented 3 years ago

Will be updating to make the confirmation workflow the default and not only just applicable when the user is trying to override cleanup.

if you do that, how will a user understand whether the operator actually supports cleanup or not? Making the confirmation key always required is fine. But i think we also need an explicit acknowledgement from the user that i understand this operator doesn't claim to support cleanup and therefore may leave resources around even when i ask for cleanup on uninstall, but i am asking for it anyway

hasbro17 commented 3 years ago

@bparees From @kevinrizza's initial feedback it seemed that always requiring a "copy and write" confirmation was a simpler workflow than having 2 different workflows based on whether the operator actually supports cleanup or not. Also even if an operator says it has cleanup capability, OLM can't guarantee that everything will get cleaned up. So always requiring the user do a "copy and write" step to enable cleanup should ensure that the user always has to drive the cleanup action (and presumably pause and think about what they expect the operator to do).

As for conveying the operator's stated cleanup capability, we can still convey that information somewhere on the CSV. Currently it's the spec.cleanup.supported: true/false field which the CSV author was expected to set. However with the above amendment, it has no bearing on the cleanup workflow so I imagine this could be better conveyed as an annotation instead of a spec field.

So something like:

kind: CSV
metadata:
  annotations:
    cleanupSupported: <"true"/"false"/"unknown">
spec:
  cleanup:
    # Step 1: Set before CSV deletion to opt-in to cleanup
    enabled: true
    # Step 2: Copy and write status key to confirm cleanup
    confirmationKey: <copy and write from status key>
    # - CSV deletion and cleanup will be blocked until key matches status key. 
    # - Or user opts out of cleanup before key confirmation.
status:
  # Generated when enabled=true
  confirmationKey: zAQHP_EW2ct8bFpT7JMw 

Alternatively we could also remove spec.cleanup.enabled and just make "copy and write key" as the sole step which simplifies the above workflow by making the opt-in and the confirmation the same. E.g:

kind: CSV
metadata:
  annotations:
    cleanupSupported: <"true"/"false"/"unknown">
spec:
  # Step 1: Set before CSV deletion to opt-in to cleanup
  cleanupKey: <copy and write from status key>
  # - No cleanup if unset i.e cleanupKey=""
  # - CSV deletion and cleanup blocked until key matches.
status:
  # Always generated
  cleanupKey: zAQHP_EW2ct8bFpT7JMw 

Although that's not really a 2 step confirmation workflow. So I could use some feedback on whether we want the former or latter.

bparees commented 3 years ago

The two-step confirmation requirement is unrelated to expectations about what the operator will/won't cleanup, so i view that as entirely orthogonal to the discussion. I agree that it's a sensible precaution against loss of data any time the user requests cleanup(regardless of what the operator supports).

And while certainly an operator can claim to do cleanup, but then fail at it or not do it well, there is still a big difference between an operator that supposedly handles cleanup but doesn't do it right(bug!), and one that does not claim to do so(not a bug)

I think it is a mistake to allow someone to set "cleanup=true" on an operator that doesn't declare it implements cleanup, without the user explicitly acknowledging this fact (assuming that just because they're in the CSV editing, they will notice the absence of an annotation, doesn't seem sufficient to me).

At a minimum i will expect the UI to warn the user that the operator does not explicitly support cleanup, if the user requests cleanup during uninstall, even if that doesn't translate into an api field that needs to be set. I'm willing to cede a little ground to someone editing the CSV directly because presumably they read the api docs and the api docs better explain that "while you can set this to true, the operator may or may not implement it correctly, check the supports.cleanup annotation on the csv to confirm whether it supports cleanup")

hasbro17 commented 3 years ago

Agreed on the UI surfacing both "opt into cleanup" and "override/force cleanup even though the operator doesn't support it".

I think it is a mistake to allow someone to set "cleanup=true" on an operator that doesn't declare it implements cleanup, without the user explicitly acknowledging this fact (assuming that just because they're in the CSV editing, they will notice the absence of an annotation, doesn't seem sufficient to me)

That sounds reasonable enough, which means we want "explicit user confirmation when the operator does not support cleanup". If you want to bake this behavior into the API then the choice of the 2 step confirmation does becomes relevant IMO.

Since the current proposal has exactly this outlined, (i.e copy-and-write the key only if cleanup is unsupported) I'll hold off my proposed changes for making the confirmation workflow the default and circle back on this with @kevinrizza

bparees commented 3 years ago

That sounds reasonable enough, which means we want "explicit user confirmation when the operator does not support cleanup". If you want to bake this behavior into the API then the choice of the 2 step confirmation does becomes relevant IMO.

what i'm suggesting is that:

1) to request cleanup, you must set cleanup=true 2) to request cleanup from an operator that doesn't support it, you must also set forcecleanup=true 3) when uninstalling an operator that has cleanup requested correctly(whether explicitly supported or not), you can then set a confirmation key in status that they must set in spec.

so (3) is orthogonal to whether the operator explicitly supports it or not.

hasbro17 commented 3 years ago

Rewrote the API section to address the following suggested changes:

Open to naming suggestions on the API fields for better clarity e.g spec.cleanup.enableKey.

hasbro17 commented 3 years ago

Per our discussions outside this thread I've reverted the CSV API back to the simpler toggle spec.cleanup.enabled for opt-ins to cleanup. Indicating operator cleanup capabilities to distinguish between supported cleanup, and override/force when unsupported, are no longer in scope for the proposal.

kevinrizza commented 3 years ago

/approve

bparees commented 3 years ago

@hasbro17 in talking with @spadgett we realized this EP doesn't cover what happens when cluster-scoped CRDs are present? I assume all cluster-scoped CRs will be removed during uninstall if operand cleanup is requested.

hasbro17 commented 3 years ago

@bparees Yeah cluster-scoped CRs would imply a single cluster-scoped operator that's reconciling them. I'm not sure if there's a valid setting where you could have two operators reconciling exclusive sets of cluster scoped CRs (by dividing them up on labels/annotations).

So yeah I would say it's safe to say if the CRD is cluster-scoped, then on uninstall + cleanup OLM can safely remove all cluster-scoped CRs of that type. If that makes sense, I can make a PR to outline this case as an amendment to the proposal.

bparees commented 3 years ago

it makes sense to me, though there are definitely use cases (as we discussed before i think) where this gets tricky because a CRD is indeed "shared". E.g. something like ingress where every ingress operator implementation might define the same cluster CRD, but only watch ones w/ specific annotations.

If any one of those ingress operators gets uninstalled and the admin says "clean up operands", the result would be all ingress being deleted across the cluster for all ingress types, which would be very bad.

I'm not sure what we do about that short of warning admins to be very careful with this option, though.