Closed hasbro17 closed 3 years ago
@dmesser Going to update the proposal with a section to indicate the cleanup progress in the Subscription status (https://github.com/operator-framework/enhancements/pull/46#discussion_r483582732).
Need some more clarification on the UX of the timeout option (https://github.com/operator-framework/enhancements/pull/46#discussion_r483997369) and introducing a potential cleanup option in the CSV as well (https://github.com/operator-framework/enhancements/pull/46#discussion_r483999064).
Per the above feedback I'm going to update the proposal to not make the default behavior be operator removal after a timeout on cleanup. The requirement for timeouts was coming from a dubious acceptance criteria (thanks for pointing that out @mhrivnak). That should help address the issues with timeouts.
And also make the cleanup option a knob in the CSV so it works similar to how subscription config lets you override default CSV options.
Rewrote a fair bit of the proposal to make the following changes per the previous comments.
Added the following updates:
Making updates for the latest set of reviews but currently blocked on a resolution for https://github.com/operator-framework/enhancements/pull/46#discussion_r521080770
Since we have consensus on ensuring the operand cleanup behavior is always off by default (https://github.com/operator-framework/enhancements/pull/46#discussion_r521080770), I'll be adding another section to mention this requirement and outline some ways in which this might be done (overwrite the CSV config when writing to the InstallPlan, overwrite when CSV is in initial phase).
Updates for the latest round of reviews.
UX changes:
spec.cleanup.supported
field that must be set to convey an operator’s capability to cleanup before the user can opt-in. A spec field seemed more appropriate for this than an annotation which is for metadata.Cleanup
, although we can do without this if a new phase introduces complications. Can be done later.Wording/content changes for clarifications:
operatorframework.io/operand-cleanup
Owned APIServices
to be Owned CRDs
.
Owned APIServices
as a possible future goal since that wasn’t a goal of this proposal./approve
/hold
Based on concerns raised by @dmesser there is still some disagreement over the latest change that adds an extra CSV field (spec.cleanup.supported
) to make the opt-in process a 2 phase step. The concern being that this makes the cleanup feature harder to use and hence less likely to be utilized for existing operators that don't really need or support finalizers.
Waiting until we can resolve this or get consensus on being okay with that trade off. See: https://github.com/operator-framework/enhancements/pull/46#discussion_r525248931
Updated the proposal with another subsection that outlines how a user can override and force cleanup to happen when the operator does not explicitly support it. The idea being that the API should enforce a secondary confirmation step from the user in the event of an override.
This follows up from the discussion on not having OLM automatically cleanup if the operator doesn't support it. See: https://github.com/operator-framework/enhancements/pull/46/#discussion_r542953772
If the workflow seems fine, I'm open to suggestions on the renaming the API spec.cleanup.confirmationKey
.
PTAL @kevinrizza @bparees @ecordell
/unhold
Will be updating to make the confirmation workflow the default and not only just applicable when the user is trying to override cleanup.
Will be updating to make the confirmation workflow the default and not only just applicable when the user is trying to override cleanup.
if you do that, how will a user understand whether the operator actually supports cleanup or not? Making the confirmation key always required is fine. But i think we also need an explicit acknowledgement from the user that i understand this operator doesn't claim to support cleanup and therefore may leave resources around even when i ask for cleanup on uninstall, but i am asking for it anyway
@bparees From @kevinrizza's initial feedback it seemed that always requiring a "copy and write" confirmation was a simpler workflow than having 2 different workflows based on whether the operator actually supports cleanup or not. Also even if an operator says it has cleanup capability, OLM can't guarantee that everything will get cleaned up. So always requiring the user do a "copy and write" step to enable cleanup should ensure that the user always has to drive the cleanup action (and presumably pause and think about what they expect the operator to do).
As for conveying the operator's stated cleanup capability, we can still convey that information somewhere on the CSV. Currently it's the spec.cleanup.supported: true/false
field which the CSV author was expected to set.
However with the above amendment, it has no bearing on the cleanup workflow so I imagine this could be better conveyed as an annotation instead of a spec field.
So something like:
kind: CSV
metadata:
annotations:
cleanupSupported: <"true"/"false"/"unknown">
spec:
cleanup:
# Step 1: Set before CSV deletion to opt-in to cleanup
enabled: true
# Step 2: Copy and write status key to confirm cleanup
confirmationKey: <copy and write from status key>
# - CSV deletion and cleanup will be blocked until key matches status key.
# - Or user opts out of cleanup before key confirmation.
status:
# Generated when enabled=true
confirmationKey: zAQHP_EW2ct8bFpT7JMw
Alternatively we could also remove spec.cleanup.enabled
and just make "copy and write key" as the sole step which simplifies the above workflow by making the opt-in and the confirmation the same.
E.g:
kind: CSV
metadata:
annotations:
cleanupSupported: <"true"/"false"/"unknown">
spec:
# Step 1: Set before CSV deletion to opt-in to cleanup
cleanupKey: <copy and write from status key>
# - No cleanup if unset i.e cleanupKey=""
# - CSV deletion and cleanup blocked until key matches.
status:
# Always generated
cleanupKey: zAQHP_EW2ct8bFpT7JMw
Although that's not really a 2 step confirmation workflow. So I could use some feedback on whether we want the former or latter.
The two-step confirmation requirement is unrelated to expectations about what the operator will/won't cleanup, so i view that as entirely orthogonal to the discussion. I agree that it's a sensible precaution against loss of data any time the user requests cleanup(regardless of what the operator supports).
And while certainly an operator can claim to do cleanup, but then fail at it or not do it well, there is still a big difference between an operator that supposedly handles cleanup but doesn't do it right(bug!), and one that does not claim to do so(not a bug)
I think it is a mistake to allow someone to set "cleanup=true" on an operator that doesn't declare it implements cleanup, without the user explicitly acknowledging this fact (assuming that just because they're in the CSV editing, they will notice the absence of an annotation, doesn't seem sufficient to me).
At a minimum i will expect the UI to warn the user that the operator does not explicitly support cleanup, if the user requests cleanup during uninstall, even if that doesn't translate into an api field that needs to be set. I'm willing to cede a little ground to someone editing the CSV directly because presumably they read the api docs and the api docs better explain that "while you can set this to true, the operator may or may not implement it correctly, check the supports.cleanup annotation on the csv to confirm whether it supports cleanup")
Agreed on the UI surfacing both "opt into cleanup" and "override/force cleanup even though the operator doesn't support it".
I think it is a mistake to allow someone to set "cleanup=true" on an operator that doesn't declare it implements cleanup, without the user explicitly acknowledging this fact (assuming that just because they're in the CSV editing, they will notice the absence of an annotation, doesn't seem sufficient to me)
That sounds reasonable enough, which means we want "explicit user confirmation when the operator does not support cleanup". If you want to bake this behavior into the API then the choice of the 2 step confirmation does becomes relevant IMO.
Since the current proposal has exactly this outlined, (i.e copy-and-write the key only if cleanup is unsupported) I'll hold off my proposed changes for making the confirmation workflow the default and circle back on this with @kevinrizza
That sounds reasonable enough, which means we want "explicit user confirmation when the operator does not support cleanup". If you want to bake this behavior into the API then the choice of the 2 step confirmation does becomes relevant IMO.
what i'm suggesting is that:
1) to request cleanup, you must set cleanup=true 2) to request cleanup from an operator that doesn't support it, you must also set forcecleanup=true 3) when uninstalling an operator that has cleanup requested correctly(whether explicitly supported or not), you can then set a confirmation key in status that they must set in spec.
so (3) is orthogonal to whether the operator explicitly supports it or not.
Rewrote the API section to address the following suggested changes:
spec.cleanup.startKey
spec.cleanup.force
field that is required to override and force cleanup when the operator does not support it spec.cleanup.supported: false
Open to naming suggestions on the API fields for better clarity e.g spec.cleanup.enableKey
.
Per our discussions outside this thread I've reverted the CSV API back to the simpler toggle spec.cleanup.enabled
for opt-ins to cleanup.
Indicating operator cleanup capabilities to distinguish between supported cleanup, and override/force when unsupported, are no longer in scope for the proposal.
/approve
@hasbro17 in talking with @spadgett we realized this EP doesn't cover what happens when cluster-scoped CRDs are present? I assume all cluster-scoped CRs will be removed during uninstall if operand cleanup is requested.
@bparees Yeah cluster-scoped CRs would imply a single cluster-scoped operator that's reconciling them. I'm not sure if there's a valid setting where you could have two operators reconciling exclusive sets of cluster scoped CRs (by dividing them up on labels/annotations).
So yeah I would say it's safe to say if the CRD is cluster-scoped, then on uninstall + cleanup OLM can safely remove all cluster-scoped CRs of that type. If that makes sense, I can make a PR to outline this case as an amendment to the proposal.
it makes sense to me, though there are definitely use cases (as we discussed before i think) where this gets tricky because a CRD is indeed "shared". E.g. something like ingress where every ingress operator implementation might define the same cluster CRD, but only watch ones w/ specific annotations.
If any one of those ingress operators gets uninstalled and the admin says "clean up operands", the result would be all ingress being deleted across the cluster for all ingress types, which would be very bad.
I'm not sure what we do about that short of warning admins to be very careful with this option, though.
Enhancement proposal for OLM handling cleanup of operands on operator uninstall.