Operator Status <Upgrade Readiness> Enhancement

mhrivnak commented 4 years ago

If an operator is in the middle of performing this critical work and dies for some reason (bug, OOM, hardware failure, etc), it's important that it be able to resume that work as soon as it can re-start. We wouldn't want to upgrade the operator in such a scenario, when OLM is not able to reach the proposed endpoint. Would you agree? If so, it's worth capturing that behavior in this proposal: OLM will not initiate an upgrade without positive confirmation that the operator is ready for it.

But then we need an escape hatch. There will be scenarios where either the operator can't function correctly until it gets updated (so the endpoint is down), or there's a bug in its implementation of this endpoint, so maybe it's always reporting not-ready-for-upgrade. There needs to be a way to force the update even when the operator either can't be reached or says it can't be updated. Any thoughts on what that mechanism would be?

awgreene commented 4 years ago

Hello @mhrivnak,

Thanks for the detailed review!

If an operator is in the middle of performing this critical work and dies for some reason (bug, OOM, hardware failure, etc), it's important that it be able to resume that work as soon as it can re-start. We wouldn't want to upgrade the operator in such a scenario, when OLM is not able to reach the proposed endpoint. Would you agree? If so, it's worth capturing that behavior in this proposal: OLM will not initiate an upgrade without positive confirmation that the operator is ready for it.

Agreed - I believe if an operator defines an upgrade readiness endpoint OLM should not upgrade an operator when it is unable to connect to the endpoint. I'll update the enhancement.

But then we need an escape hatch. There will be scenarios where either the operator can't function correctly until it gets updated (so the endpoint is down), or there's a bug in its implementation of this endpoint, so maybe it's always reporting not-ready-for-upgrade. There needs to be a way to force the update even when the operator either can't be reached or says it can't be updated. Any thoughts on what that mechanism would be?

There is a manual process where the user can delete the subscription/csv to uninstall the operator and reinstall the operator at the desired version.

awgreene commented 4 years ago

If an operator is in the middle of performing this critical work and dies for some reason (bug, OOM, hardware failure, etc), it's important that it be able to resume that work as soon as it can re-start. We wouldn't want to upgrade the operator in such a scenario, when OLM is not able to reach the proposed endpoint. Would you agree? If so, it's worth capturing that behavior in this proposal: OLM will not initiate an upgrade without positive confirmation that the operator is ready for it.

@mhrivnak I am also going to make sure that we call out that the Upgrade Readiness Check is opt-in, meaning OLM will default to existing in the absence of an upgrade readiness endpoint, in the interest of support backwards compatibility.

exdx commented 4 years ago

I feel like at heart there is a more general issue - how does a complex component, like an operator, indicate that its ready to be upgraded (by OLM, or any other cluster management component). I feel like we can think about tackling this as both an upstream and downstream challenge - our solution can be a standard for different use cases.

My preference is to start simple - just a targeted annotation on the operator that indicates operator readiness. Operator authors can opt into using it by simply manipulating the annotation - adding one if the upgrade is NOT safe for some reason. OLM checks for the existence of this annotation before upgrading, otherwise proceeding as usual. I'm not sure if OLM does any existing introspection on operators it installs (that may require some design). But its a solution that would be easy to understand by members of the community and easy to implement for operator authors.

awgreene commented 4 years ago

Thanks for the review @exdx

My preference is to start simple - just a targeted annotation on the operator that indicates operator readiness. Operator authors can opt into using it by simply manipulating the annotation - adding one if the upgrade is NOT safe for some reason.

Where is this annotation applied? It would need to be some top level resource representing the installation of the opreator, likely the CSV or the Operator API that should be available soon. In either approach, we are specifically granting operators the opportunity to interact with OLM owned resources. This could introduce unintended consequences, imagine an operator with permissions to edit a CSV escalates it's RBAC privs.

But its a solution that would be easy to understand by members of the community and easy to implement for operator authors.

While simple, this is not a well defined standard. Wouldn't it be easier for users if we re-implemented the standards defined by Readiness Probes?

exdx commented 4 years ago

Thanks for the review @exdx

My preference is to start simple - just a targeted annotation on the operator that indicates operator readiness. Operator authors can opt into using it by simply manipulating the annotation - adding one if the upgrade is NOT safe for some reason.

Where is this annotation applied? It would need to be some top level resource representing the installation of the opreator, likely the CSV or the Operator API that should be available soon. In either approach, we are specifically granting operators the opportunity to interact with OLM owned resources. This could introduce unintended consequences, imagine an operator with permissions to edit a CSV escalates it's RBAC privs.

But its a solution that would be easy to understand by members of the community and easy to implement for operator authors.

While simple, this is not a well defined standard. Wouldn't it be easier for users if we re-implemented the standards defined by Readiness Probes?

My idea was to have the annotation on the operator pod itself - this is something we actually did when we wrote a real operator at my last job. Using annotations to indicate that something was ready to change, be torn down, or be updated. This approach would require OLM to discover the annotations on the operator pods that it manages before attempting an upgrade - which wouldn't be too bad since we know the name of the operator from the CSV and can just query the pod directly. I agree that asking the operator author to know about and modify the CSV or operator object is too specific and not a good idea.

The probe idea sounds good in principle but what does it entail - kubelet performs probes, not OLM. A probe describes a health check to be performed against a container to determine whether it is alive or ready to receive traffic. I don't see how we can use them for operator upgrades

awgreene commented 4 years ago

Updated to capture some more recent thoughts - still needs to incorporate all the great feedback.

kevinrizza commented 4 years ago

/lgtm

exdx commented 4 years ago

LGTM, good job listing out all the alternative approaches in detail. /lgtm

awgreene commented 4 years ago

I am worried that while attempting to make this generically usable for all types of operators, whether they have knowledge of OLM or not, we are making more work for the ones that are aware of OLM and want to integrate.

I do not agree that we are creating any additional work for Operator Authors to integrate with this feature in OLM. If OLM introduced a CRD that it owned an operator would need to:

Identify at run time if it is being managed by OLM.
Write conditions to the CRD that OLM introduced.

If an operator is able to write to the status of a CRD that they own, they would just need to:

write conditions to the status of the CRD they introduced. This could be an existing CRD or a net new one that they just write operator status to.

I think that @ecordell or @kevinrizza or someone else at some point mentioned combining this current proposal with the first alternative to provide that experience. I would like to explore how much more work that would be, and if we can mitigate the downside of reporting status in two places.

Could you define the downsides? Why are operators required to write status in two places if they are just writing to a CRD that they introduce to communicate status?

@shawn-hurley

ecordell commented 4 years ago

/lgtm

openshift-ci-robot commented 4 years ago

New changes are detected. LGTM label has been removed.

awgreene commented 4 years ago

Closing this in favor of #42 with the expectation that we will eventually revisit the discussion on how to provide operators with the means to communicate complex states to OLM without writing to an OLM CRD directly.

operator-framework / enhancements

Operator Status <Upgrade Readiness> Enhancement #23