operator-framework / enhancements

Apache License 2.0
9 stars 40 forks source link

Add "Upgrading" status to the condition CR #49

Closed nunnatsa closed 3 years ago

nunnatsa commented 4 years ago

A pod readiness probe is used in Openshift to communicate the readiness of the service, but in OLM it also means "upgrade is done" for operators. This is sometimes creates conflicts when the upgrade is a very long process, longer than the readiness probe timeout. This conflicts may cause OLM to kill and restart the operator pod, in some cases.

PR #42 introduces the condition CR - a way for the operator to communicate with OLM. This PR suggests a new condtion, Upgrading, to try to solve the above conflict, by moving the "upgrade is done" detection from the readiness probe to the new conditon.

ecordell commented 3 years ago

Hi @nunnatsa -

The issue you're describing is what the enhancement intends to address.

  • Issue: How can OLM distinct between an upgrade process that was not started yet, and one that was already completed.

The way that this is addressed in the original enhancement is by setting Upgradeable=false rather than Upgrading=true.

In the example here, you would simply set:

  apiVersion: operators.coreos.com/v1
  kind: Condition
  metadata:
    name: foo-operator-zdgen-asdf 
    namespace: operators
  status:
    conditions:
    - type: "Upgradeable" # The name of the `foo` OLM Supported Condition.
      status: "False"   # The operator is now performing upgrade.
      reason: "migration"
      message: "The operator is performing a migration to version v2.3.4"
      lastTransitionTime: "2020-08-24T23:15:55Z"

This tells OLM it that should not upgrade the operator itself (upgradeable=false)n because it is currently performing an upgrade.

nunnatsa commented 3 years ago

Thanks @ecordell. I still not sure it's working. When Upgradeable condition is True and the Upgrading is False, you can't tell if the operator didn't start the upgrade process yet, or already completed it.

ecordell commented 3 years ago

It may help to think of Upgradeable=False as Interruptible=False.

In other words, OLM does not care what the operator is actually doing (in this case, it sounds like upgrading some operands). It only cares that it should not move on to the next update for the operator, because that would interrupt some important work the current one is doing.

If an operator typically has migrations to perform, it should likely ship in the Upgradeable=False state, and only flip to true once it's had a chance to start up and check that everything is properly up to date.

Does that make sense?

nunnatsa commented 3 years ago

Not yet. OLM, regarless to this enhacement and the one in #42, checks the readiness probe to find out if the upgrade is done.

This design causes some issues when the upgrade takes too long.

After the introducing of the condition CR, I want to use it in order to signal "upgrade done" to OLM instead of the current usage in the readiness probe.

But this alone is not enough, as described above.

ecordell commented 3 years ago

checks the readiness probe to find out if the upgrade is done.

The conditions described in #42 are designed to replace the readiness probe check. Operators using that feature will not be upgraded based on their readiness probe status

nunnatsa commented 3 years ago

The conditions described in #42 are designed to replace the readiness probe check. Operators using that feature will not be upgraded based on their readiness probe status

Thanks @ecordell - please add this to the document. I'm closing this PR.