Open thetechnick opened 3 years ago
Thanks for the submission and detailed writeup @thetechnick
> I would like to propose to have OLM discover and read status conditions from those CRDs that are already in use by operators.
This was part of the original enhancement and was dropped only due to time constraints. Perhaps @awgreene has the older versions of that enhancement handy, so that we could reinstate those parts as future work (perhaps as a separate enhancement?)
Hey @ecordell, if you can point me towards the existing groundwork, I would be happy to contribute to this feature.
I was thinking a bit more about locking flows, because I would like to have an Upgrade lock that I can rely on.
EDIT (head hurts from thinking about distributed locking for too long): to boil it down as much as possible, I can think of 2 paths:
Just doing a brain-dump here for future reference:
```yaml
kind: OperatorCondition
metadata:
  generation: 4
spec:
  conditions:
  # reported by the operator
  # in .spec so updates increment .metadata.generation
  - type: Upgradeable
    status: "True"
status:
  conditions:
  # OLM acknowledged the upgrade lock
  - type: Acknowledged
    status: "True"
    observedGeneration: 4
  # outdated upgrade lock
  - type: Acknowledged
    status: "True"
    observedGeneration: 3
  # denied upgrade lock
  - type: Acknowledged
    status: "False"
    reason: UpgradeInProgress
    message: an upgrade process for the operator has already started
    observedGeneration: 4
```
Pro:

- builds on the existing `OperatorCondition` object that OLM already watches

Con:

- conditions have to be mirrored into a second object, which makes synchronization between the operator's own CRD and the lock object harder (see Consistency below)
```yaml
# example singleton Operator CRD kind
kind: HyperConverged
metadata:
  generation: 5
  annotations:
    # OLM updates this annotation to ask
    # whether it's safe to upgrade.
    # Note: annotation updates do not bump
    # .metadata.generation (only spec changes do),
    # so the value is an integer counter of its own,
    # just to trigger a reconcile/condition update.
    #
    # OLM needs to wait for .status.observedGeneration or
    # .status.conditions[i].observedGeneration to sync up.
    operator-sdk.io/can-upgrade: "5"
spec: {}
status:
  conditions:
  # operator ack-ed that upgrading is OK
  - type: Upgradeable
    status: "True"
    observedGeneration: 5
  # outdated upgrade OK
  - type: Upgradeable
    status: "True"
    observedGeneration: 3
  # upgrade not OK right now
  - type: Upgradeable
    status: "False"
    reason: BlockingVMMigration
    message: coordinating fleet VM migration, please wait...
    observedGeneration: 5
```
Pro:

- the conditions live on the object the operator already reports status on, so the `generation`/`observedGeneration` handshake comes for free

Con:

- OLM has to discover and watch arbitrary operator CRDs instead of a single well-known type
## Problem Description

Informing OLM about the status of a deployed operator is very important to ensure complex migrations and procedures are not interrupted. While interruptions/operator pod restarts should not lead to issues, upgrading an operator during a complex migration could lead to inconsistencies when the operator cannot pick up the migration where it left off.
At the time of writing, OLM already allows operators to report status information via the `OperatorCondition` CRD: https://olm.operatorframework.io/docs/advanced-tasks/communicating-operator-conditions-to-olm/

Context: I am submitting this issue for consideration as a developer on the Hyperconverged Cluster Operator (HCO from now on). Before joining this team, though, I developed multiple operators that ran into the same issues.
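For reference, reporting a condition through that mechanism looks roughly like this (a sketch based on the linked documentation; the name and namespace are placeholders, and the exact `apiVersion` depends on the OLM version in use):

```yaml
apiVersion: operators.coreos.com/v2
kind: OperatorCondition
metadata:
  name: my-operator        # placeholder: matches the CSV name
  namespace: operators     # placeholder namespace
spec:
  conditions:
  - type: Upgradeable
    status: "False"
    reason: MigrationInProgress
    message: the operator is performing a data migration
    lastTransitionTime: "2022-01-01T00:00:00Z"
```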
### Consistency

A common pattern, especially for cluster-wide operators, is the use of a singleton CRD instance to hold the configuration of the operator itself. This is also the place where information about the operator's status (conditions, events, etc.) is recorded and exposed to users. In some cases operators require this status to be consulted before anyone (human or otherwise) upgrades the operator itself. [citation needed]

When those operators don't support managing multiple `Thing` instances on the cluster, they may merge the configuration of the Thing they manage as a cluster-wide deployment into their operator configuration. This is also the case for HCO.

Now to the core issue:
Important here is `generation`/`observedGeneration`: as soon as "someone" makes a spec change, the condition is invalidated until the controller has reported on the new state.

Why this is so convenient: without doing any complex gymnastics, I as a developer can be sure that, until my code confirms that the new spec is (or is not) changing the conditions, nobody should act on outdated information.
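To illustrate (a reconstruction, since the original inline example did not survive; `CoolOperator` is the CRD name referenced under Proposed Solution, the group/version is a placeholder):

```yaml
apiVersion: example.com/v1   # placeholder group/version
kind: CoolOperator
metadata:
  name: cluster
  generation: 2              # bumped by the spec change below
spec:
  someSetting: new-value
status:
  conditions:
  # still reflects generation 1, so this condition must be
  # treated as outdated until the controller reports
  # observedGeneration: 2
  - type: Upgradeable
    status: "True"
    observedGeneration: 1
```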
Mirroring this status information into another object makes synchronization much more difficult and introduces a few pitfalls:

- the operator has to keep the `Upgradeable` condition in sync across two objects, introducing a huge complexity burden on the operator.
- writes to `OperatorCondition` carry no `generation`/`observedGeneration` handshake, so as a developer I can't be sure that the "lock" is committed, making relying on it flaky.

*[citation needed] Searching for an example, but the runbook of the operator I have in mind is not public. :)
### Deployment complexity

We want upstream operators to still work on plain Kubernetes, so we have to support three different deployment models. This makes it even harder to upstream and test changes.
### Double writes to kube-apiserver/etcd

We can't get rid of our own CRD, as it holds our configuration and we need it for upstream deployments. That means we have to update two objects for every condition change.
## Proposed Solution

I would like to propose that OLM discover and read status conditions from the CRDs that are already in use by operators (the `CoolOperator` CRD in the example under Consistency).

Concretely, I would like to be able to tell OLM the name and type of a CRD instance on which to check the status conditions, as an alternative to reporting them on the `OperatorCondition` object.

In the case of HCO, `operatorframework.io/initialization-resource` is even used to create that instance in the first place.

Thanks for reading! Looking forward to discussing this in more detail.
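Purely as an illustration of what "telling OLM" could look like, here is a hypothetical CSV annotation modeled after the existing `operatorframework.io/initialization-resource` annotation. The annotation key and value format are invented for this sketch and are not an existing OLM API; the HyperConverged identifiers are only used as an example target:

```yaml
apiVersion: operators.coreos.com/v1alpha1
kind: ClusterServiceVersion
metadata:
  name: hco-operator.v1.6.0          # placeholder version
  annotations:
    # hypothetical annotation: points OLM at the CR whose
    # .status.conditions should be read instead of the
    # OperatorCondition object
    operatorframework.io/condition-resource: |-
      {
        "apiVersion": "hco.kubevirt.io/v1beta1",
        "kind": "HyperConverged",
        "metadata": { "name": "kubevirt-hyperconverged" }
      }
```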