romachalm opened this issue 4 years ago
@romachalm Interesting idea!
Just to clarify, does your Helm release actually contain a v1.Pod object, or is there some parent object (e.g. a Deployment or StatefulSet) in the release?
Off the top of my head, I think the helm operator would only be able to manage CR status based on fields in the directly owned resources, not in resources further down the ownership chain.
Also, using your example, what would you expect to happen if the release manifest contained more than one Pod?
As a Watch is already registered on the dependent resources, I guess the feasibility of the change depends on a status change being catchable by DependentPredicateFuncs. Does a status change trigger an UpdateEvent?
Right now, I think DependentPredicateFuncs ignores status updates of dependent objects, but we could probably update that to stop ignoring status updates when we're in the mode of supporting custom status fields.
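For illustration, here is a rough sketch (not the actual DependentPredicateFuncs code; the predicate name and structure are assumptions) of what "stop ignoring status updates" could look like with controller-runtime's predicate.Funcs, letting an update event through only when a dependent resource's .status actually changes:

package sketch

import (
	"reflect"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

// StatusChangedPredicate is a hypothetical predicate that triggers a
// reconciliation of the owner only when the dependent's .status block changed.
var StatusChangedPredicate = predicate.Funcs{
	UpdateFunc: func(e event.UpdateEvent) bool {
		oldObj, okOld := e.ObjectOld.(*unstructured.Unstructured)
		newObj, okNew := e.ObjectNew.(*unstructured.Unstructured)
		if !okOld || !okNew {
			return false
		}
		oldStatus, _, _ := unstructured.NestedMap(oldObj.Object, "status")
		newStatus, _, _ := unstructured.NestedMap(newObj.Object, "status")
		// Only enqueue the owner when the status actually changed.
		return !reflect.DeepEqual(oldStatus, newStatus)
	},
}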
@joelanford: you are right, my charts use parent objects and never a Pod directly (I just wanted to make the example easier to understand).
But we are on the same page: I only have in mind the objects whose owner is the CR managed by the helm-operator, in other words only the resources deployed directly by the Helm chart.
To give you a more pragmatic example: I deploy some knative services (ksvc) with the helm-operator, together with some other services of our own.
ksvc already aggregates the statuses of the underlying resources (routes, configuration, etc.). Getting the status of the ksvc would be configured in watches.yaml as:
- version: v1alpha2
  group: lp.ml.io
  kind: Service
  chart: helm-charts/lp-service
  returnedStatuses:
    - version: v1
      group: serving.knative.dev
      kind: Service
      name: ready
      status: ".status.conditions[?(@.type=='Ready')].status"
    - version: v1
      group: serving.knative.dev
      kind: Service
      name: reason
      status: ".status.conditions[?(@.type=='Ready')].reason"
If we have several resources... well, good question. I guess a list should always be returned, i.e. something like:
apiVersion: launchpad.magicleap.io/v1alpha2
kind: Route
metadata:
  [...]
spec:
  [...]
status:
  conditions:
    [...]
  deployedRelease:
    manifest: |
      [...]
    name: test-status
  status:
    ready:
      - True
      - False
    reason:
      - ""
      - RevisionMissing
Then, it is the responsibility of the client consuming this info to parse it correctly. Of course, in a custom-columns display it will show up as something raw like:
NAME          READY          REASON
test-status   [True False]   [ RevisionMissing]
I think this raw display is acceptable compared to the benefit of having the statuses displayed in the CR get command.
Yeah a list is one option, but that seems a bit unwieldy (e.g. especially if there's one deployment you really care about, and another that you don't). I wonder if there's a way we can somehow narrow the scope to select just a single resource.
If we try to add a resource name in the watches.yaml, the complication would be that the names of rendered resources are typically templated (they contain at least the release name), so they can't be known in advance when creating the watches.yaml file.
I agree, we cannot write a name in the watches.yaml. What about a map then? The user would be able to filter on the resource they care about by its name, and would still be able to display everything.
I think I have overthought this idea.
Let me come back to my original need: I'd like the helm operator to display the actual overall readiness of the dependent resources.
Today, when we run the helm operator, the status conditions reflect the successful deployment and reconciliation of the dependent resources. But that does not mean that the dependent resources are actually ready.
To better explain, let me take an example. Let's say that I have a helm chart deploying 2 knative services (ksvc), 2 deployments and 1 standard k8s service.
My ksvc has a specific condition Ready, which is true when the service can effectively serve:
$ k get ksvc ready-ksvc -o jsonpath="{.status.conditions[?(@.type=='Ready')].status}"
True%
$ k get ksvc non-ready-ksvc -o jsonpath="{.status.conditions[?(@.type=='Ready')].status}"
False%
My deployments do not have such a field, but they have a condition Available that compares the number of ready pods to the deployment definition:
$ k get deploy ready-pods -o jsonpath="{.status.conditions[?(@.type=='Available')].status}"
True%
$ k get deploy non-ready-pods -o jsonpath="{.status.conditions[?(@.type=='Available')].status}"
False%
My service has no condition allowing me to assess readiness, so this kind of resource is considered ready by default.
For the first two resource kinds, I'd like a way to make my CR understand that it can be considered ready when the 2 ksvc and the 2 deployments are ready, and not ready when any of them is not.
I have imagined that we could describe those rules in watches.yaml, such as:
- version: v1alpha2
  group: lp.ml.io
  kind: MyService
  chart: helm-charts/lp-service
  readiness:
    - version: v1alpha1
      group: serving.knative.dev
      kind: Service
      jsonPath: ".status.conditions[?(@.type=='Ready')].status"
      expectedValue: "True"
    - version: v1beta1
      group: extensions
      kind: Deployment
      jsonPath: ".status.conditions[?(@.type=='Available')].status"
      expectedValue: "True"
These readiness definitions can be compiled by the controller with a logical AND: for the CR to be Ready, all those field criteria must exist and their value must match expectedValue. If one of them is not met, my CR cannot be considered Ready and user analysis is required.
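To make the logical-AND explicit, here is a rough sketch (all names assumed): every readiness rule whose GVK matches a dependent resource must be satisfied, and a single failure makes the CR not Ready. meetsCriterion stands in for the JSONPath evaluation against expectedValue.

package sketch

import (
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

// readinessRule mirrors one hypothetical `readiness` entry from watches.yaml.
type readinessRule struct {
	GVK           schema.GroupVersionKind
	JSONPath      string
	ExpectedValue string
}

// crReady ANDs all rules over all dependent resources: every rule that applies
// to a resource's GVK must be met, otherwise the CR is not Ready.
// meetsCriterion is a hypothetical helper that evaluates JSONPath on the
// resource and compares the result to ExpectedValue.
func crReady(deps []*unstructured.Unstructured, rules []readinessRule,
	meetsCriterion func(obj *unstructured.Unstructured, jsonPath, expectedValue string) bool) bool {
	for _, dep := range deps {
		for _, rule := range rules {
			if dep.GroupVersionKind() != rule.GVK {
				continue
			}
			if !meetsCriterion(dep, rule.JSONPath, rule.ExpectedValue) {
				return false
			}
		}
	}
	return true
}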
Note: I think that the expectedValue field may not always be True. Some resource conditions are negative, i.e. False means OK, like for nodes:
status: "False"
type: MemoryPressure
In my example, the ksvc readiness will be [true, false], which compiles to false; the deployment readiness will be [true, false], which also compiles to false.
So my CR should be Ready: false.
Without explicitly mentioning what failed and why, the CR will at least help monitoring detect that something went wrong in at least one of the dependent resources. It is up to the user to analyse what failed and why (it can be that the image cannot be pulled, that the container port is wrong, or that the container fails to start...).
I think it can also simplify the development:
- By default, the ready condition of a CR is true (like for my standard service, which has no condition).
- DependentPredicateFuncs triggers a reconciliation when a condition identified in jsonPath is updated (UpdateEvent).
- This condition has to be compared to the expected value.
- Whenever a condition does not meet the expectedValue, the reconciliation (reconcile.go) updates the Ready condition to false.
Still uncertain on two topics:
- I don't see yet how the predicate (which seems to be the one able to retrieve a dependent resource status) can send this condition to the reconciliation (which can update the CR condition). Is an event embedded in a reconcile.Request as the source of the reconciliation?
- I guess that the readiness of a CR cannot be limited to true or false, and that there may be a third state such as Unknown. Should that be the status for resources that have no condition (like service)?

@romachalm This seems do-able and useful. Thanks for putting so much thought into this!
I don't see yet how the predicate (which seems to be the one able to retrieve a dependent resource status) can send this condition to the reconciliation (which can update the CR condition). Is an event embedded in a reconcile.Request as the source of the reconciliation?
The predicate runs after an event on a watched resource is received. It acts as a filter to determine whether or not to run the reconciliation. The actual event never makes it to the reconciler (this is on purpose, see this FAQ).
In this case, I think we would just relax the predicate to allow status change events to trigger a reconciliation, and then we would update the Helm reconciler to run the comparisons and update the status.
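For context, the wiring looks roughly like this with the controller-runtime v0.x-era API (as used by the helm-operator at the time of this discussion); the function and variable names here are assumptions. The event only passes through the predicate, and what gets enqueued is the owner's name/namespace:

package sketch

import (
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"sigs.k8s.io/controller-runtime/pkg/controller"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
	"sigs.k8s.io/controller-runtime/pkg/source"
)

// watchDependent registers a watch on a dependent resource type. Events are
// filtered by the predicate; only the owning CR's NamespacedName is enqueued,
// so the reconciler re-reads whatever it needs from the cluster.
func watchDependent(c controller.Controller, owner, dependent *unstructured.Unstructured, p predicate.Predicate) error {
	return c.Watch(
		&source.Kind{Type: dependent},
		&handler.EnqueueRequestForOwner{OwnerType: owner, IsController: true},
		p,
	)
}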
I guess that the readiness of a CR cannot be limited to true or false, and that there may be a third state such as Unknown. Should that be the status for resources that have no condition (like service)?
Hmm, in that case would it mean that all charts that deploy a service would always be ready: unknown?
Perhaps we could tweak the config to be something like:
readiness:
  - version: v1alpha1
    group: serving.knative.dev
    kind: Service
    trueJsonPath: $.status.conditions[?(@.type=="Ready" && @.status=="True")]
    falseJsonPath: $.status.conditions[?(@.type=="Ready" && @.status=="False")]
  - version: v1beta1
    group: extensions
    kind: Deployment
    trueJsonPath: $.status.conditions[?(@.type=="Available" && @.status=="True")]
    falseJsonPath: $.status.conditions[?(@.type=="Available" && @.status=="False")]
Then we could do something like:
func getCRReadiness(obj *unstructured.Unstructured, readinessConfigs []readinessConfig) readyVal {
	objJson, _ := json.Marshal(obj.Object)
	ready := readyTrue
	for _, rc := range readinessConfigs {
		if isDefined(rc.trueJsonPath) && len(evalJsonPath(objJson, rc.trueJsonPath)) > 0 {
			ready = ready.And(readyTrue)
		} else if isDefined(rc.falseJsonPath) && len(evalJsonPath(objJson, rc.falseJsonPath)) > 0 {
			ready = readyFalse
			break
		} else {
			ready = readyUnknown
			break
		}
	}
	return ready
}
trueJsonPath: $.status.conditions[?(@.type=="Ready" && @.status=="True")]
falseJsonPath: $.status.conditions[?(@.type=="Ready" && @.status=="False")]
I like the idea of having one single jsonPath including the value. If this criterion returns something, then the criterion is met -> ready: true.
However, I don't think that you necessarily have a Ready: False. Look at the conditions of the helm operator: you can have ReconcileError or ReleaseFailed and not Deployed: False.
I think we can do without Unknown and still be coherent, which would give, with your code (considering a transformation to groupVersionKind):
func getCRReadiness(obj *unstructured.Unstructured, readinessConfigs []readinessConfig) readyVal {
	objJson, _ := json.Marshal(obj.Object)
	ready := readyTrue
	for _, rc := range readinessConfigs {
		if obj.GroupVersionKind() != rc.groupVersionKind {
			continue
		}
		if isDefined(rc.trueJsonPath) && len(evalJsonPath(objJson, rc.trueJsonPath)) == 0 {
			ready = readyFalse
			break
		}
	}
	return ready
}
svc will always be caught by the first if statement (since it is not defined in watches.readiness), so it will keep the default value Ready: true.
ksvc and deploy will be caught by the second if statement.
If we cannot retrieve the jsonPath, it means that either the value is false (or the contrary of what we have defined), or the condition does not exist (because another condition has been triggered, like for deployed in the helm operator). So status: false.
If it exists, it won't change the default value, so status: true.
We can avoid the loop if []readinessConfig is a map with gvk.String() as key (GroupVersionKind does not seem to be Comparable) and the jsonPath as value.
That would give something like:
func getCRReadiness(obj *unstructured.Unstructured, readinessConfigs map[string]string) readyVal {
	if trueJsonPath, ok := readinessConfigs[obj.GroupVersionKind().String()]; ok {
		objJson, _ := json.Marshal(obj.Object)
		if isDefined(trueJsonPath) && len(evalJsonPath(objJson, trueJsonPath)) == 0 {
			return readyFalse
		}
	}
	return readyTrue
}
svc will not be part of the map, so its ready stays true.
Also, I moved objJson inside the if statement to avoid marshaling when it is not needed (like for svc).
Yeah you're right. I had it in my head that obj was the CR, but it is the kind we're comparing. We would still need an outer loop that iterates the readinessConfigs to come up with the final value though, right?
Not to blow up this train of thought, but I wonder if we could come up with something more generic. For example:
statusConfig:
  - key: "ready"
    value: "True"
    when:
      - version: v1alpha1
        group: serving.knative.dev
        kind: Service
        jsonPath: $.status.conditions[?(@.type=="Ready" && @.status=="True")]
      - version: v1beta1
        group: extensions
        kind: Deployment
        jsonPath: $.status.conditions[?(@.type=="Available" && @.status=="True")]
We could iterate these in order. If keys are duplicated, the last one where all when conditions evaluate to true takes precedence.
If, for a given key, none of the configs evaluate to true, we would just not add that key to the status. And to prevent stale status fields, we would clear out the custom status fields and re-evaluate them from scratch on every reconciliation.
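If it helps to picture the shape of the config, the statusConfig entries above could map onto Go types roughly like this (names assumed; this is not an existing watches.yaml schema):

package sketch

// StatusConfig is one entry under the hypothetical `statusConfig` list:
// the key/value pair to set on the CR status when all `when` conditions hold.
type StatusConfig struct {
	Key   string          `json:"key"`
	Value string          `json:"value"`
	When  []WhenCondition `json:"when,omitempty"`
}

// WhenCondition matches a dependent resource by GVK and requires the given
// JSONPath to resolve to at least one value on that resource.
type WhenCondition struct {
	Group    string `json:"group"`
	Version  string `json:"version"`
	Kind     string `json:"kind"`
	JSONPath string `json:"jsonPath"`
}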
So if you wanted to support ready: Unknown, you could do:
statusConfig:
  - key: "ready"
    value: "Unknown"
  - key: "ready"
    value: "True"
    when:
      - version: v1alpha1
        group: serving.knative.dev
        kind: Service
        jsonPath: $.status.conditions[?(@.type=="Ready" && @.status=="True")]
      - version: v1beta1
        group: extensions
        kind: Deployment
        jsonPath: $.status.conditions[?(@.type=="Available" && @.status=="True")]
I like this format. It is intuitive to read (especially the when), and indeed generic enough. Good find.
We would still need an outer loop that iterates the readinessConfigs to come up with the final value though, right?
I am not sure. If you have a map with the dependent resources' groupVersionKinds as keys, you just have to look up that key. The map can be built when the watches are loaded, so at operator startup. It won't slow down the operations then.
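To illustrate, a minimal sketch (readinessConfig and the field names are assumptions mirroring the earlier YAML) of building that lookup once at watches-load time:

package sketch

import "k8s.io/apimachinery/pkg/runtime/schema"

// readinessConfig mirrors one hypothetical `readiness` entry in watches.yaml.
type readinessConfig struct {
	Group        string
	Version      string
	Kind         string
	TrueJsonPath string
}

// buildReadinessMap is built once when the watches file is loaded, so lookups
// during reconciliation are a single map access keyed by gvk.String().
func buildReadinessMap(configs []readinessConfig) map[string]string {
	m := make(map[string]string, len(configs))
	for _, rc := range configs {
		gvk := schema.GroupVersionKind{Group: rc.Group, Version: rc.Version, Kind: rc.Kind}
		m[gvk.String()] = rc.TrueJsonPath
	}
	return m
}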
If I have understood correctly, the status change will trigger a call of the function getCRReadiness with the dependent resource as the obj *unstructured.Unstructured argument, correct?
The dependent resource will then get its jsonPath (or nothing) directly from the map, without adding a loop.
If I have understood correctly, the status change will trigger a call of the function getCRReadiness with the dependent resource as the obj *unstructured.Unstructured argument, correct? The dependent resource will then get its jsonPath (or nothing) directly from the map, without adding a loop.
Ah no, not quite. When a dependent resource changes, the watch will trigger a reconciliation of the parent CR (not the dependent resource), which is what we want because we need to update the CR's status. So we need to look at all of the jsonpaths for all of the dependent resources every time we reconcile the parent CR to make sure we have the right status value. So we'll have to loop over every status config and when conditions during the parent CR reconciliation.
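To illustrate that loop, here is a rough sketch (all names assumed; the `when` checks are abstracted behind simple functions) of re-deriving the custom status fields from scratch on each reconciliation of the parent CR:

package sketch

import (
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// statusEntry is one hypothetical statusConfig entry; each `when` function
// reports whether its condition holds for the release's dependent resources.
type statusEntry struct {
	Key   string
	Value string
	When  []func(deps []*unstructured.Unstructured) bool
}

// customStatusFields re-derives the custom fields from scratch: entries are
// evaluated in order, later entries with the same key win, and a key is only
// set when all of its `when` conditions hold.
func customStatusFields(deps []*unstructured.Unstructured, entries []statusEntry) map[string]string {
	fields := map[string]string{}
	for _, e := range entries {
		allMatch := true
		for _, when := range e.When {
			if !when(deps) {
				allMatch = false
				break
			}
		}
		if allMatch {
			fields[e.Key] = e.Value
		}
	}
	return fields
}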
Yeah, sorry, you had already explained it to me. I keep getting confused because we can identify the status change in the predicate, but we cannot send it back to the reconcile method.
So, IIUC, in the Reconcile method, the reconciled CR is aware of its dependent resources, correct? That means we can loop over its dependent resources in Reconcile:
if manager.IsUpdateRequired() {
	[...]
	for _, dr := range o.getDependentResources() {
		status.SetCondition(types.HelmAppCondition{
			Type:   getStatusKey(),
			Status: getCRReadiness(dr, readinessConfigs),
		})
		_ = r.updateResourceStatus(o, status)
	}
}
Then getCRReadiness can use the previous map. So I still think we do not need to loop over the jsonPaths.
Close, but a couple of tweaks:
status := o.Object["status"].(map[string]interface{})
for _, dr := range o.getDependentResources() {
	status[getStatusKey()] = getStatusValue(dr, statusConfigs)
}
_ = r.updateResourceStatus(o, status)
There's probably some refactoring to do to support this. For example, we need to add something like a CustomFields map[string]string value to the status struct and make sure it marshals the custom fields correctly.
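One way the marshalling could work, as a hedged sketch (the type and field names here are trimmed-down assumptions, not the actual helm-operator status types): keep the custom fields in a map and flatten them next to the regular fields when the status is serialized.

package sketch

import "encoding/json"

// helmAppStatus is a trimmed-down stand-in for the operator's status struct.
type helmAppStatus struct {
	Conditions   []map[string]string `json:"conditions,omitempty"`
	CustomFields map[string]string   `json:"-"`
}

// MarshalJSON flattens CustomFields into the top level of the status object,
// so entries like "ready": "True" show up directly under .status.
func (s helmAppStatus) MarshalJSON() ([]byte, error) {
	type plain helmAppStatus // alias type without methods, to avoid recursion
	base, err := json.Marshal(plain(s))
	if err != nil {
		return nil, err
	}
	out := map[string]interface{}{}
	if err := json.Unmarshal(base, &out); err != nil {
		return nil, err
	}
	for k, v := range s.CustomFields {
		out[k] = v
	}
	return json.Marshal(out)
}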
We should also probably refactor the reconcile method so that when install, upgrade, and reconcile complete, they all fall through to some common code so that we don't have to repeat these in each section.
Maybe a defer to update all the conditions (ConditionReleaseFailed, ConditionDeployed, ConditionIrreconcilable, ConditionDependentResourcesReady)?
It would allow setting the conditions to their default values (so, for instance, ConditionDeployed=false) and would reduce the code in reconcile.
All the conditions would be displayed in the yaml, which may help the user implement some tests from an external program or add some custom columns to the CRD.
So we can move r.updateResourceStatus(o, status) into a defer, and the various status.SetCondition calls can be moved to simple methods so as to reduce the code in Reconcile, like:
func (s *HelmAppStatus) SetDeployedCondition(value types.ConditionStatus, reason types.HelmAppConditionReason, msg string) {
	s.SetCondition(types.HelmAppCondition{
		Type:    types.ConditionDeployed,
		Status:  value,
		Reason:  reason,
		Message: msg,
	})
}

func (s *HelmAppStatus) SetReadyCondition(value types.ConditionStatus) {
	s.SetCondition(types.HelmAppCondition{
		Type:   types.ConditionDependentResourcesReady,
		Status: value,
	})
}

func (r HelmOperatorReconciler) Reconcile(request reconcile.Request) (reconcile.Result, error) {
	o := &unstructured.Unstructured{}
	...
	status := types.StatusFor(o)
	defer r.updateResourceStatus(o, status)
	...
	if err != nil {
		log.Error(err, "Release failed")
		status.SetReleaseFailedCondition(types.StatusTrue, types.ReasonInstallError, err.Error())
		return reconcile.Result{}, err
	}
	status.SetReleaseFailedCondition(types.StatusFalse, "", "")
	...
	status.SetDeployedCondition(types.StatusTrue, types.ReasonInstallSuccessful, message)
	...
	status.SetReadyCondition(types.StatusTrue)
At the return of Reconcile, the defer will update the resource with all the conditions and DeployedRelease. The RemoveCondition calls can be removed in favor of setting the condition to false.
Sorry, I haven't forgotten about this. I was out for a month. I've added this to the v1.0.0 milestone so we make sure to get this in.
I've been working on a major refactoring of the Helm Operator here that will likely coincide with the 1.0.0 release, so I'll try some of these ideas out there as I find time.
Or if you're interested in working on this, I would definitely accept PRs there and help you out.
/lifecycle frozen
Issues go stale after 90d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle stale
/remove-lifecycle stale
/lifecycle frozen
Is there any solution for this kind of use case when using Helm-based operators?
I think something like this is important, as an equivalent to helm upgrade --wait --wait-for-jobs.
The way I could see this working is that the helm-operator would watch all the resources to which it has attached an ownerReference for changes to their .status.conditions field, looking for a condition that indicates the resource is ready.
By default, I guess it would look for a condition with type: Ready, but for known resources that are part of the core APIs it could watch for different conditions, e.g. for jobs it would look for type: Complete and for deployments it would be type: Available.
That could then be reported as an additional condition on the CRD being managed by helm-operator, e.g. type: Ready, alongside the actual Helm release condition (type: Deployed).
Hopefully there would be some logic from the Helm wait code that we could reuse to determine when resources of different kinds are ready, but I totally appreciate that we don't want a blocking operation as part of the Helm release procedure. That would not fit the pattern of a CRD with a reconciliation loop, hence why helm-operator must take responsibility for it.
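For what it's worth, Helm v3 exposes the readiness logic behind --wait as a ReadyChecker in helm.sh/helm/v3/pkg/kube, which could be polled from the reconcile loop instead of blocking. A hedged sketch, assuming that API is available to the operator in roughly this shape (the function name here is made up):

package sketch

import (
	"context"

	"helm.sh/helm/v3/pkg/kube"
	"k8s.io/cli-runtime/pkg/resource"
	"k8s.io/client-go/kubernetes"
)

// allReleaseResourcesReady asks, without blocking or waiting, whether every
// resource in the release is ready right now; the answer could be reported as
// a condition on the CR and re-checked on the next reconciliation.
func allReleaseResourcesReady(ctx context.Context, cs kubernetes.Interface, infos []*resource.Info) (bool, error) {
	checker := kube.NewReadyChecker(cs, func(string, ...interface{}) {}, kube.CheckJobs(true))
	for _, info := range infos {
		ready, err := checker.IsReady(ctx, info)
		if err != nil || !ready {
			return false, err
		}
	}
	return true, nil
}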
We need to look into this issue to check whether it has been resolved in recent versions, or whether implementing this feature is still required given that there are no workarounds.
Feature Request
I have recently discovered the helm-operator. I really like the simplicity of maintaining the release definitions and how the operator watches the parent definition as well as the dependent resources, and I have started migrating some of our deployments to it.
As a status, the operator returns the status of the deployment itself in status.Conditions, as well as the status.deployedRelease name and manifest.
What I find missing is a way to add the status (or any other field) of the dependent resources to custom columns when watchDependentResources is true.
For example, I was thinking about having an optional field in watches.yaml that indicates to add a field status.podReadiness to the CRD, holding the value of the Ready condition from the child Pod.
This would allow the CRD to add custom columns, and therefore have the statuses of the dependent resources in the get operation of the CR.
I can't estimate the complexity of the implementation yet.
As a Watch is already registered on the dependent resources, I guess the feasibility of the change depends on a status change being catchable by DependentPredicateFuncs. Does a status change trigger an UpdateEvent?