kubernetes-sigs / hierarchical-namespaces

Home of the Hierarchical Namespace Controller (HNC). Adds hierarchical policies and delegated creation to Kubernetes namespaces for improved in-cluster multitenancy.
Apache License 2.0

Resources status should be compatible with kstatus #142

Closed · erikgb closed this issue 1 year ago

erikgb commented 2 years ago

As a user of tools like FluxCD, I would like Flux to know if/when my HNC resources are fully reconciled and operational (ready/healthy). Flux supports health assessment for all/most in-tree resource types, and it turns out that Flux can also support custom resources, provided the resource exposes a status compatible with kstatus.

I think this should be fairly easy to implement, and I would be happy to submit a PR to fix it. But we need to decide what to do with the current content of the resource statuses: can we delete something (a breaking change), or should we avoid breaking changes and keep duplicated status content with the same meaning?
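Concretely, this is roughly the status shape kstatus can work with; a minimal sketch in Go, where the package, type and field names are only illustrative and not HNC's current API:

```go
package v1alpha2 // package and type names illustrative only

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// ExampleStatus sketches a kstatus-friendly status: an observedGeneration
// field plus standard metav1.Conditions, which kstatus can interpret without
// any resource-specific knowledge.
type ExampleStatus struct {
	// ObservedGeneration is the .metadata.generation most recently reconciled.
	ObservedGeneration int64 `json:"observedGeneration,omitempty"`

	// Conditions use "abnormal-true" polarity: no conditions present means the
	// resource is fully reconciled; Reconciling/Stalled report progress/problems.
	Conditions []metav1.Condition `json:"conditions,omitempty"`
}
```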

adrianludwin commented 2 years ago

I think our Condition struct is compatible with the standard (see here) so that's a good start.

As for replacing the existing conditions, the issue I have is that HierarchyConfiguration currently has one Condition that means "this entire namespace is in a bad state" (ActivitiesHalted) and another that means "something's gone wrong but it's not a disaster" (BadConfiguration). I think kstatus would lump these both into Failed, and I don't particularly want to lose the existing distinction. But I'm totally on board with adding a kstatus Stalled condition if either of the above are true. (I'd rather not use the Ready=true/false condition since namespaces aren't exactly ever "ready").

Similarly, HNCConfiguration has three conditions: BadConfiguration, OutOfSync and the poorly named NamespaceCondition (meaning there's a problem in some namespace managed by HNC). I think kstatus would map the first two of these to Stalled as well. The third one is more informational and doesn't have anything to do with the HNCConfiguration object itself, so it probably doesn't need a kstatus-compliant condition.
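To make that mapping concrete, something like the following could work. This is only a sketch: the helper name, reason strings and message are made up for illustration, and only the Stalled condition type itself comes from kstatus:

```go
package config // package and helper names illustrative only

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// stalledFromExisting derives a kstatus Stalled condition from HNC's existing
// condition types, without removing them. With abnormal-true polarity,
// returning nil (no condition) means everything is fine.
func stalledFromExisting(activitiesHalted, badConfiguration bool) *metav1.Condition {
	if !activitiesHalted && !badConfiguration {
		return nil
	}
	reason := "BadConfiguration"
	if activitiesHalted {
		reason = "ActivitiesHalted" // the more severe of the two wins
	}
	return &metav1.Condition{
		Type:    "Stalled", // the standard kstatus "something is wrong" condition
		Status:  metav1.ConditionTrue,
		Reason:  reason,
		Message: "see the HNC-specific conditions for details",
	}
}
```

That way the existing distinction between ActivitiesHalted and BadConfiguration is preserved in the HNC-specific conditions, while kstatus-aware tools only see a single Stalled signal.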

How does that sound?

erikgb commented 2 years ago

I think it could make sense to keep the existing conditions when the HNC resources are not fully reconciled. AFAIK the most important requirements in kstatus seem to be:

- expose status.observedGeneration, so consumers can tell whether the controller has seen the latest generation of the spec, and
- use the standard conditions (Reconciling, Stalled) with "abnormal-true" polarity, so that the absence of conditions means the resource is fully reconciled.

adrianludwin commented 2 years ago

I thought kstatus ignored any condition it didn't recognize. The only conditions it recognizes are Ready, Stalled and Reconciling. Is that not correct? (If not, can you point me to the line of code where it looks for those?)

HNC doesn't actually know when anything finishes reconciling except the SubnamespaceAnchor, so we could add it there. Adding it anywhere else would be a very large task, not just a minor API reporting-convention change.

I'm ok with adding status.observedGeneration of course.
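For illustration, a minimal sketch of recording it at the end of a successful reconcile; the interface and helper names are made up here, not existing HNC code:

```go
package reconcilers // package and names illustrative only

import (
	"context"

	"sigs.k8s.io/controller-runtime/pkg/client"
)

// hasObservedGeneration stands in for any HNC object whose status gains an
// observedGeneration field.
type hasObservedGeneration interface {
	client.Object
	SetObservedGeneration(gen int64)
}

// recordObservedGeneration copies .metadata.generation into the status after a
// successful reconcile; this is what lets kstatus (and Flux) tell whether the
// latest spec has been acted on.
func recordObservedGeneration(ctx context.Context, c client.Client, obj hasObservedGeneration) error {
	obj.SetObservedGeneration(obj.GetGeneration())
	return c.Status().Update(ctx, obj)
}
```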

erikgb commented 2 years ago

The Ready condition is a bit special, and I understand the other conditions/statuses as just recommendations. The most important aspect is noted in the docs:

> The conditions defined in the library are designed to adhere to the "abnormal-true" polarity pattern (https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#typical-status-properties), i.e. that conditions are present and with a value of true whenever something unusual happens. So the absence of any conditions means everything is normal. Normal in this situation simply means that the latest observed generation of the resource manifest by the controller have been fully reconciled with the actual state.

adrianludwin commented 2 years ago

Understood, but the actual code (here and here) seems to ignore any condition except one of those three (except for K8s built-in objects).

Anyway, I'd be in favour of adding a Reconciling condition to the Anchor while it's waiting for the subnamespace to be created, and of adding a Stalled condition to the HierarchyConfiguration and HNCConfiguration (I'd also be ok with Ready: False, but it seems like this is less preferred). I don't want to add a Reconciling condition to the *Configuration objects since we don't actually have the ability to set that accurately, at least not right now, and possibly not ever (the observedGeneration trick doesn't help when the reconciler's actually working on a completely different object).
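A rough sketch of the anchor part; only the Reconciling condition type comes from kstatus, while the helper name, reason and message are made up for illustration:

```go
package anchors // package and names illustrative only

import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// setAnchorReconciling marks a SubnamespaceAnchor as Reconciling while its
// subnamespace does not yet exist, and clears the condition once it does.
func setAnchorReconciling(conds *[]metav1.Condition, generation int64, subnamespaceExists bool) {
	if subnamespaceExists {
		meta.RemoveStatusCondition(conds, "Reconciling")
		return
	}
	meta.SetStatusCondition(conds, metav1.Condition{
		Type:               "Reconciling", // standard kstatus "work in progress" condition
		Status:             metav1.ConditionTrue,
		ObservedGeneration: generation,
		Reason:             "SubnamespaceNotCreated",
		Message:            "waiting for the subnamespace to be created",
	})
}
```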

erikgb commented 2 years ago

@adrianludwin I'd like to push this forward. This is what I suggest doing, in the following order (and as separate pull requests):

  1. Introduce status.observedGeneration on all HNC custom resources that are watched/reconciled.
  2. Introduce conditions in the SubnamespaceAnchor status, since conditions are missing completely today.
  3. Add a new Stalled condition to HNCConfiguration and HierarchyConfiguration in case anything is wrong. This will complement the existing conditions, at least for now.

When this is done, I think the resources are compatible with kstatus, and this issue can be closed.
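Once those pieces are in place, kstatus itself could be run against the objects as a sanity check. A sketch of such a test follows; the anchor manifest is hand-written for illustration rather than taken from a real cluster:

```go
package status_test // illustrative test sketch only

import (
	"testing"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"sigs.k8s.io/cli-utils/pkg/kstatus/status"
)

// TestAnchorIsCurrent feeds a hand-written anchor to the kstatus library and
// expects it to be computed as Current: generation matches observedGeneration
// and no abnormal-true conditions are present.
func TestAnchorIsCurrent(t *testing.T) {
	u := &unstructured.Unstructured{Object: map[string]interface{}{
		"apiVersion": "hnc.x-k8s.io/v1alpha2",
		"kind":       "SubnamespaceAnchor",
		"metadata": map[string]interface{}{
			"name":       "child",
			"namespace":  "parent",
			"generation": int64(1),
		},
		"status": map[string]interface{}{
			"observedGeneration": int64(1),
			"conditions":         []interface{}{},
		},
	}}
	res, err := status.Compute(u)
	if err != nil {
		t.Fatal(err)
	}
	if res.Status != status.CurrentStatus {
		t.Fatalf("expected Current, got %s", res.Status)
	}
}
```

Flux's health checks build on the same library, so if this computes the expected statuses, Flux should be able to assess health as described above.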

But depending on if/when we can break the current status API, there are two further tasks I would suggest:

  1. Remove the redundant SubnamespaceAnchor status.status, and instead use the suggested semantics from kstatus.
  2. Remove the existing conditions, which are IMO now redundant, and rely on the condition reason for further details. I might have overlooked something, but there seems to be exactly one condition type for each reason in HNC.

adrianludwin commented 2 years ago

Yes, each reason has exactly one condition type, so I believe they could be removed. The bigger problem is that some of these conditions are basically warnings, and others are errors. Can kstatus represent that? Or can we add a "Warning" condition that gets ignored by kstatus (see my last comment from Feb 7)?

Are you ok with never being able to add a Reconciling condition to HierarchyConfiguration? We have no ability to set or unset that right now, and may never have that ability. E.g. is a namespace finished reconciling when we've guaranteed that the structure is correct, or when all objects have been propagated into it, or when all objects have been propagated out of it? And how do we feasibly implement that?

What does observedGeneration mean in the context of HierarchyConfiguration? An HC may need to be reconciled due to a change in a different namespace, so just because observed generation == actual generation doesn't mean that we're in sync. I mean, we can add it to make kstatus and Flux happy, but it won't give the assurances that I think it's supposed to give.

It might be good to collect all this in a Google doc we can iterate on rather than leaving it in a discussion format in this issue. wdyt?

adrianludwin commented 2 years ago

... with all that said, adding all this to the existing API sgtm, and then when we next bump the API to v1beta1, we can drop the custom fields.

erikgb commented 2 years ago

/assign

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

erikgb commented 2 years ago

Still valid, to support CI/CD tools like Flux and Argo.

erikgb commented 2 years ago

/remove-lifecycle stale

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

erikgb commented 2 years ago

/remove-lifecycle rotten

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue or PR as fresh with /remove-lifecycle stale
- Close this issue or PR with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot commented 1 year ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/kubernetes-sigs/hierarchical-namespaces/issues/142#issuecomment-1500903227):

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages issues according to the following rules:
>
> - After 90d of inactivity, `lifecycle/stale` is applied
> - After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
> - After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
> You can:
>
> - Reopen this issue with `/reopen`
> - Mark this issue as fresh with `/remove-lifecycle rotten`
> - Offer to help out with [Issue Triage][1]
>
> Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
> /close not-planned
>
> [1]: https://www.kubernetes.dev/docs/guide/issue-triage/

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.