redhat-cop / namespace-configuration-operator

The namespace-configuration-operator helps keep configurations related to Users, Groups and Namespaces aligned with one or more policies specified as CRs
Apache License 2.0

When status update fails on CR, all following enqueued namespaceconfigs are not processed #132

Open fherbert opened 1 year ago

fherbert commented 1 year ago

We have 5 namespaceconfigs. Sometimes it takes a long time for all of the namespace configurations to be applied to a namespace. We've found a correlation: when the controller fails to update the CR status of a namespaceconfig, any pending namespaceconfig reconciles are not processed until the next time a reconcile is triggered. An example log of the failed CR status update is below:

2022-10-26T21:54:53.240Z    ERROR    enforcing-reconciler    unable to update status for    {"object": {"kind":"NamespaceConfig","apiVersion":"redhatcop.redhat.io/v1alpha1","metadata":{"name":"default-resourcequota",}}, "error": "Operation cannot be fulfilled on namespaceconfigs.redhatcop.redhat.io \"default-resourcequota\": the object has been modified; please apply your changes to the latest version and try again"}
github.com/redhat-cop/namespace-configuration-operator/controllers.(*NamespaceConfigReconciler).Reconcile
    /home/runner/work/namespace-configuration-operator/namespace-configuration-operator/controllers/namespaceconfig_controller.go:127
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
    /home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.9.2/pkg/internal/controller/controller.go:298
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.9.2/pkg/internal/controller/controller.go:253
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
    /home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.9.2/pkg/internal/controller/controller.go:214
2022-10-26T21:54:53.240Z    ERROR    controller-runtime.manager.controller.namespaceconfig    Reconciler error    {"reconciler group": "redhatcop.redhat.io", "reconciler kind": "NamespaceConfig", "name": "default-resourcequota", "namespace": "", "error": "Operation cannot be fulfilled on namespaceconfigs.redhatcop.redhat.io \"default-resourcequota\": the object has been modified; please apply your changes to the latest version and try again"}

Any other namespace configs after default-limitrange are not processed during this reconcile event.

I'm not sure whether replacing the ManageSuccess function with ManageSuccessWithRequeue in the namespaceconfig Reconcile function would fix this.

It would seem like a bug not to keep processing the rest of the namespaceconfigs when one of them fails to update its CR status.

raffaelespazzoli commented 1 year ago

I'm not sure I understand the issue. Do you have 5 CRs, or do you have one CR with 5 objects in it? If the latter, then the behavior you describe is to be expected. If the former, then it's surprising.

fherbert commented 1 year ago

We have 5 namespaceconfig CRs (each has a couple of objects defined) that apply to the namespaces (matching labels etc.). We're not able to reproduce this issue consistently, but when we do, we see the enforcing-reconciler error posted above and no further reconciliation occurs, meaning the remaining namespaceconfigs do not get applied to the namespace as expected. I'd expect the reconciler to requeue given the error returned from ManageSuccess, but we can't see that happening.

Just to make things interesting, during normal operation, we do see the error updating CR status and the remaining namespaceconfigs do get processed so I'm thinking there may be something else causing the reconciliation to stop but we can't see anything else in the operator logs.

We have tried annotating and/or labelling the affected namespaces to trigger a reconciliation but the missing namespaceconfig objects don't get applied. Only after another reconciliation is triggered (from new namespace or restarting the operator) do the remaining namespaceconfigs get applied. Could this be a cache issue?

We are running OpenShift 4.10.23, NCO v1.2.4

raffaelespazzoli commented 1 year ago

That error is due to a race condition that I was never able to fix, but it is innocuous, as the controller usually retries and eventually succeeds. Different CRs are processed independently, so I don't see how this error would influence the execution of other CRs. I think controllers are set by default with a parallelism of 1, meaning one CR at a time. Still, once a CR completes the reconcile cycle, whether it was successful or not, the next should start...

limlengchye commented 1 year ago

But I think by default controllers are set with parallelism of 1, meaning one CR at a time. Still once a CR completes the reconcile cycle whether it was successful or not, the next should start...

Looks like this is the issue we are facing. Sometimes (not always), when a namespace gets re-created (deleted and then created again without a wait in between), some of the CRs "do nothing" after completing their reconciliation cycles. I think this has to do with the race condition you mentioned and with timing. I could be wrong, but I suspect those CRs had yet to complete their reconciliation cycles when the namespace got re-created. As a result, those remaining CRs "do nothing": their locked resource managers compare the parameters passed in with the list of locked resources they already hold, find no differences, and therefore do nothing, even though the actual locked resource objects are no longer there... Could this happen?
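The hypothesis above can be made concrete with a toy model. This is not the operator's actual locked-resource-manager code: the manager type and both sync methods are hypothetical, and the sketch only shows how comparing desired state against internal bookkeeping, rather than against the cluster, would produce a "do nothing" result after the objects have been deleted.

```go
package main

import (
	"fmt"
	"sort"
)

// manager is a hypothetical, simplified locked-resource manager.
type manager struct {
	tracked map[string]bool // what the manager believes it is enforcing
}

// missing returns the names in desired that are absent from have, sorted.
func missing(desired []string, have map[string]bool) []string {
	var out []string
	for _, n := range desired {
		if !have[n] {
			out = append(out, n)
		}
	}
	sort.Strings(out)
	return out
}

// syncAgainstTracked creates only what is missing from the manager's own bookkeeping.
func (m *manager) syncAgainstTracked(desired []string, cluster map[string]bool) []string {
	created := missing(desired, m.tracked)
	for _, n := range created {
		cluster[n] = true
		m.tracked[n] = true
	}
	return created
}

// syncAgainstCluster creates what is missing from the cluster itself.
func (m *manager) syncAgainstCluster(desired []string, cluster map[string]bool) []string {
	created := missing(desired, cluster)
	for _, n := range created {
		cluster[n] = true
		m.tracked[n] = true
	}
	return created
}

func main() {
	desired := []string{"quota", "limitrange"}
	cluster := map[string]bool{}
	m := &manager{tracked: map[string]bool{}}

	fmt.Println(m.syncAgainstTracked(desired, cluster)) // [limitrange quota]

	// The namespace is deleted and re-created: the objects are gone,
	// but the manager's bookkeeping still lists them.
	cluster = map[string]bool{}
	fmt.Println(m.syncAgainstTracked(desired, cluster)) // []
	fmt.Println(m.syncAgainstCluster(desired, cluster)) // [limitrange quota]
}
```

If something like the first comparison happens in practice, resetting the bookkeeping when the namespace is deleted, or checking against the live objects, would avoid the no-op. Whether the operator behaves like this is exactly the open question in this comment.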

raffaelespazzoli commented 1 year ago

Looks like this is the issue we are facing. Sometimes (not always), when a namespace gets re-created (deleted and then created again without a wait in between), some of the CRs "do nothing" after completing their reconciliation cycles. I think this has to do with the race condition you mentioned and with timing.

I saw that behavior too. I think that's a different issue, and so far I have not been able to troubleshoot it. I have not spent enough time on it yet.

GerbenWelter commented 1 year ago

I'm currently facing a very similar problem. We use a NamespaceConfig to create a set of 8 NetworkPolicies upon Namespace creation. This works without fail. The CR also has matchExpressions so that we can annotate a Namespace in which we don't want the NetworkPolicies deployed. When the annotation is applied, only 3 of the 8 NetworkPolicies get removed. This happens consistently. The controller-manager pod logs a couple of the following messages:

"Operation cannot be fulfilled on namespaceconfigs.redhatcop.redhat.io \"networkpolicies\": the object has been modified; please apply your changes to the latest version and try again"

@raffaelespazzoli mentioned it should resolve itself after some time, but in my experience it doesn't. I even let it sit overnight and there was no change. Removing the annotation instantly brings back all the NetworkPolicies.