Open AleksanderBrzozowski opened 8 months ago
Hi @AleksanderBrzozowski , thx for the issue,
Make sure that the finalizer is applied first before creating secondary resources
Actually, this should be the case. Will take a look.
Yeah, I saw this code part. But it seems that in some cases it doesn't work correctly. Have you tried running the test from my sample repo? π
reproduce it, seems that the cleanup is not called when it fails (at least in my case), while the finalizer is added.
If you log the resource in the reconciler the finalizer is always there.
Notes: it theoretically can happen that the ConfigMap is created but on cleanup is not in the cache yet, so it will not be deleted (this is a limitation and issue with this method - when resources are in different NS), but dependent resource mitigate / eliminate this issue, so in this : https://javaoperatorsdk.io/docs/dependent-resources#caching-and-event-handling-in-kubernetesdependentresource
So event this should not be the case.
Not sure that I follow π
So I guess that you were able to reproduce the issue by running a test?
Are you saying that we can somehow fix this issue by modifying the code? The ConfigMapDependentResource
already extends KubernetesDependentResource
class π
Are you saying that we can somehow fix this issue by modifying the code? The ConfigMapDependentResource already extends KubernetesDependentResource class π
No, not saying that :) still investigating, will probably take a while
No need to rush. We don't have this problem in production, we only discovered it when running integration tests, and we were able to solve it π
interesting is that I created a reproducer now in pure Java and not using Quarkus and not able to reproduce there: https://github.com/csviri/race-condition-reproducer
The problem seems to be that there is a Delete
event received at some point. Which should not be there, this could be event from the previous test case. This si received while the reconciliation is running. But since the finalizer is not removed by the test reconciler for sure not sure how this can happen. (Maybe from previous test runs? - on the first check this does not seems to be the case)
The problem seems to be with either k8s API or informer:
2024-03-04 14:32:33,843 DEBUG [io.jav.ope.pro.eve.sou.con.ControllerResourceEventSource] (-1660409789-pool-11-thread-3) Event received for resource: ResourceID{name='test', namespace='default'} version: 3156 uuid: cd21c403-f3e4-4830-8efe-a597f77155c1 action: DELETED
2024-03-04 14:32:33,843 DEBUG [io.jav.ope.pro.eve.sou.con.ControllerResourceEventSource] (-1660409789-pool-11-thread-3) Event Old resource: null,
new resource: CustomResource{kind='MyCustomResource', apiVersion='example.com/v1alpha1', metadata=ObjectMeta(annotations={}, creationTimestamp=2024-03-04T13:32:33Z, deletionGracePeriodSeconds=null, deletionTimestamp=null, finalizers=[mycustomresources.example.com/finalizer], generateName=null, generation=1, labels={}, managedFields=[], name=test, namespace=default, ownerReferences=[], resourceVersion=3156, selfLink=null, uid=cd21c403-f3e4-4830-8efe-a597f77155c1, additionalProperties={}), spec=null, status=null}
2024-03-04 14:32:33,843 DEBUG [io.jav.ope.pro.eve.EventProcessor] (-1660409789-pool-11-thread-3) Received event: ResourceEvent{action=DELETED, associated resource id=ResourceID{name='test', namespace='default'}}
As can be seen here we receive a delete event for a resource that still has a finalizer.
cc @shawkins @manusa @metacosm
Full log of a failing test case:
2024-03-04 14:32:33,831 DEBUG [io.jav.ope.pro.eve.sou.con.ControllerResourceEventSource] (-1660409789-pool-11-thread-3) Event received for resource: ResourceID{name='test', namespace='default'} version: 3154 uuid: cd21c403-f3e4-4830-8efe-a597f77155c1 action: ADDED
2024-03-04 14:32:33,831 DEBUG [io.jav.ope.pro.eve.sou.con.ControllerResourceEventSource] (-1660409789-pool-11-thread-3) Event Old resource: null,
new resource: CustomResource{kind='MyCustomResource', apiVersion='example.com/v1alpha1', metadata=ObjectMeta(annotations={}, creationTimestamp=2024-03-04T13:32:33Z, deletionGracePeriodSeconds=null, deletionTimestamp=null, finalizers=[], generateName=null, generation=1, labels={}, managedFields=[], name=test, namespace=default, ownerReferences=[], resourceVersion=3154, selfLink=null, uid=cd21c403-f3e4-4830-8efe-a597f77155c1, additionalProperties={}), spec=null, status=null}
2024-03-04 14:32:33,831 DEBUG [io.jav.ope.pro.eve.EventProcessor] (-1660409789-pool-11-thread-3) Received event: ResourceEvent{action=ADDED, associated resource id=ResourceID{name='test', namespace='default'}}
2024-03-04 14:32:33,831 DEBUG [io.jav.ope.pro.eve.EventProcessor] (-1660409789-pool-11-thread-3) Marking event received for: ResourceID{name='test', namespace='default'}
2024-03-04 14:32:33,831 DEBUG [io.jav.ope.pro.eve.sou.inf.ManagedInformerEventSource] (-1660409789-pool-11-thread-3) Resource not found in temporary cache reading it from informer cache, for Resource ID: ResourceID{name='test', namespace='default'}
2024-03-04 14:32:33,831 DEBUG [io.jav.ope.pro.eve.sou.inf.ManagedInformerEventSource] (-1660409789-pool-11-thread-3) Resource found in cache: true for id: ResourceID{name='test', namespace='default'}
2024-03-04 14:32:33,831 DEBUG [io.jav.ope.pro.eve.EventProcessor] (-1660409789-pool-11-thread-3) Executing events for custom resource. Scope: ExecutionScope{resource: null}
2024-03-04 14:32:33,831 DEBUG [io.jav.ope.pro.eve.sou.inf.ManagedInformerEventSource] (pool-14-thread-6) Resource not found in temporary cache reading it from informer cache, for Resource ID: ResourceID{name='test', namespace='default'}
2024-03-04 14:32:33,831 DEBUG [io.jav.ope.pro.eve.sou.inf.ManagedInformerEventSource] (pool-14-thread-6) Resource found in cache: true for id: ResourceID{name='test', namespace='default'}
2024-03-04 14:32:33,831 DEBUG [io.jav.ope.pro.eve.ReconciliationDispatcher] (ReconcilerExecutor-mycustomresourcereconciler-139) Handling dispatch for resource test
2024-03-04 14:32:33,831 DEBUG [io.jav.ope.pro.eve.ReconciliationDispatcher] (ReconcilerExecutor-mycustomresourcereconciler-139) Adding finalizer for resource: cd21c403-f3e4-4830-8efe-a597f77155c1 version: 3154
2024-03-04 14:32:33,831 DEBUG [io.jav.ope.pro.eve.ReconciliationDispatcher] (ReconcilerExecutor-mycustomresourcereconciler-139) Conflict retying update for: ResourceID{name='test', namespace='default'}
2024-03-04 14:32:33,832 DEBUG [io.jav.ope.pro.eve.ReconciliationDispatcher] (ReconcilerExecutor-mycustomresourcereconciler-139) Trying to replace resource ResourceID{name='test', namespace='default'}, version: 3154
2024-03-04 14:32:33,840 DEBUG [io.jav.ope.pro.eve.sou.con.ControllerResourceEventSource] (-1660409789-pool-11-thread-4) Event received for resource: ResourceID{name='test', namespace='default'} version: 3155 uuid: cd21c403-f3e4-4830-8efe-a597f77155c1 action: UPDATED
2024-03-04 14:32:33,840 DEBUG [io.jav.ope.pro.eve.EventProcessor] (ReconcilerExecutor-mycustomresourcereconciler-139) Event processing finished. Scope: ExecutionScope{ resource id: ResourceID{name='test', namespace='default'}, version: 3154}, PostExecutionControl: PostExecutionControl{onlyFinalizerHandled=false, updatedCustomResource=CustomResource{kind='MyCustomResource', apiVersion='example.com/v1alpha1', metadata=ObjectMeta(annotations={}, creationTimestamp=2024-03-04T13:32:33Z, deletionGracePeriodSeconds=null, deletionTimestamp=null, finalizers=[mycustomresources.example.com/finalizer], generateName=null, generation=1, labels={}, managedFields=[], name=test, namespace=default, ownerReferences=[], resourceVersion=3155, selfLink=null, uid=cd21c403-f3e4-4830-8efe-a597f77155c1, additionalProperties={}), spec=null, status=null}, runtimeException=null}
2024-03-04 14:32:33,840 DEBUG [io.jav.ope.pro.eve.EventProcessor] (ReconcilerExecutor-mycustomresourcereconciler-139) Cleanup for successful execution for resource: test
2024-03-04 14:32:33,840 DEBUG [io.jav.ope.pro.eve.sou.inf.ManagedInformerEventSource] (ReconcilerExecutor-mycustomresourcereconciler-139) Resource not found in temporary cache reading it from informer cache, for Resource ID: ResourceID{name='test', namespace='default'}
2024-03-04 14:32:33,840 DEBUG [io.jav.ope.pro.eve.sou.con.ControllerResourceEventSource] (-1660409789-pool-11-thread-4) Event Old resource: CustomResource{kind='MyCustomResource', apiVersion='example.com/v1alpha1', metadata=ObjectMeta(annotations={}, creationTimestamp=2024-03-04T13:32:33Z, deletionGracePeriodSeconds=null, deletionTimestamp=null, finalizers=[], generateName=null, generation=1, labels={}, managedFields=[], name=test, namespace=default, ownerReferences=[], resourceVersion=3154, selfLink=null, uid=cd21c403-f3e4-4830-8efe-a597f77155c1, additionalProperties={}), spec=null, status=null},
new resource: CustomResource{kind='MyCustomResource', apiVersion='example.com/v1alpha1', metadata=ObjectMeta(annotations={}, creationTimestamp=2024-03-04T13:32:33Z, deletionGracePeriodSeconds=null, deletionTimestamp=null, finalizers=[mycustomresources.example.com/finalizer], generateName=null, generation=1, labels={}, managedFields=[], name=test, namespace=default, ownerReferences=[], resourceVersion=3155, selfLink=null, uid=cd21c403-f3e4-4830-8efe-a597f77155c1, additionalProperties={}), spec=null, status=null}
2024-03-04 14:32:33,840 DEBUG [io.jav.ope.pro.eve.sou.inf.ManagedInformerEventSource] (ReconcilerExecutor-mycustomresourcereconciler-139) Resource found in cache: true for id: ResourceID{name='test', namespace='default'}
2024-03-04 14:32:33,840 DEBUG [io.jav.ope.pro.eve.EventProcessor] (ReconcilerExecutor-mycustomresourcereconciler-139) Rescheduling event for max reconciliation interval for resource: ResourceID{name='test', namespace='default'} : with delay: 36000000
2024-03-04 14:32:33,840 DEBUG [io.jav.ope.pro.eve.EventProcessor] (-1660409789-pool-11-thread-4) Received event: ResourceEvent{action=UPDATED, associated resource id=ResourceID{name='test', namespace='default'}}
2024-03-04 14:32:33,840 DEBUG [io.jav.ope.pro.eve.EventProcessor] (-1660409789-pool-11-thread-4) Marking event received for: ResourceID{name='test', namespace='default'}
2024-03-04 14:32:33,840 DEBUG [io.jav.ope.pro.eve.sou.inf.ManagedInformerEventSource] (-1660409789-pool-11-thread-4) Resource not found in temporary cache reading it from informer cache, for Resource ID: ResourceID{name='test', namespace='default'}
2024-03-04 14:32:33,841 DEBUG [io.jav.ope.pro.eve.sou.inf.ManagedInformerEventSource] (-1660409789-pool-11-thread-4) Resource found in cache: true for id: ResourceID{name='test', namespace='default'}
2024-03-04 14:32:33,841 DEBUG [io.jav.ope.pro.eve.EventProcessor] (-1660409789-pool-11-thread-4) Executing events for custom resource. Scope: ExecutionScope{resource: null}
2024-03-04 14:32:33,841 DEBUG [io.jav.ope.pro.eve.sou.inf.ManagedInformerEventSource] (pool-14-thread-7) Resource not found in temporary cache reading it from informer cache, for Resource ID: ResourceID{name='test', namespace='default'}
2024-03-04 14:32:33,841 DEBUG [io.jav.ope.pro.eve.sou.inf.ManagedInformerEventSource] (pool-14-thread-7) Resource found in cache: true for id: ResourceID{name='test', namespace='default'}
2024-03-04 14:32:33,841 DEBUG [io.jav.ope.pro.eve.ReconciliationDispatcher] (ReconcilerExecutor-mycustomresourcereconciler-141) Handling dispatch for resource test
2024-03-04 14:32:33,841 DEBUG [io.jav.ope.pro.eve.ReconciliationDispatcher] (ReconcilerExecutor-mycustomresourcereconciler-141) Reconciling resource test with version: 3155 with execution scope: ExecutionScope{ resource id: ResourceID{name='test', namespace='default'}, version: 3155}
2024-03-04 14:32:33,841 DEBUG [io.jav.ope.pro.dep.wor.WorkflowReconcileExecutor] (ReconcilerExecutor-mycustomresourcereconciler-141) Submitting for reconcile: DependentResourceNode{ConfigMapDependentResource@6bff2a08} primaryID: ResourceID{name='test', namespace='default'}
2024-03-04 14:32:33,841 DEBUG [io.jav.ope.pro.dep.wor.WorkflowReconcileExecutor] (ReconcilerExecutor-mycustomresourcereconciler-141) Submitted to reconcile: DependentResourceNode{ConfigMapDependentResource@6bff2a08} primaryID: ResourceID{name='test', namespace='default'}
2024-03-04 14:32:33,841 DEBUG [io.jav.ope.pro.dep.wor.WorkflowReconcileExecutor] (pool-15-thread-8) Reconciling for primary: ResourceID{name='test', namespace='default'} node: DependentResourceNode{ConfigMapDependentResource@6bff2a08}
2024-03-04 14:32:33,841 DEBUG [io.jav.ope.pro.eve.sou.inf.InformerEventSource] (pool-15-thread-8) Using PrimaryToSecondaryIndex to find secondary resources for primary: ResourceID{name='test', namespace='default'}. Found secondary ids: []
2024-03-04 14:32:33,841 DEBUG [io.jav.ope.pro.dep.AbstractDependentResource] (pool-15-thread-8) Creating 'test' ConfigMap for primary ResourceID{name='test', namespace='default'}
2024-03-04 14:32:33,841 DEBUG [io.jav.ope.pro.dep.kub.KubernetesDependentResource] (pool-15-thread-8) Creating target resource with type: class io.fabric8.kubernetes.api.model.ConfigMap, with id: ResourceID{name='test', namespace='default'}
2024-03-04 14:32:33,843 DEBUG [io.jav.ope.pro.eve.sou.con.ControllerResourceEventSource] (-1660409789-pool-11-thread-3) Event received for resource: ResourceID{name='test', namespace='default'} version: 3156 uuid: cd21c403-f3e4-4830-8efe-a597f77155c1 action: DELETED
2024-03-04 14:32:33,843 DEBUG [io.jav.ope.pro.eve.sou.con.ControllerResourceEventSource] (-1660409789-pool-11-thread-3) Event Old resource: null,
new resource: CustomResource{kind='MyCustomResource', apiVersion='example.com/v1alpha1', metadata=ObjectMeta(annotations={}, creationTimestamp=2024-03-04T13:32:33Z, deletionGracePeriodSeconds=null, deletionTimestamp=null, finalizers=[mycustomresources.example.com/finalizer], generateName=null, generation=1, labels={}, managedFields=[], name=test, namespace=default, ownerReferences=[], resourceVersion=3156, selfLink=null, uid=cd21c403-f3e4-4830-8efe-a597f77155c1, additionalProperties={}), spec=null, status=null}
2024-03-04 14:32:33,843 DEBUG [io.jav.ope.pro.eve.EventProcessor] (-1660409789-pool-11-thread-3) Received event: ResourceEvent{action=DELETED, associated resource id=ResourceID{name='test', namespace='default'}}
2024-03-04 14:32:33,843 DEBUG [io.jav.ope.pro.eve.EventProcessor] (-1660409789-pool-11-thread-3) Marking delete event received for: ResourceID{name='test', namespace='default'}
2024-03-04 14:32:33,843 DEBUG [io.jav.ope.pro.eve.EventProcessor] (-1660409789-pool-11-thread-3) Cleaning up for delete event for: ResourceID{name='test', namespace='default'}
2024-03-04 14:32:33,844 DEBUG [io.jav.ope.pro.eve.sou.inf.InformerEventSource] (-1660409789-pool-11-thread-4) On add event received for resource id: ResourceID{name='test', namespace='default'} type: ConfigMap version: 3157
2024-03-04 14:32:33,845 DEBUG [io.jav.ope.pro.eve.sou.inf.InformerEventSource] (-1660409789-pool-11-thread-4) Skipping event propagation for ADD, since was a result of a reconcile action. Resource ID: ResourceID{name='test', namespace='default'}
2024-03-04 14:32:33,845 DEBUG [io.jav.ope.pro.eve.sou.inf.ManagedInformerEventSource] (pool-15-thread-8) Resource not found in temporary cache reading it from informer cache, for Resource ID: ResourceID{name='test', namespace='default'}
2024-03-04 14:32:33,845 DEBUG [io.jav.ope.pro.eve.sou.inf.ManagedInformerEventSource] (pool-15-thread-8) Resource found in cache: true for id: ResourceID{name='test', namespace='default'}
2024-03-04 14:32:33,845 DEBUG [io.jav.ope.pro.dep.wor.WorkflowReconcileExecutor] (pool-15-thread-8) Setting already reconciled for: DependentResourceNode{ConfigMapDependentResource@6bff2a08} primaryID: ResourceID{name='test', namespace='default'}
2024-03-04 14:32:33,845 INFO [MyCustomResourceReconciler] (ReconcilerExecutor-mycustomresourcereconciler-141) Reconciled. CustomResource{kind='MyCustomResource', apiVersion='example.com/v1alpha1', metadata=ObjectMeta(annotations={}, creationTimestamp=2024-03-04T13:32:33Z, deletionGracePeriodSeconds=null, deletionTimestamp=null, finalizers=[mycustomresources.example.com/finalizer], generateName=null, generation=1, labels={}, managedFields=[], name=test, namespace=default, ownerReferences=[], resourceVersion=3155, selfLink=null, uid=cd21c403-f3e4-4830-8efe-a597f77155c1, additionalProperties={}), spec=null, status=null}
2024-03-04 14:32:33,845 DEBUG [io.jav.ope.pro.eve.EventProcessor] (ReconcilerExecutor-mycustomresourcereconciler-141) Event processing finished. Scope: ExecutionScope{ resource id: ResourceID{name='test', namespace='default'}, version: 3155}, PostExecutionControl: PostExecutionControl{onlyFinalizerHandled=false, updatedCustomResource=null, runtimeException=null}
2024-03-04 14:32:33,845 DEBUG [io.jav.ope.pro.eve.EventProcessor] (ReconcilerExecutor-mycustomresourcereconciler-141) Cleanup for successful execution for resource: test
2024-03-04 14:32:33,845 DEBUG [io.jav.ope.pro.eve.EventProcessor] (ReconcilerExecutor-mycustomresourcereconciler-141) Rescheduling
Why the test case fails:
Delete events
has special meaning, it cleans up all kind of caches. So the cleaner part not even executed.meanwhile a Delete event received (for a resource with a finalizer π±)
Just to make sure, are there any labels or other filters that the informer is based off of? If that's the case you will see a delete event when any of that no longer matches, not just when the resource no longer exists.
No there are no labels involved. This seems to be a race condition with the API server. If I add in a test waiting while the finalizer is actually added and just then call the delete, it does not fail:
client.resource(resource).create()
await().untilAsserted{
val res = client.resources(MyCustomResource::class.java)
.inNamespace("default")
.withName("test")
.get();
assertEquals(res.metadata.finalizers.size,1)
}
Thread.sleep(5)
// delete the primary, and wait until it is deleted
client.resources(MyCustomResource::class.java)
.inNamespace("default")
.withName("test")
.withTimeout(5, TimeUnit.SECONDS)
.delete()
@AleksanderBrzozowski
Btw AFAIK Thread.sleep(5)
maybe yields but does not wait 5ms - in general sleep under 15ms does not wait.
asked also for clarification on k8s slack: https://kubernetes.slack.com/archives/C0EG7JC6T/p1709561696968339
Bug Report
What did you do?
We have a primary resource, let's call it
MyCustomResource
. The reconciler for this resource creates and maintains aConfigMap
based on theMyCustomResource
. TheConfigMap
is created in a different namespace thanMyCustomResource
, thus we cannot use theownerReference
mechanism to ensure that the resource is deleted when the primary resource is deleted. To do so, we use JOSDK annotations (primary-name
andprimary-namespace
).The issue is that sometimes when the primary resource is deleted very quickly - right after it was created, the secondary resource is not created:
So it seems that the secondary resource is created, even though the primary resource has no finalizer added.
This issue was reproduced with a sample repo. There is a repeated test that fails after several test loops - it is non deterministic.
What did you expect to see?
There are no Orphan secondary resources left, when the Primary is deleted.
What did you see instead? Under which circumstances?
There is an Orphan secondary resource left, that will be never deleted.
Environment
Kubernetes cluster type:
rancher desktop, version: 1.25
$ Mention java-operator-sdk version from pom.xml file
4.8.0
$ java -version
Java Temurin 17.0.6
$ kubectl version
kubectl version doesn't seem to be related, under the hood JOSDK is using fabric8 kubernetes client with version
6.10.0
Possible Solution
Make sure that the finalizer is applied first before creating secondary resources