Closed fhalde closed 2 months ago
Any suggestions on how to even debug this? Thanks!
This is happening on every reconciliation loop so it's not like the system is reaching a stable state
Eventually the operator gives up
Exhausted retries for scope
@fhalde this seems to happen during the create operation, are you sure you don't use some label selectors? So the informer is not selecting the target resource?
Best would be if you could create a minimalistic reproducer, I will take a look. You can debug in dependent resource create function: https://github.com/operator-framework/java-operator-sdk/blob/a08d725a671136efe7270aab1e368f7155c10338/operator-framework-core/src/main/java/io/javaoperatorsdk/operator/processing/dependent/kubernetes/KubernetesDependentResource.java#L133-L133
Note that there there is a flag for dependents: https://github.com/operator-framework/java-operator-sdk/blob/a08d725a671136efe7270aab1e368f7155c10338/operator-framework-core/src/main/java/io/javaoperatorsdk/operator/processing/dependent/kubernetes/KubernetesDependent.java#L75-L75
What is by default true
, that will create the resource only if not exists. This will probably make the problem go away, but it is still an issue why is it not in the cache, what is strange.
Note that this flag is true
by default, since when there is a depedent resource what is a creator
it kinda implies that it should really create the resource, not to compete on creation with other controllers. But this might be a too strong assumption and might be false.
However it can happen that create is called and the resource already created only in case that these are happening very close in time, thus other controller created the resource, and it is not in you controllers cache yet.
Again, best is to create a simple reproducer, so we can check.
Hi @csviri , our operator has reached to a state where a simple reproducer will be very hard to prepare unfortunately. It has a lot of our internal org specific logic
we use label selectors each of our dependent is annotated with
@KubernetesDependent(labelSelector = "app.kubernetes.io/managed-by=platform-reconciler")
and we add these labels to the resources it creates. I can see the labels on the resources
Do label selectors work if a value was not specified? i.e. @KubernetesDependent(labelSelector = "app.kubernetes.io/managed"). When adding labels, we set the value to "true"
Can you try it without the label selector?
@KubernetesDependent(labelSelector = "app.kubernetes.io/managed"). When adding labels, we set the value to "true"
yes, that should be fine AFAIK
ok let me try. first, i tried simply restarting the operator and that didn't help. 409 conflicts are still happening. i'll now remove the label selector and see
some context
public class ConnectorBulkDependentResource extends CRUDKubernetesDependentResource<KafkaConnector, CR> implements BulkDependentResource<KafkaConnector, CR>
I restarted the operator thrice now and it always seem to have a problem with the same dependent object. We end up creating 20 KafkaConnectors using the BulkDependentResource abstraction and it only has a 409 problem with just 1 of the 20 dependents, every time it seems to be the same object
Even after removing the label selector, the 409 conflict is happening. hmm
Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: PATCH at: https://x.x.x.x:443/apis/kafka.strimzi.io/v1beta2/namespaces/platforms/kafkaconnectors/indigo-s19?fieldManager=source-connector-reconciler&force=true. Message: Operation cannot be fulfilled on kafkaconnectors.kafka.strimzi.io "indigo-s19": the object has been modified; please apply your changes to the latest version and try again. Received status: Status(apiVersion=v1, code=409, details=StatusDetails(causes=[], group=kafka.strimzi.io, kind=kafkaconnectors, name=indigo-s19, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=Operation cannot be fulfilled on kafkaconnectors.kafka.strimzi.io "indigo-s19": the object has been modified; please apply your changes to the latest version and try again, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=Conflict, status=Failure, additionalProperties={}).
once again, the same "indigo-s19" object as i mentioned
i simply deleted "indigo-s19" object and now the conflict is happening on "indigo-s8" object. there were no logs related to conflicts before on "indigo-s8".
pls put here the debug logs of JOSDK, but without a reproducer might be hard to find the issue. We most need that you create one, ideally on simple objects.
not sure what debug logs would you be interested in, but here you go. it's filtered out for the indigo-s19 object that is having conflicts at the moment. this is right after the startup
2024-04-24 11:11:45,010 i.j.o.p.e.s.i.InformerEventSource [DEBUG] On add event received for resource id: ResourceID{name='indigo-s19', namespace='xyz'} type: KafkaConnector version: 35085376
2024-04-24 11:11:45,010 i.j.o.p.e.s.i.InformerEventSource [DEBUG] Propagating event for ADD, resource with same version not result of a reconciliation. Resource ID: ResourceID{name='indigo-s19', namespace='xyz'}
2024-04-24 11:11:45,400 i.j.o.p.e.s.i.InformerEventSource [DEBUG] On add event received for resource id: ResourceID{name='indigo-s19', namespace='xyz'} type: KafkaConnector version: 35085376
2024-04-24 11:11:45,401 i.j.o.p.e.s.i.InformerEventSource [DEBUG] Propagating event for ADD, resource with same version not result of a reconciliation. Resource ID: ResourceID{name='indigo-s19', namespace='xyz'}
2024-04-24 11:11:47,219 i.j.o.p.e.s.i.InformerEventSource [DEBUG] Using PrimaryToSecondaryIndex to find secondary resources for primary: ResourceID{name='indigo', namespace='xyz'}. Found secondary ids: [ResourceID{name='indigo-s11', namespace='xyz'}, ResourceID{name='indigo-s10', namespace='xyz'}, ResourceID{name='indigo-s9', namespace='xyz'}, ResourceID{name='indigo-s4', namespace='xyz'}, ResourceID{name='indigo-s3', namespace='xyz'}, ResourceID{name='indigo-s2', namespace='xyz'}, ResourceID{name='indigo-s1', namespace='xyz'}, ResourceID{name='indigo-s8', namespace='xyz'}, ResourceID{name='indigo-s7', namespace='xyz'}, ResourceID{name='indigo-s6', namespace='xyz'}, ResourceID{name='indigo-s5', namespace='xyz'}, ResourceID{name='indigo-s0', namespace='xyz'}, ResourceID{name='indigo-s17', namespace='xyz'}, ResourceID{name='indigo-s16', namespace='xyz'}, ResourceID{name='indigo-s19', namespace='xyz'}, ResourceID{name='indigo-s18', namespace='xyz'}, ResourceID{name='indigo-s13', namespace='xyz'}, ResourceID{name='indigo-s12', namespace='xyz'}, ResourceID{name='indigo-s15', namespace='xyz'}, ResourceID{name='indigo-s14', namespace='xyz'}]
2024-04-24 11:11:47,282 i.j.o.p.e.s.i.ManagedInformerEventSource [DEBUG] Resource not found in temporary cache reading it from informer cache, for Resource ID: ResourceID{name='indigo-s19', namespace='xyz'}
2024-04-24 11:11:47,283 i.j.o.p.e.s.i.ManagedInformerEventSource [DEBUG] Resource found in cache: true for id: ResourceID{name='indigo-s19', namespace='xyz'}
2024-04-24 11:11:47,288 i.j.o.p.d.AbstractDependentResource [DEBUG] Creating 'indigo-s19' KafkaConnector for primary ResourceID{name='indigo', namespace='xyz'}
2024-04-24 11:11:47,288 i.j.o.p.d.k.KubernetesDependentResource [DEBUG] Creating target resource with type: class io.strimzi.api.kafka.model.connector.KafkaConnector, with id: ResourceID{name='indigo-s19', namespace='xyz'}
- nu.avalanche.operator.dbsourceconnector.dependentresources.ConnectorBulkDependentResource -> io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: PATCH at: https://x.x.x.x:443/apis/kafka.strimzi.io/v1beta2/namespaces/xyz/kafkaconnectors/indigo-s19?fieldManager=db-source-connector-reconciler&force=true. Message: Operation cannot be fulfilled on kafkaconnectors.kafka.strimzi.io "indigo-s19": the object has been modified; please apply your changes to the latest version and try again. Received status: Status(apiVersion=v1, code=409, details=StatusDetails(causes=[], group=kafka.strimzi.io, kind=kafkaconnectors, name=indigo-s19, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=Operation cannot be fulfilled on kafkaconnectors.kafka.strimzi.io "indigo-s19": the object has been modified; please apply your changes to the latest version and try again, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=Conflict, status=Failure, additionalProperties={}).
The logs are pretty much the same for other indigo-$NUMBER child objects except for the part where it tries to create. it only tries to create the indigo-s19 one for some reason
2024-04-24 11:11:47,288 i.j.o.p.d.k.KubernetesDependentResource [DEBUG] Creating target resource with type: class io.strimzi.api.kafka.model.connector.KafkaConnector, with id: ResourceID{name='indigo-s19', namespace='xyz'}
^ this log is not present for other child objects
btw, what is the caching key the fabric8 informer uses to cache objects?
i had the chance of attaching a debugger on https://github.com/operator-framework/java-operator-sdk/blob/3c863793fdd2341d434bdfcfd6216b4023554a7b/operator-framework-core/src/main/java/io/javaoperatorsdk/operator/processing/dependent/BulkDependentResourceReconciler.java#L26
final var desiredResources = bulkDependentResource.desiredResources(primary, context);
had 20 entries as expected
{indigo-s19=KafkaConnector(additionalProperties={}, apiVersion=null, kind=null), indigo-s9=KafkaConnector(additionalProperties={}, apiVersion=null, kind=null), indigo-s8=KafkaConnector(additionalProperties={}, apiVersion=null, kind=null), indigo-s5=KafkaConnector(additionalProperties={}, apiVersion=null, kind=null), indigo-s4=KafkaConnector(additionalProperties={}, apiVersion=null, kind=null), indigo-s7=KafkaConnector(additionalProperties={}, apiVersion=null, kind=null), indigo-s6=KafkaConnector(additionalProperties={}, apiVersion=null, kind=null), indigo-s1=KafkaConnector(additionalProperties={}, apiVersion=null, kind=null), indigo-s0=KafkaConnector(additionalProperties={}, apiVersion=null, kind=null), indigo-s3=KafkaConnector(additionalProperties={}, apiVersion=null, kind=null), indigo-s2=KafkaConnector(additionalProperties={}, apiVersion=null, kind=null), indigo-s10=KafkaConnector(additionalProperties={}, apiVersion=null, kind=null), indigo-s11=KafkaConnector(additionalProperties={}, apiVersion=null, kind=null), indigo-s12=KafkaConnector(additionalProperties={}, apiVersion=null, kind=null), indigo-s13=KafkaConnector(additionalProperties={}, apiVersion=null, kind=null), indigo-s14=KafkaConnector(additionalProperties={}, apiVersion=null, kind=null), indigo-s15=KafkaConnector(additionalProperties={}, apiVersion=null, kind=null), indigo-s16=KafkaConnector(additionalProperties={}, apiVersion=null, kind=null), indigo-s17=KafkaConnector(additionalProperties={}, apiVersion=null, kind=null), indigo-s18=KafkaConnector(additionalProperties={}, apiVersion=null, kind=null)}
{indigo-s11=KafkaConnector(additionalProperties={}, apiVersion=null, kind=null)}
whereas
Map<String, R> actualResources = bulkDependentResource.getSecondaryResources(primary, context);
just had 1 entry in the map
indigo-s19 KafkaConnector(additionalProperties={}, apiVersion=null, kind=null)
idk why the apiversion and kind is null though
understanding why
Map<String, R> actualResources = bulkDependentResource.getSecondaryResources(primary, context);
just has 1 entry would explain this behaviour IMO
Yes, that seems to be right, can you debug it, why no other resources are there / returned ? are you overriding that method?
yes we are overriding that method. i think we have to implement that method because intellij is complaining otherwise this is how we've implemented it though
@Override
public Map<String, KafkaConnector> getSecondaryResources(
DatomicSource primary, Context<DatomicSource> context) {
return context
.getSecondaryResourcesAsStream(KafkaConnector.class)
.collect(
Collectors.toMap(ConnectorBulkDependentResource::resourceName, Function.identity()));
}
ConnectorBulkDependentResource::resourceName = resource.getMetadata().getName()
we ultimately make use of the context.getSecondaryResourcesAsStream so this shouldn't be a problem. will continue, do you see any problems?
i can confirm it is not our implementation that is the problem. context.getSecondaryResources
or context.getSecondaryResourcesAsStream
itself is returning a single entry. weird
So that is working same as in any cases, based on owner reference by default the resources are selected from the cache. Can you check if the resources you create are actually in informer cache?
how can i check that? just so you know this is some of the logs
2024-04-24 11:11:47,219 i.j.o.p.e.s.i.InformerEventSource [DEBUG] Using PrimaryToSecondaryIndex to find secondary resources for primary: ResourceID{name='indigo', namespace='xyz'}. Found secondary ids: [ResourceID{name='indigo-s11', namespace='xyz'}, ResourceID{name='indigo-s10', namespace='xyz'}, ResourceID{name='indigo-s9', namespace='xyz'}, ResourceID{name='indigo-s4', namespace='xyz'}, ResourceID{name='indigo-s3', namespace='xyz'}, ResourceID{name='indigo-s2', namespace='xyz'}, ResourceID{name='indigo-s1', namespace='xyz'}, ResourceID{name='indigo-s8', namespace='xyz'}, ResourceID{name='indigo-s7', namespace='xyz'}, ResourceID{name='indigo-s6', namespace='xyz'}, ResourceID{name='indigo-s5', namespace='xyz'}, ResourceID{name='indigo-s0', namespace='xyz'}, ResourceID{name='indigo-s17', namespace='xyz'}, ResourceID{name='indigo-s16', namespace='xyz'}, ResourceID{name='indigo-s19', namespace='xyz'}, ResourceID{name='indigo-s18', namespace='xyz'}, ResourceID{name='indigo-s13', namespace='xyz'}, ResourceID{name='indigo-s12', namespace='xyz'}, ResourceID{name='indigo-s15', namespace='xyz'}, ResourceID{name='indigo-s14', namespace='xyz'}]
2024-04-24 11:11:47,282 i.j.o.p.e.s.i.ManagedInformerEventSource [DEBUG] Resource not found in temporary cache reading it from informer cache, for Resource ID: ResourceID{name='indigo-s19', namespace='xyz'}
2024-04-24 11:11:47,283 i.j.o.p.e.s.i.ManagedInformerEventSource [DEBUG] Resource found in cache: true for id: ResourceID{name='indigo-s19', namespace='xyz'}
Easyies is to set a debug into create in dependent resources, then
context.controller.eventSourceManager.eventSources
- here find you informer. In Informer there is itemStore insides where is the cache.
Can you show also how the primaryToSecondary mapper I implemented?
Also you can just debug why the getSecondayResources
call, and see why the others are not returned.
we don't implement the primaryToSecondary
mapper for this reconciler. what's different about bulk dependent? I also didn't come across anything in the docs
do you know why these nulls exist?
indigo-s19 KafkaConnector(additionalProperties={}, apiVersion=null, kind=null)
KafkaConnector is a custom resource we don't own. do we have to make kubernetes client aware of this object kind?
Ahh sorry my mistake, misread it
how can i check that? just so you know this is some of the logs
2024-04-24 11:11:47,219 i.j.o.p.e.s.i.InformerEventSource [DEBUG] Using PrimaryToSecondaryIndex to find secondary resources for primary: ResourceID{name='indigo', namespace='xyz'}. Found secondary ids: [ResourceID{name='indigo-s11', namespace='xyz'}, ResourceID{name='indigo-s10', namespace='xyz'}, ResourceID{name='indigo-s9', namespace='xyz'}, ResourceID{name='indigo-s4', namespace='xyz'}, ResourceID{name='indigo-s3', namespace='xyz'}, ResourceID{name='indigo-s2', namespace='xyz'}, ResourceID{name='indigo-s1', namespace='xyz'}, ResourceID{name='indigo-s8', namespace='xyz'}, ResourceID{name='indigo-s7', namespace='xyz'}, ResourceID{name='indigo-s6', namespace='xyz'}, ResourceID{name='indigo-s5', namespace='xyz'}, ResourceID{name='indigo-s0', namespace='xyz'}, ResourceID{name='indigo-s17', namespace='xyz'}, ResourceID{name='indigo-s16', namespace='xyz'}, ResourceID{name='indigo-s19', namespace='xyz'}, ResourceID{name='indigo-s18', namespace='xyz'}, ResourceID{name='indigo-s13', namespace='xyz'}, ResourceID{name='indigo-s12', namespace='xyz'}, ResourceID{name='indigo-s15', namespace='xyz'}, ResourceID{name='indigo-s14', namespace='xyz'}] 2024-04-24 11:11:47,282 i.j.o.p.e.s.i.ManagedInformerEventSource [DEBUG] Resource not found in temporary cache reading it from informer cache, for Resource ID: ResourceID{name='indigo-s19', namespace='xyz'} 2024-04-24 11:11:47,283 i.j.o.p.e.s.i.ManagedInformerEventSource [DEBUG] Resource found in cache: true for id: ResourceID{name='indigo-s19', namespace='xyz'}
It seems that the primary to secondary index actually finds the resources, that is very weird.
Pls create a simple reproducer, or try to debug it, it is very hard to guess anything based on the info we have here.
do you know why these nulls exist?
indigo-s19 KafkaConnector(additionalProperties={}, apiVersion=null, kind=null)
expected? KafkaConnector is a custom resource we don't own. do we have to make kubernetes client aware of this object kind?
That is not expected at all.
Just so you know, KafkaConnector is not a CRD we own. We have a dependency on io.strimzi:api
library that contains the CustomResource
definition. It's an external CRD hence i was trying to understand if the k8s client needs to be made aware of it
What's the recommendation when we have to work with other CRD's that are not from the core k8s apiGroup?
There is a CRD -> JAVA generator in fabric8 if you don't have a class representation of the CRD. You can use that to generate the classes. If that would not work some reason, GenericKubernetesResource
can be used for any resources, not JOSDK supports that also nicely.
Ok, reproduced. It seems to be happening for external crd's only. In this case KafkaConnector
https://github.com/fhalde/operator-pg
The same bulk dependent worked fine when the bulk resource was changed to ConfigMap ( a k8s native object )
@csviri
Ahh, good, so if this is an issue with the generator, we should address it in fabric8 client. Are you generating classes from CRD?
could you explain what seems to be the problem? like i said, we don't own those CRD & classes. Strimzi has a maven artifact published which contains their CustomResource classes and we just imported them. They may have used the fabric8 generator to generate those classes
We can try to generate classes from their CRD directly, but could you first explain to me what was going on? And how could this be fixed? or alternatives?
@csviri
No idea, this is more for fabric8 client, pls create an issue there. https://github.com/fabric8io/kubernetes-client
But this should work since strimzi also uses fabric8 client. (also uses JOSDK on sub-operators)
Unexpected null values in Kubernetes objects very often indicate an issue with deserialization. Maybe your KafkaConnector
objects are not getting properly deserialized for some reason?
Yes, that will the problem, strange that it happens on classes that are already using fabric8, maybe there are some customizations in Strimzi.
cc @manusa @shawkins @rohanKanojia maybe you know something about this
This might be a consequence of different versions of the fabric8 client being used…
Thanks, we are already attempting to generate the classes ourselves from their CRD's 👍
Bug Report
What did you do?
Recently we reduced our operator reconciliation interval to 5minutes
What did you expect to see?
No conflicts because the dependents have not changed at all. I tried to keep track of the resource versions of my resources and when the 409 logs occur. i didn't see any bumps in the resource version.
What did you see instead? Under which circumstances?
Environment
Kubernetes cluster type: EKS
$ Mention java-operator-sdk version from pom.xml file
4.8.2 fabric8 6.11.0$ java -version
openjdk version "21.0.2" 2024-01-16 LTS$ kubectl version
Client Version: v1.29.0 Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3 Server Version: v1.29.1-eks-b9c9ed7Possible Solution
Additional context
We haven't tuned the operator configuration in any way ( everything is as default e.g. useSSA e.t.c )