@andreaTP given the type-safe nature of CRDs this should basically never happen in general. Any validations beyond the OpenAPI validation can be handled with admission hooks. So Kubernetes already provides all the tooling to make sure such a thing never happens, and that is how it should be set up on the cluster.
I assume that this happens when there are multiple resource versions and no conversion hooks in place. I think we had an issue about how to deal with this using labels.
Consider also the situation where an Operator manages the CRs for the whole cluster. Let's say there are no proper conversion hooks and/or validation in place, and such an error happens in one namespace because the owner of that namespace manages to create a CR that the operator is not able to handle. The operator should still be able to manage the other custom resources on the cluster (for different namespaces / teams or any other custom resource). So the operator should not stop working in general if such an error happens.
In general I think this is rather a bug in the setup around the operator than an issue with the operator itself. So again, ideally an operator should never see this.
But if such an error is present, IMO there should be proper log aggregation on the cluster (think ELK) and related alerting that notifies the platform engineer of such an error. So I'm not even sure about the status update, since this is not a problem with the reconciliation. I'm also not sure how we would do an update on a POJO if it cannot be deserialized.
I see the value of notifying the users through the status in this case too. But this unfortunately happens outside of the reconciliation loop, so handling such an error would basically require a quite specific approach. I will think about that part and see how it can be done.
given the type-safe nature of CRDs this should basically never happen in general.
Correct, the situation described happens in case of "bugs" or misalignments between implementations (crd-generated CR / Jackson deserialization in this specific case).
I assume that this happens when there are multiple resource versions and no conversion hooks in place.
In this case, the issue is reproducible with a single version of a single CRD.
But this unfortunately happens outside of the reconciliation loop, so handling such an error would basically require a quite specific approach.
I understand this technical limitation, but we can think about triggering a synthetic updateErrorStatus of the Controller when an Exception is thrown by the Informers.
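A very rough sketch of what such a hook could look like from the user's side; note that InformerErrorHandler and onDeserializationError are made-up names for illustration only, not an existing SDK API:

```java
import io.fabric8.kubernetes.api.model.GenericKubernetesResource;

// Hypothetical callback, purely illustrative of the idea discussed above.
public interface InformerErrorHandler {

  /**
   * Called when an informer fails to deserialize a watched resource into the typed POJO.
   * The raw payload is passed along so the handler can still identify the resource
   * (e.g. by parsing it into a GenericKubernetesResource) and react, for example by
   * updating the CR status through the raw API or emitting a Kubernetes Event.
   */
  void onDeserializationError(String rawResourceJson, Exception cause);
}
```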
I do believe that, for production-grade Operators, we should be able to somehow show (and propagate to the user logic) the fact that something is going wrong, and avoid swallowing and silently ignoring the exceptions.
I do believe that, for production-grade Operators, we should be able to somehow show (and propagate to the user logic) the fact that something is going wrong, and avoid swallowing and silently ignoring the exceptions.
This is what I meant by this:
But if such an error is present, IMO there should be proper log aggregation on the cluster (think ELK) and related alerting that notifies the platform engineer of such an error.
In my experience this is what you have anyway on clusters, or should have. Again, if there is an issue with de-serialization / serialization, even the updates might not work in general using the POJOs; maybe with the raw API and patching. But that again would probably need a specific error handling mechanism for this case.
Correct, the situation described happens in case of "bugs" or misalignments between implementations (crd-generated CR / Jackson deserialization in this specific case).
Could you please create an issue for the fabric8 client? If there is a bug in the generator, it should be fixed there.
But anyways thx for this bug report!!
It's definitely worth discussing whether we should handle such errors or not, and if so, how. I will think about it and try to come up with a solution - probably, as mentioned, with the raw API.
Would be good to see others' opinions @jmrodri @metacosm.
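For reference, a minimal sketch of what a status update through the fabric8 raw (untyped) API could look like, assuming a fabric8 client 6.x; the group/version, kind, namespace and resource name below are placeholders:

```java
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;
import io.fabric8.kubernetes.client.dsl.base.PatchContext;
import io.fabric8.kubernetes.client.dsl.base.PatchType;

public class RawStatusPatch {

  public static void main(String[] args) {
    try (KubernetesClient client = new KubernetesClientBuilder().build()) {
      // Placeholder coordinates of the custom resource that failed to deserialize.
      String apiVersion = "example.com/v1";
      String kind = "MyCustomResource";
      String namespace = "team-a";
      String name = "broken-cr";

      // JSON merge patch that only touches the status field; no typed POJO is involved,
      // so the deserialization problem is sidestepped. Note: if the CRD has the status
      // subresource enabled, the patch would have to target the status subresource instead.
      String patch =
          "{\"status\":{\"error\":\"resource could not be deserialized by the operator\"}}";

      client.genericKubernetesResources(apiVersion, kind)
          .inNamespace(namespace)
          .withName(name)
          .patch(PatchContext.of(PatchType.JSON_MERGE), patch);
    }
  }
}
```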
But if such an error is present, IMO there should be proper log aggregation on the cluster (think ELK) and related alerting that notifies the platform engineer of such an error.
I understand, still, not exposing any kind of evident issue makes the problem hard to identify and debug (e.g. when a user reports this kind of issue). Another possible idea might be to leverage Kubernetes Events?
Could you please create an issue for the fabric8 client? If there is a bug in the generator, it should be fixed there.
I would say that we can refer to this: https://github.com/fabric8io/kubernetes-client/issues/3681
Happy to hear more feedback / opinions!
I'm not sure what the proper solution is in this case but I'm definitely against crashing the operator because that would leave the door open for malicious actors to craft invalid custom resources to take down the operator.
Another possible idea might be to leverage Kubernetes Events?
What do you mean by that?
I'm definitely against crashing the operator
Fair, but we should find a way to notify that a problem occurred IMHO.
What do you mean by that?
We can emit an event, possibly on the user CR or, worst case, on the operator Deployment itself, containing the relevant information. This way the issue will be easier to spot using commands like kubectl get events instead of having to eyeball the logs.
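As an illustration (not current SDK behaviour), a sketch of emitting a core/v1 Event on the affected CR with the fabric8 client (assuming 6.x); the namespace, names and reason are placeholders:

```java
import io.fabric8.kubernetes.api.model.Event;
import io.fabric8.kubernetes.api.model.EventBuilder;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;

public class DeserializationEventEmitter {

  public static void main(String[] args) {
    try (KubernetesClient client = new KubernetesClientBuilder().build()) {
      // Placeholder coordinates of the custom resource that could not be deserialized.
      String namespace = "team-a";
      String crName = "broken-cr";

      Event event = new EventBuilder()
          .withNewMetadata()
            .withGenerateName("deserialization-error-")
            .withNamespace(namespace)
          .endMetadata()
          // Pointing the event at the CR makes it show up next to it in
          // `kubectl describe` and `kubectl get events`.
          .withNewInvolvedObject()
            .withApiVersion("example.com/v1")
            .withKind("MyCustomResource")
            .withName(crName)
            .withNamespace(namespace)
          .endInvolvedObject()
          .withType("Warning")
          .withReason("DeserializationFailed")
          .withMessage("The operator could not deserialize this custom resource")
          .build();

      client.v1().events().inNamespace(namespace).resource(event).create();
    }
  }
}
```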
What do you mean by that?
We can emit an event, possibly on the user CR or, worst case, on the operator Deployment itself, containing the relevant information. This way the issue will be easier to spot using commands like kubectl get events instead of having to eyeball the logs.
That's an interesting idea. I've never used events so I don't have experience with how they're used. However, they do seem short-lived, so they may be more easily missed than log inspection or alerting via monitoring?
Ideally, events are also persisted. But they are usually used to propagate information about the cluster state. Typically, if a pod cannot start for some reason there are no logs, so an event is created, for example. Or more information about nodes, the kube-proxy, etc. I would say using them here would be rather a mis-use, but I don't have a very strong opinion :)
Sorry for the late reply,
@csviri do you have any link regarding the usage of events solely for "cluster state" events? Super interested in understanding this more!
For this specific case I think that this is a decent UX:
an event is emitted on the CR marking it as "failed" (or something along those lines)
In this way, people checking the CR itself will have the information about why the status is not getting updated.
@csviri do you have any link regarding the usage of events solely for "cluster state" events? Super interested in understanding this more!
I think there is no single best resource or definition, but see for example here: https://www.cncf.io/blog/2020/12/10/the-top-kubernetes-apis-for-cloud-native-observability-part-1-the-kubernetes-metrics-service-container-apis-3
So I agree that this is useful to support. The problem is that if we are not able to deserialize, we don't even know the resource ID (name + namespace). But I'm pretty sure there is a way around this too.
@csviri instantiating an "untyped" (e.g. using GenericKubernetesResource) Informer might be one way of doing it.
yes, that is one way to approach it.
Why an informer, though? Couldn't we just deserialise the failed CR with GenericKubernetesResource?
The failed CR doesn't reach back to the "user" code when an exception is thrown.
How would a generic informer work, though? Would that mean having a constantly running informer watching all the resources?
I think what @andreaTP means is that, when an error occurs during de-serialization of a resource, we could try to de-serialize it to GenericKubernetesResource. And the error handler could work with that from that point.
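A small sketch of that fallback, assuming the raw JSON payload of the failing resource is available; the sample payload and names are made up:

```java
import io.fabric8.kubernetes.api.model.GenericKubernetesResource;
import io.fabric8.kubernetes.client.utils.Serialization;

public class GenericFallback {

  /**
   * When typed deserialization fails, try to parse the raw payload into the untyped
   * GenericKubernetesResource so the resource can at least be identified.
   * Returns null if even the generic parsing fails.
   */
  static GenericKubernetesResource tryGenericFallback(String rawResourceJson) {
    try {
      return Serialization.unmarshal(rawResourceJson, GenericKubernetesResource.class);
    } catch (RuntimeException e) {
      return null;
    }
  }

  public static void main(String[] args) {
    // Made-up payload with a field the typed POJO would not recognize.
    String raw = "{\"apiVersion\":\"example.com/v1\",\"kind\":\"MyCustomResource\","
        + "\"metadata\":{\"name\":\"broken-cr\",\"namespace\":\"team-a\"},"
        + "\"spec\":{\"unrecognizedField\":42}}";

    GenericKubernetesResource generic = tryGenericFallback(raw);
    if (generic != null) {
      // Even though the typed POJO could not be built, we now know which resource failed.
      System.out.println(generic.getMetadata().getNamespace() + "/"
          + generic.getMetadata().getName());
    }
  }
}
```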
I think what @andreaTP means is that, when an error occurs during de-serialization of a resource, we could try to de-serialize it to GenericKubernetesResource. And the error handler could work with that from that point.
That's what I meant by:
Couldn't we just deserialise the failed CR with GenericKubernetesResource?
Though I guess I'm not sure how that would work because, indeed, we don't have access to the deserialisation that the informer does.
Implementation-wise there might be a few challenges; last time I picked up something along those lines I ended up with this: https://github.com/fabric8io/kubernetes-client/pull/3786
But, at the moment, using that mechanism would require instantiating 2 informers per resource.
Implementation-wise there might be a few challenges; last time I picked up something along those lines I ended up with this: fabric8io/kubernetes-client#3786
But, at the moment, using that mechanism would require instantiating 2 informers per resource.
That's what I was afraid of…
This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days.
This is expected to be tackled as part of: #1422
This is expected to be tackled as part of: #1422
It's not, IMO; this is a separate issue. If an informer is not able to de-serialize a resource (probably because of a misconfiguration or missing conversion hooks), it is a completely different problem compared to the case when there is no permission for the resource. In the first case we have the resource at hand to handle (maybe with the raw API); in the other case we don't have any resource at all.
While I agree that we should try to implement this one with a callback, for the other we now have an agreed design for the first iteration: https://github.com/java-operator-sdk/java-operator-sdk/issues/1422#issuecomment-1227355076
I just tried to disable the stale condition.
Bug Report
When the deserialization of the CR fails, the operator should go into an error state (eventually retry the reconcile loop and possibly update the status with the error).
What did you do?
An unrecognized field in a CR will cause the operator to fail deserialization, but the operator stays in a running state.
What did you expect to see?
The Operator would update the error status of the CR or, at a minimum, it should crash since an unhandled exception has been thrown.
What did you see instead? Under which circumstances?
The Operator should at least go into CrashLoopBackoff.
Environment
Kubernetes cluster type:
minikube
$ Mention java-operator-sdk version from pom.xml file
Quarkus SDK 3.0.7
$ java -version
Java 11
Reproduction
and kubectl apply this resource:
Resulting StackTrace:
but the operator is still running.