@andreaTP given the type-safe nature of CRDs this should basically never happen in general. Any validations beyond the OpenAPI validation can be handled with admission hooks. So Kubernetes already provides all the tooling to make sure such a thing never happens, and that is how it should be set up on the cluster.
I assume that this happens when there are multiple resource versions and no conversion hooks in place. I think we had an issue about how to deal with this using labels.
Consider also the situation where an Operator manages the CRs for the whole cluster. Let's say there are no proper conversion hooks and/or validation in place, and such an error happens in one namespace because the owner of that namespace manages to create a CR that the operator is not able to handle. The operator should still be able to manage the other custom resources on the cluster (for different namespaces / teams or any other custom resource). So the operator should not stop working in general if such an error happens.
In general I think this is rather a bug in the setup around the operator than an issue with the operator itself. So again, ideally an operator should never see this.
But if such an error is present, IMO there should be proper log aggregation on the cluster (think ELK) and related alerting that notifies the platform engineer of such an error. So I'm not even sure about the status update, since this is not a problem with the reconciliation. I'm also not sure how we would do an update on a POJO if it cannot be deserialized.
I see the value of notifying the users through the status in this case too. But this unfortunately happens outside of the reconciliation loop, so handling such an error would basically require a quite specific approach. I will think about that part and see how it can be done.
given the type-safe nature of CRDs this should basically never happen in general.
Correct, the situation described happens in case of "bugs" or misalignments between implementations (crd-generated CR / Jackson deserialization in this specific case).
I assume that this happens when there are multiple resource versions and no conversion hooks in place.
In this case, the issue is reproducible with a single version of a single CRD.
But this unfortunately happens outside of the reconciliation loop, so handling such an error would basically require a quite specific approach.
I understand this technical limitation, but we can think about triggering a synthetic updateErrorStatus of the Controller when an Exception is thrown by the Informers.
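A very rough sketch of what such a hook could look like from the user's side; note that InformerErrorHandler and onDeserializationError are made-up names for illustration only, not an existing SDK API:

```java
import io.fabric8.kubernetes.api.model.GenericKubernetesResource;

// Hypothetical callback, purely illustrative of the idea discussed above.
public interface InformerErrorHandler {

  /**
   * Called when an informer fails to deserialize a watched resource into the typed POJO.
   * The raw payload is passed along so the handler can still identify the resource
   * (e.g. by parsing it into a GenericKubernetesResource) and react, for example by
   * updating the CR status through the raw API or emitting a Kubernetes Event.
   */
  void onDeserializationError(String rawResourceJson, Exception cause);
}
```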
I do believe that, for production-grade Operators, we should be able to somehow show (and propagate to the user logic) the fact that something is going wrong, and avoid swallowing and silently ignoring the exceptions.
I do believe that, for production-grade Operators, we should be able to somehow show (and propagate to the user logic) the fact that something is going wrong, and avoid swallowing and silently ignoring the exceptions.
This is what I meant by this:
But if such an error is present, IMO there should be proper log aggregation on the cluster (think ELK) and related alerting that notifies the platform engineer of such an error.
In my experience this is what you have anyway on clusters, or should have. Again, if there is an issue with de-serialization / serialization, even the updates might not work in general using the POJOs; maybe with the raw API and patching. But that again would probably need a specific error handling mechanism for this case.
Correct, the situation described happens in case of "bugs" or misalignments between implementations (crd-generated CR / Jackson deserialization in this specific case).
Could you please create an issue for the fabric8 client? If there is a bug in the generator, it should be fixed there.
But anyways thx for this bug report!!
It's definitely worth discussing whether we should handle such errors or not, and if so, how. I will think about it and try to come up with a solution - probably, as mentioned, with the raw API.
Would be good to see others' opinions @jmrodri @metacosm.
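For reference, a minimal sketch of what a status update through the fabric8 raw (untyped) API could look like, assuming a fabric8 client 6.x; the group/version, kind, namespace and resource name below are placeholders:

```java
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;
import io.fabric8.kubernetes.client.dsl.base.PatchContext;
import io.fabric8.kubernetes.client.dsl.base.PatchType;

public class RawStatusPatch {

  public static void main(String[] args) {
    try (KubernetesClient client = new KubernetesClientBuilder().build()) {
      // Placeholder coordinates of the custom resource that failed to deserialize.
      String apiVersion = "example.com/v1";
      String kind = "MyCustomResource";
      String namespace = "team-a";
      String name = "broken-cr";

      // JSON merge patch that only touches the status field; no typed POJO is involved,
      // so the deserialization problem is sidestepped. Note: if the CRD has the status
      // subresource enabled, the patch would have to target the status subresource instead.
      String patch =
          "{\"status\":{\"error\":\"resource could not be deserialized by the operator\"}}";

      client.genericKubernetesResources(apiVersion, kind)
          .inNamespace(namespace)
          .withName(name)
          .patch(PatchContext.of(PatchType.JSON_MERGE), patch);
    }
  }
}
```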
But if such an error is present, IMO there should be proper log aggregation on the cluster (think ELK) and related alerting that notifies the platform engineer of such an error.
I understand, still, not exposing any kind of evident issue makes the problem hard to identify and debug (e.g. when a user reports this kind of issue). Another possible idea might be to leverage Kubernetes Events?
Could you please create an issue for the fabric8 client? If there is a bug in the generator, it should be fixed there.
I would say that we can refer to this: https://github.com/fabric8io/kubernetes-client/issues/3681
Happy to hear more feedback / opinions!
I'm not sure what the proper solution is in this case but I'm definitely against crashing the operator because that would leave the door open for malicious actors to craft invalid custom resources to take down the operator.
Another possible idea might be to leverage Kubernetes Events?
What do you mean by that?
I'm definitely against crashing the operator
Fair, but we should find a way to notify that a problem occurred IMHO.
What do you mean by that?
We can emit an event, possibly on the user CR or, worst case, on the operator Deployment itself, containing the relevant information. This way the issue will be easier to spot using commands like kubectl get events instead of having to eyeball the logs.
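As an illustration (not current SDK behaviour), a sketch of emitting a core/v1 Event on the affected CR with the fabric8 client (assuming 6.x); the namespace, names and reason are placeholders:

```java
import io.fabric8.kubernetes.api.model.Event;
import io.fabric8.kubernetes.api.model.EventBuilder;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;

public class DeserializationEventEmitter {

  public static void main(String[] args) {
    try (KubernetesClient client = new KubernetesClientBuilder().build()) {
      // Placeholder coordinates of the custom resource that could not be deserialized.
      String namespace = "team-a";
      String crName = "broken-cr";

      Event event = new EventBuilder()
          .withNewMetadata()
            .withGenerateName("deserialization-error-")
            .withNamespace(namespace)
          .endMetadata()
          // Pointing the event at the CR makes it show up next to it in
          // `kubectl describe` and `kubectl get events`.
          .withNewInvolvedObject()
            .withApiVersion("example.com/v1")
            .withKind("MyCustomResource")
            .withName(crName)
            .withNamespace(namespace)
          .endInvolvedObject()
          .withType("Warning")
          .withReason("DeserializationFailed")
          .withMessage("The operator could not deserialize this custom resource")
          .build();

      client.v1().events().inNamespace(namespace).resource(event).create();
    }
  }
}
```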
What do you mean by that?
We can emit an event, possibly on the user CR or, worst case, on the operator Deployment itself, containing the relevant information. This way the issue will be easier to spot using commands like kubectl get events instead of having to eyeball the logs.
That's an interesting idea. I've never used events so I don't have experience with how they're used. However, they do seem short-lived, so they may be more easily missed than log inspection or alerting via monitoring?
Ideally, events are also persisted. But they are usually used to propagate information about the cluster state. Typically, if a pod cannot start for some reason there are no logs, so an event is created, for example. Or more information about nodes, the kube-proxy, etc. I would say using them here would be rather a mis-use, but I don't have a very strong opinion :)
Sorry for the late reply,
@csviri do you have any link regarding the usage of events solely for "cluster state" events? Super interested in understanding this more!
For this specific case I think that this is a decent UX:
an event is emitted on the CR marking it as "failed" (or something along those lines)
In this way, people checking the CR itself will have the information about why the status is not getting updated.
@csviri do you have any link regarding the usage of events solely for "cluster state" events? Super interested in understanding this more!
I think there is no single best resource or definition, but see for example here: https://www.cncf.io/blog/2020/12/10/the-top-kubernetes-apis-for-cloud-native-observability-part-1-the-kubernetes-metrics-service-container-apis-3
So I agree that this is useful to support. The problem is that if we are not able to deserialize, we don't even know the resource ID (name + namespace). But I'm pretty sure there is a way around this too.
@csviri instantiating an "untyped" (e.g. using GenericKubernetesResource) Informer might be one way of doing it.
yes, that is one way to approach it.
Why an informer, though? Couldn't we just deserialise the failed CR with GenericKubernetesResource?
The failed CR doesn't reach back to the "user" code when an exception is thrown.
How would a generic informer work, though? Would that mean having a constantly running informer watching all the resources?
I think what @andreaTP means is that, when an error occurs during de-serialization of a resource, we could try to de-serialize it to GenericKubernetesResource. And the error handler could work with that from that point.
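A small sketch of that fallback, assuming the raw JSON payload of the failing resource is available; the sample payload and names are made up:

```java
import io.fabric8.kubernetes.api.model.GenericKubernetesResource;
import io.fabric8.kubernetes.client.utils.Serialization;

public class GenericFallback {

  /**
   * When typed deserialization fails, try to parse the raw payload into the untyped
   * GenericKubernetesResource so the resource can at least be identified.
   * Returns null if even the generic parsing fails.
   */
  static GenericKubernetesResource tryGenericFallback(String rawResourceJson) {
    try {
      return Serialization.unmarshal(rawResourceJson, GenericKubernetesResource.class);
    } catch (RuntimeException e) {
      return null;
    }
  }

  public static void main(String[] args) {
    // Made-up payload with a field the typed POJO would not recognize.
    String raw = "{\"apiVersion\":\"example.com/v1\",\"kind\":\"MyCustomResource\","
        + "\"metadata\":{\"name\":\"broken-cr\",\"namespace\":\"team-a\"},"
        + "\"spec\":{\"unrecognizedField\":42}}";

    GenericKubernetesResource generic = tryGenericFallback(raw);
    if (generic != null) {
      // Even though the typed POJO could not be built, we now know which resource failed.
      System.out.println(generic.getMetadata().getNamespace() + "/"
          + generic.getMetadata().getName());
    }
  }
}
```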
I think what @andreaTP means is that, when an error occurs during de-serialization of a resource, we could try to de-serialize it to GenericKubernetesResource. And the error handler could work with that from that point.
That's what I meant by:
Couldn't we just deserialise the failed CR with GenericKubernetesResource?
Though I guess I'm not sure how that would work because, indeed, we don't have access to the deserialisation that the informer does.
Implementation-wise there might be a few challenges; last time I picked up something along those lines I ended up with this: https://github.com/fabric8io/kubernetes-client/pull/3786
But, at the moment, using that mechanism would require instantiating 2 informers per resource.
Implementation-wise there might be a few challenges; last time I picked up something along those lines I ended up with this: fabric8io/kubernetes-client#3786
But, at the moment, using that mechanism would require instantiating 2 informers per resource.
That's what I was afraid of…
This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days.
This is expected to be tackled as part of: #1422
This is expected to be tackled as part of: #1422
It's not, IMO; this is a separate issue. If an informer is not able to de-serialize a resource (probably because of a misconfiguration or missing conversion hooks), it is a completely different problem compared to the case when there is no permission for the resource. In the first case we have the resource at hand to handle (maybe with the raw API); in the other case we don't have any resource at all.
While I agree that we should try to implement this one with a callback, for the other we now have an agreed design for the first iteration: https://github.com/java-operator-sdk/java-operator-sdk/issues/1422#issuecomment-1227355076
I just tried to disable the stale condition.
Bug Report
When the deserialization of the CR fails, the operator should go into an error state (eventually retry the reconcile loop and possibly update the status with the error).
What did you do?
An unrecognized field in a CR will cause the operator to fail deserialization, but the operator stays in a running state.
What did you expect to see?
The Operator would update the error status of the CR or, at a minimum, it should crash since an unhandled exception has been thrown.
What did you see instead? Under which circumstances?
The Operator should at least go into CrashLoopBackoff.
Environment
Kubernetes cluster type:
minikube
$ Mention java-operator-sdk version from pom.xml file
Quarkus SDK 3.0.7
$ java -version
Java 11
Reproduction
and kubectl apply this resource:
Resulting StackTrace:
but the operator is still running.