quarkiverse / quarkus-operator-sdk

Quarkus Extension to create Kubernetes Operators in Java using the Java Operator SDK (https://github.com/java-operator-sdk/java-operator-sdk) project
Apache License 2.0
116 stars 50 forks source link

5.0.0 - OperatorHealthCheck race condition? #489

Open bhellema opened 1 year ago

bhellema commented 1 year ago

Since upgrading to 5.0.0 we're noticing different behavior when it comes to starting up our operator. Sometimes we see the following health check status in our logs indicating that Quarkus Operator SDK health check is DOWN.

{"time":"2023-01-19T06:03:38.017Z","class":"io.smallrye.health","level":"INFO","message":"SRHCK01001: Reporting health down status: {\"status\":\"DOWN\",\"checks\":[{\"name\":\"Quarkus Operator SDK health check\",\"status\":\"DOWN\"},{\"name\":\"K8s connected to kubernetes api version v1\",\"status\":\"UP\"},{\"name\":\"Quarkus Operator SDK health check\",\"status\":\"UP\",\"data\":{\"myfirstcontroller\":\"OK\",\"mysecondcontroller\":\"OK\"}},{\"name\":\"K8s connected to kubernetes api version v1\",\"status\":\"UP\"}]}"}

Could there be a race condition here on this line?

@metacosm any chance you have some insight here wrt to this potentially being a race condition? I can have mixed results when starting our operator up.

csviri commented 1 year ago

@bhellema how often you see this ? is it resolvedd shortly after?

thinking if this might be an issue in JOSDK:

https://github.com/java-operator-sdk/java-operator-sdk/blob/9d274534f8a222c931b05dc05490b744a9ea6ea4/operator-framework-core/src/main/java/io/javaoperatorsdk/operator/health/InformerHealthIndicator.java#L11-L14

So, the ... && hasSynced() && isWatching() is also part of the expression. So hasSynced might be problem on startup. But isWatching could be problem any time, since watch might be down.

However this coin has two sides, if the informer is running but not watching it is a problem. But only if it happens for a longer time period. Since it might resemble a real problem with the watch (like lost permissions to watch a resource). I will add log messages for the next patch release, so we can verify if this is the problem.

csviri commented 1 year ago

see: https://github.com/java-operator-sdk/java-operator-sdk/pull/1737

bhellema commented 1 year ago

@csviri the issue showed up pretty frequently and wouldn't resolve.

Is there a release with this debug logging that I can try?

csviri commented 1 year ago

@bhellema just released it: https://github.com/java-operator-sdk/java-operator-sdk/releases/tag/v4.2.4

see logging: https://github.com/java-operator-sdk/java-operator-sdk/pull/1737/files

metacosm commented 1 year ago

The corresponding Quarkus extension release is 5.0.4. Please let us know if that addresses your issue.

metacosm commented 1 year ago

I believe this has been addressed in the 5.0.4 version so closing this issue for now. Please re-open if you're still encounter the problem.

bhellema commented 1 year ago

@metacosm, sorry for the very long delay in getting back to this, I finally had a bit of time to try to upgrade things. I'm still having issues using the latest quarkus-operator-sdk 6.0.1.

It looks like we're having issues getting the Kubernetes Client injected into the Health check for some reason after having upgraded. I'm still investigating.

{"time":"2023-05-10T17:28:52.489Z","class":"io.smallrye.health","level":"ERROR","message":"SRHCK01000: Error processing Health Checks [Error Occurred After Shutdown]","stackTrace":"java.lang.RuntimeException: Error injecting io.fabric8.kubernetes.client.KubernetesClient com.adobe.granite.contentbackflow.operator.probes.ContentBackflowHealthCheck.kubernetesClient\n\tat com.adobe.granite.contentbackflow.operator.probes.ContentBackflowHealthCheck_Bean.doCreate(Unknown Source)\n\tat com.adobe.granite.contentbackflow.operator.probes.ContentBackflowHealthCheck_Bean.create(Unknown Source)\n\tat com.adobe.granite.contentbackflow.operator.probes.ContentBackflowHealthCheck_Bean.create(Unknown Source)\n\tat io.quarkus.arc.impl.AbstractSharedContext.createInstanceHandle(AbstractSharedContext.java:113)\n\tat io.quarkus.arc.impl.AbstractSharedContext$1.get(AbstractSharedContext.java:37)\n\tat io.quarkus.arc.impl.AbstractSharedContext$1.get(AbstractSharedContext.java:34)\n\tat io.quarkus.arc.impl.LazyValue.get(LazyValue.java:26)\n\tat io.quarkus.arc.impl.ComputingCache.computeIfAbsent(ComputingCache.java:69)\n\tat io.quarkus.arc.impl.AbstractSharedContext.get(AbstractSharedContext.java:34)\n\tat io.quarkus.arc.impl.ClientProxies.getApplicationScopedDelegate(ClientProxies.java:21)\n\tat com.adobe.granite.contentbackflow.operator.probes.ContentBackflowHealthCheck_ClientProxy.arc$delegate(Unknown Source)\n\tat com.adobe.granite.contentbackflow.operator.probes.ContentBackflowHealthCheck_ClientProxy.call(Unknown Source)\n\tat io.smallrye.context.impl.wrappers.SlowContextualSupplier.get(SlowContextualSupplier.java:21)\n\tat io.smallrye.mutiny.operators.uni.builders.UniCreateFromItemSupplier.subscribe(UniCreateFromItemSupplier.java:28)\n\tat io.smallrye.mutiny.operators.AbstractUni.subscribe(AbstractUni.java:36)\n\tat io.smallrye.mutiny.operators.uni.UniOnFailureFlatMap.subscribe(UniOnFailureFlatMap.java:31)\n\tat io.smallrye.mutiny.operators.AbstractUni.subscribe(AbstractUni.java:36)\n\tat io.smallrye.mutiny.operators.uni.UniOnItemTransform.subscribe(UniOnItemTransform.java:22)\n\tat io.smallrye.mutiny.operators.AbstractUni.subscribe(AbstractUni.java:36)\n\tat io.smallrye.mutiny.operators.uni.UniAndCombination$UniHandler.subscribe(UniAndCombination.java:233)\n\tat io.smallrye.mutiny.operators.uni.UniAndCombination$AndSupervisor.run(UniAndCombination.java:87)\n\tat io.smallrye.mutiny.operators.uni.UniAndCombination.subscribe(UniAndCombination.java:53)\n\tat io.smallrye.mutiny.operators.AbstractUni.subscribe(AbstractUni.java:36)\n\tat io.smallrye.mutiny.operators.uni.UniBlockingAwait.await(UniBlockingAwait.java:60)\n\tat io.smallrye.mutiny.groups.UniAwait.atMost(UniAwait.java:65)\n\tat io.smallrye.health.SmallRyeHealthReporter.getReadiness(SmallRyeHealthReporter.java:232)\n\tat io.smallrye.health.SmallRyeHealthReporter_ClientProxy.getReadiness(Unknown Source)\n\tat io.quarkus.smallrye.health.runtime.SmallRyeReadinessHandler.getHealth(SmallRyeReadinessHandler.java:11)\n\tat io.quarkus.smallrye.health.runtime.SmallRyeHealthHandlerBase.doHandle(SmallRyeHealthHandlerBase.java:44)\n\tat io.quarkus.smallrye.health.runtime.SmallRyeHealthHandlerBase.handle(SmallRyeHealthHandlerBase.java:31)\n\tat io.quarkus.smallrye.health.runtime.SmallRyeReadinessHandler.handle(SmallRyeReadinessHandler.java:7)\n\tat io.quarkus.smallrye.health.runtime.SmallRyeHealthHandlerBase.handle(SmallRyeHealthHandlerBase.java:19)\n\tat io.vertx.ext.web.impl.BlockingHandlerDecorator.lambda$handle$0(BlockingHandlerDecorator.java:48)\n\tat io.vertx.core.impl.ContextBase.lambda$null$0(ContextBase.java:137)\n\tat io.vertx.core.impl.ContextInternal.dispatch(ContextInternal.java:264)\n\tat io.vertx.core.impl.ContextBase.lambda$executeBlocking$1(ContextBase.java:135)\n\tat io.quarkus.vertx.core.runtime.VertxCoreRecorder$14.runWith(VertxCoreRecorder.java:576)\n\tat org.jboss.threads.EnhancedQueueExecutor$Task.run(EnhancedQueueExecutor.java:2513)\n\tat org.jboss.threads.EnhancedQueueExecutor$ThreadBody.run(EnhancedQueueExecutor.java:1538)\n\tat org.jboss.threads.DelegatingRunnable.run(DelegatingRunnable.java:29)\n\tat org.jboss.threads.ThreadLocalResettingRunnable.run(ThreadLocalResettingRunnable.java:29)\n\tat io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)\n\tat java.base/java.lang.Thread.run(Thread.java:834)\nCaused by: java.lang.NullPointerException\n\tat io.quarkus.it.openshift.client.runtime.OpenShiftClientProducer_ProducerMethod_openShiftClient_ccf49c646fbe836d0e8b8406400d0ad276cd5438_Bean.get(Unknown Source)\n\tat io.quarkus.it.openshift.client.runtime.OpenShiftClientProducer_ProducerMethod_openShiftClient_ccf49c646fbe836d0e8b8406400d0ad276cd5438_Bean.get(Unknown Source)\n\t... 43 more\n","errorType":"java.lang.RuntimeException","errorMessage":"Error injecting io.fabric8.kubernetes.client.KubernetesClient com.adobe.granite.contentbackflow.operator.probes.ContentBackflowHealthCheck.kubernetesClient"}