operator-framework / java-operator-sdk

Java SDK for building Kubernetes Operators
https://javaoperatorsdk.io/
Apache License 2.0
786 stars 212 forks source link

Namespace deletion stuck if contains CRs that are watched by the operator #1876

Open gyfora opened 1 year ago

gyfora commented 1 year ago

Bug Report

This is likely not a JOSDK bug but based on offline discussion with @csviri I am opening it here to track it.

In our current setup the operator is deployed in namespace x and is watching namespace y. The access to namespace y is controlled by roles and rolebindings (created in namespace y).

If there are CRs present in y and the namespace is deleted before the CRs are individually deleted we get the following exception during cleanup:

ERROR][flink/basic-example] Error during event processing ExecutionScope{ resource id: ResourceID{name='basic-example', namespace='flink'}, version: 1791281} failed.
io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: PUT at: https://10.96.0.1:443/apis/flink.apache.org/v1beta1/namespaces/flink/flinkdeployments/basic-example. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. flinkdeployments.flink.apache.org "basic-example" is forbidden: User "system:serviceaccount:default:flink-operator" cannot update resource "flinkdeployments" in API group "flink.apache.org" in the namespace "flink".
    at io.fabric8.kubernetes.client.KubernetesClientException.copyAsCause(KubernetesClientException.java:238)
    at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.waitForResult(OperationSupport.java:546)
    at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleResponse(OperationSupport.java:566)
    at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleUpdate(OperationSupport.java:369)
    at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.handleUpdate(BaseOperation.java:712)
    at io.fabric8.kubernetes.client.dsl.internal.HasMetadataOperation.lambda$replace$0(HasMetadataOperation.java:172)
    at io.fabric8.kubernetes.client.dsl.internal.HasMetadataOperation.replace(HasMetadataOperation.java:177)
    at io.fabric8.kubernetes.client.dsl.internal.HasMetadataOperation.replace(HasMetadataOperation.java:88)
    at io.fabric8.kubernetes.client.dsl.internal.HasMetadataOperation.replace(HasMetadataOperation.java:39)
    at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher$CustomResourceFacade.updateResource(ReconciliationDispatcher.java:387)
    at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.conflictRetryingUpdate(ReconciliationDispatcher.java:343)
    at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleCleanup(ReconciliationDispatcher.java:297)
    at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleDispatch(ReconciliationDispatcher.java:87)
    at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleExecution(ReconciliationDispatcher.java:62)
    at io.javaoperatorsdk.operator.processing.event.EventProcessor$ReconcilerExecutor.run(EventProcessor.java:414)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.base/java.lang.Thread.run(Unknown Source)
Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: PUT at: https://10.96.0.1:443/apis/flink.apache.org/v1beta1/namespaces/flink/flinkdeployments/basic-example. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. flinkdeployments.flink.apache.org "basic-example" is forbidden: User "system:serviceaccount:default:flink-operator" cannot update resource "flinkdeployments" in API group "flink.apache.org" in the namespace "flink".
    at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.requestFailure(OperationSupport.java:701)
    at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.requestFailure(OperationSupport.java:681)
    at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.assertResponseCode(OperationSupport.java:628)
    at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.lambda$handleResponse$0(OperationSupport.java:591)
    at java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture.complete(Unknown Source)
    at io.fabric8.kubernetes.client.http.StandardHttpClient.lambda$completeOrCancel$5(StandardHttpClient.java:120)
    at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture.complete(Unknown Source)
    at io.fabric8.kubernetes.client.http.ByteArrayBodyHandler.onBodyDone(ByteArrayBodyHandler.java:52)
    at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture.complete(Unknown Source)
    at io.fabric8.kubernetes.client.okhttp.OkHttpClientImpl$OkHttpAsyncBody.doConsume(OkHttpClientImpl.java:135)
    ... 3 more

Furthermore the namespace deletion gets stuck because the finalizer from the CR is never removed. The root problem seems to be when the namespace deletion is initiated the role and rolebinding is immediately deleted therefore the operator cannot remove the finalizer from the resource anymore.

Environment

Kubernetes cluster type:

kind

JOSDK version: 4.3.0

$ kubectl version

Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.5", GitCommit:"5c99e2ac2ff9a3c549d9ca665e7bc05a3e18f07e", GitTreeState:"clean", BuildDate:"2021-12-16T08:38:33Z", GoVersion:"go1.16.12", Compiler:"gc", Platform:"darwin/arm64"}
Server Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.0", GitCommit:"b46a3f887ca979b1a5d14fd39cb1af43e7e5d12d", GitTreeState:"clean", BuildDate:"2022-12-20T03:36:50Z", GoVersion:"go1.19.4", Compiler:"gc", Platform:"linux/arm64"}
csviri commented 1 year ago

Yep this is more like a generic Kubernetes issue, but will clarify how to handle it here, since we have a feature (dynamic changes of watching namespaces) that is closely related.

rmetzger commented 1 year ago

when the namespace deletion is initiated the role and rolebinding is immediately deleted

Wouldn't setting a finalizer for the role and rolebinding solve the problem of immediate deletion?

csviri commented 1 year ago

when the namespace deletion is initiated the role and rolebinding is immediately deleted

Wouldn't setting a finalizer for the role and rolebinding solve the problem of immediate deletion?

yes, this sounds like a good idea.

This was suggested also here: https://kubernetes.slack.com/archives/CAW0GV7A5/p1682603213236239

But will create an issue in Kubernetes, see if it can be solved eventually on GC controller level.

csviri commented 1 year ago

What JOSDK could do is to provide reconcilers (one for role and one for rolebinding) that will handle adding finalizers and removing them, and it would up to the dev to register them them. Since this has also implication on permissions of the operator (update permission on role).

moayad-alyaghshi commented 1 year ago

Hi @csviri

we are facing the same issue that the namespace deletion is stuck, but even when the operator is deployed in the same namespace as the CRs, which is not expected according to what I understood from the Slack thread. I would appreciate any explanation.

Note: The operator has a ClusterRole and ClusterRoleBinding to work with the CRs. We're using Quarkus with quarkus-operator-sdk.

csviri commented 1 year ago

Hi @moayad-alyaghshi ,

I checked it briefly in namespace controller and the garbage collector controller when @gyfora reported this, and it seems (well as far I was able to see) there is nothing special to prevent this in K8S to happen even in the same namespace.

So this is not an issue with JOSDK, it's rather issue with K8S. What we can offer is that reconciler that solves this, just was not priority for now, scheduled this for 4.5;

Maybe it is worth asking again around this on k8s slack: https://kubernetes.slack.com/archives/CAW0GV7A5

csviri commented 9 months ago

issue in k8s: https://github.com/kubernetes/kubernetes/issues/115070

github-actions[bot] commented 6 months ago

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days.

jessebye commented 5 months ago

The Kubernetes issue was closed. What are the next steps for this? Is there a way to solve this in java-operator-sdk?

This issue causes all our namespaces to hang on termination because the CR is never finalized by Java Operator SDK...

csviri commented 5 months ago

@jessebye yes, there is a way to solve this with custom reconcile. My plan is to implement that, also make it available as a standalone controller for non-Java/JOSDK projects. Will move this to 5.0, since more people asking for that.

cc @metacosm