stackabletech / spark-k8s-operator

Operator for Apache Spark-on-Kubernetes for Stackable Data Platform
https://stackable.tech

Service account doesn't have correct permissions to delete resources #316

Closed johnfitzy closed 10 months ago

johnfitzy commented 10 months ago

Affected version

23.11

Current and expected behavior

Following the instructions here, the service account that is created for the job (pyspark-pi) doesn't have the correct permissions to delete K8s resources (Pods, ConfigMaps, PVCs and Services) after the job finishes.

Example error:

2023-12-03T20:54:11,086 ERROR [Thread-4] org.apache.spark.util.Utils - Uncaught exception in thread Thread-4
io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: DELETE at: https://kubernetes.default.svc/api/v1/namespaces/default/services?labelSelector=spark-app-selector%3Dspark-835bbb64765f4e558c20de0dc60668ff. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. services is forbidden: User "system:serviceaccount:default:pyspark-pi" cannot deletecollection resource "services" in API group "" in the namespace "default".

Service Account

Name:                pyspark-pi
Namespace:           default
Labels:              app.kubernetes.io/component=service-account
                     app.kubernetes.io/instance=pyspark-pi
                     app.kubernetes.io/managed-by=spark.stackable.tech_sparkapplication
                     app.kubernetes.io/name=spark-k8s
                     app.kubernetes.io/role-group=sparkapplication
                     app.kubernetes.io/version=1.0
Annotations:         <none>
Image pull secrets:  <none>
Mountable secrets:   <none>
Tokens:              <none>
Events:              <none>

RoleBinding

Name:         pyspark-pi
Labels:       app.kubernetes.io/component=role-binding
              app.kubernetes.io/instance=pyspark-pi
              app.kubernetes.io/managed-by=spark.stackable.tech_sparkapplication
              app.kubernetes.io/name=spark-k8s
              app.kubernetes.io/role-group=sparkapplication
              app.kubernetes.io/version=1.0
Annotations:  <none>
Role:
  Kind:  ClusterRole
  Name:  spark-k8s-clusterrole
Subjects:
  Kind            Name        Namespace
  ----            ----        ---------
  ServiceAccount  pyspark-pi  default

ClusterRole

Name:         spark-k8s-clusterrole
Labels:       app.kubernetes.io/managed-by=Helm
Annotations:  meta.helm.sh/release-name: spark-k8s-operator
              meta.helm.sh/release-namespace: default
PolicyRule:
  Resources               Non-Resource URLs  Resource Names  Verbs
  ---------               -----------------  --------------  -----
  configmaps              []                 []              [create delete get list patch update watch]
  persistentvolumeclaims  []                 []              [create delete get list patch update watch]
  pods                    []                 []              [create delete get list patch update watch]
  secrets                 []                 []              [create delete get list patch update watch]
  serviceaccounts         []                 []              [create delete get list patch update watch]
  services                []                 []              [create delete get list patch update watch]
  events.events.k8s.io    []                 []              [create]

Possible solution

No response

Additional context

Environment

No response

Would you like to work on fixing this bug?

maybe

sbernauer commented 10 months ago

Thanks for the detailed description! Another user has already reported this and we have the fix in https://github.com/stackabletech/spark-k8s-operator/pull/313. Sadly this was literally 2 days after we branched off 23.11.0, so it's not part of that release. Would you be ok with using the nightly version of the spark-k8s operator?

In any case, the deployed resources should have an ownerReference to the SparkApplication, so deleting that should hopefully clean everything up.

johnfitzy commented 10 months ago

Hi, thanks. Yes, I can use the nightly version at the moment. I'll keep an eye out for the next release.

ruslanguns commented 8 months ago

I can confirm that editing the cluster role spark-k8s-clusterrole and adding - deletecollection to the verbs section fixes the problem. I think this can serve as a workaround until a new version of the Helm charts is published.
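For anyone applying the same workaround, here is a rough sketch of what the edited rule might look like. It is reconstructed from the ClusterRole description earlier in this issue, not copied from the Helm chart or from the fix in the linked PR: the grouping of all six resources into a single rule is an assumption, and the real chart may split them differently. The empty apiGroups entry matches the API group "" reported in the error message. In Kubernetes RBAC, deletecollection is a separate verb from delete, which is why the existing delete permission does not cover the DELETE request with a labelSelector shown in the error.

```yaml
# Sketch of one rule under `rules:` in the ClusterRole spark-k8s-clusterrole.
# Assumption: the chart keeps these core-group resources in a single rule;
# adjust to whatever layout `kubectl edit clusterrole spark-k8s-clusterrole` shows.
- apiGroups:
    - ""                  # core API group, as named in the error message
  resources:
    - configmaps
    - persistentvolumeclaims
    - pods
    - secrets
    - serviceaccounts
    - services
  verbs:
    - create
    - delete
    - deletecollection    # the verb missing in 23.11
    - get
    - list
    - patch
    - update
    - watch
```

Since the ClusterRole is labelled as managed by Helm, a manual edit like this may be reverted the next time the operator chart is upgraded or reinstalled.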

Thanks!