kubeflow / spark-operator

Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.
Apache License 2.0

configmaps not getting cleaned up when spark application deleted #1596

Open jdonnelly-apixio opened 1 year ago

jdonnelly-apixio commented 1 year ago

ConfigMaps don't get cleaned up when SparkApplications are deleted. I think it would be good to add owner references to the ConfigMaps that are created so cascading deletes can happen. I had ~160k leftover ConfigMaps in my cluster, and listing them was causing timeouts near the end of Spark jobs.

2022-08-11 00:48:10,553 ERROR Utils: Uncaught exception in thread main
io.fabric8.kubernetes.client.KubernetesClientException: Operation: [list]  for kind: [ConfigMap]  with name: [null]  in namespace: [default]  failed.
    at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64)
    at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:72)
    at io.fabric8.kubernetes.client.dsl.base.BaseOperation.listRequestHelper(BaseOperation.java:167)
    at io.fabric8.kubernetes.client.dsl.base.BaseOperation.list(BaseOperation.java:673)
    at io.fabric8.kubernetes.client.dsl.base.BaseOperation.deleteList(BaseOperation.java:782)
    at io.fabric8.kubernetes.client.dsl.base.BaseOperation.delete(BaseOperation.java:705)
    at io.fabric8.kubernetes.client.dsl.base.BaseOperation.delete(BaseOperation.java:84)
    at org.apache.spark.scheduler.cluster.k8s.KubernetesClusterSchedulerBackend.$anonfun$stop$6(KubernetesClusterSchedulerBackend.scala:133)
    at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1419)
    at org.apache.spark.scheduler.cluster.k8s.KubernetesClusterSchedulerBackend.stop(KubernetesClusterSchedulerBackend.scala:134)
    at org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:881)
    at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:2365)
    at org.apache.spark.SparkContext.$anonfun$stop$12(SparkContext.scala:2075)
    at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1419)
    at org.apache.spark.SparkContext.stop(SparkContext.scala:2075)
    at com.apixio.sparkapps.utils.SparkMainUtils$.runThenShutdown(SparkMainUtils.scala:14)
    at com.apixio.sparkapps.jobs.streamer.ParserIngestStreamer$.main(ParserIngestStreamer.scala:306)
    at com.apixio.sparkapps.jobs.streamer.ParserIngestStreamer.main(ParserIngestStreamer.scala)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.base/java.lang.reflect.Method.invoke(Unknown Source)
    at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:951)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1030)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1039)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.net.SocketTimeoutException: timeout
    at okhttp3.internal.http2.Http2Stream$StreamTimeout.newTimeoutException(Http2Stream.java:672)
    at okhttp3.internal.http2.Http2Stream$StreamTimeout.exitAndThrowIfTimedOut(Http2Stream.java:680)
    at okhttp3.internal.http2.Http2Stream.takeHeaders(Http2Stream.java:153)
    at okhttp3.internal.http2.Http2Codec.readResponseHeaders(Http2Codec.java:131)
    at okhttp3.internal.http.CallServerInterceptor.intercept(CallServerInterceptor.java:88)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
    at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:45)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
    at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
    at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
    at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:127)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
    at io.fabric8.kubernetes.client.utils.BackwardsCompatibilityInterceptor.intercept(BackwardsCompatibilityInterceptor.java:135)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
    at io.fabric8.kubernetes.client.utils.OIDCTokenRefreshInterceptor.intercept(OIDCTokenRefreshInterceptor.java:41)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
    at io.fabric8.kubernetes.client.utils.ImpersonatorInterceptor.intercept(ImpersonatorInterceptor.java:68)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
    at io.fabric8.kubernetes.client.utils.HttpClientUtils.lambda$createHttpClient$3(HttpClientUtils.java:151)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
    at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:257)
    at okhttp3.RealCall.execute(RealCall.java:93)
    at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:490)
    at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:451)
    at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:433)
    at io.fabric8.kubernetes.client.dsl.base.BaseOperation.listRequestHelper(BaseOperation.java:160)
    ... 27 more
An example of one of the leftover executor ConfigMaps:

apiVersion: v1
kind: ConfigMap
metadata:
  creationTimestamp: "2022-07-28T06:36:33Z"
  labels:
    spark-app-selector: spark-9015de6452fd49779c20217dea20e7f5
    spark-role: executor
  name: spark-exec-fffb42824385691c-conf-map
  namespace: default
  resourceVersion: "387121912"
  uid: 22921745-3f41-43d0-8c92-f97f52f3d310
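The owner-reference fix suggested above could look roughly like this on a leftover executor ConfigMap (a sketch only: the SparkApplication name and uid below are placeholders, and exactly which component should set the reference depends on how these ConfigMaps are created):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: spark-exec-fffb42824385691c-conf-map
  namespace: default
  labels:
    spark-app-selector: spark-9015de6452fd49779c20217dea20e7f5
    spark-role: executor
  # Hypothetical owner reference: with this set, deleting the owning
  # SparkApplication would cascade to the ConfigMap via the Kubernetes
  # garbage collector.
  ownerReferences:
    - apiVersion: sparkoperator.k8s.io/v1beta2
      kind: SparkApplication
      name: my-spark-app                               # placeholder: owning app
      uid: 00000000-0000-0000-0000-000000000000        # placeholder: owner uid
      controller: true
      blockOwnerDeletion: true
```

Owner references only cascade within a namespace, so this assumes the ConfigMaps live in the same namespace as the SparkApplication, which they do in the example above.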
avivgold098 commented 1 year ago

@jdonnelly-apixio hey! We haven't hit the issue yet, but this could lead to serious problems. At the moment we have "only" ~4.5k ConfigMaps on our end. Have you found a way to clean these up at the end of a job run?
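One stopgap for cleaning up after the fact is a label-selector delete (a sketch, assuming the leftovers carry the `spark-role` label shown in the ConfigMap above; adjust the namespace and label values to your setup, and dry-run first):

```sh
# Preview what would be deleted (executor conf maps in the default namespace)
kubectl delete configmap -n default -l spark-role=executor --dry-run=client -o name

# Delete for real once the list looks right; repeat for spark-role=driver if needed
kubectl delete configmap -n default -l spark-role=executor
```

Running something like this on a schedule only masks the leak, but it keeps the ConfigMap count low enough to avoid the list timeouts described above.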

supereagle commented 1 year ago

@jdonnelly-apixio Have you solved this problem? I think we've hit the same issue; our cluster has ~120k+ ConfigMaps.

Fiorellaps commented 1 year ago

I had the following error in the driver pod:

ERROR Utils: Uncaught exception in thread Thread-19 Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. configmaps is forbidden: User "system:serviceaccount:spark-ns:spark-operator" cannot list resource "configmaps" in API group "" in the namespace "spark-ns".

I created a ClusterRoleBinding to grant the edit role to the operator's service account (called spark-operator in my case). Note that a ClusterRoleBinding is cluster-scoped, so no namespace flag is needed:

kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=spark-ns:spark-operator
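If cluster-wide edit is broader than you want, a namespaced Role and RoleBinding scoped to ConfigMaps may be enough (a sketch; the object names and the exact verb list are assumptions, so trim them to what your jobs actually need):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: spark-configmap-access   # assumed name
  namespace: spark-ns
rules:
  - apiGroups: [""]              # "" is the core API group, where ConfigMaps live
    resources: ["configmaps"]
    verbs: ["get", "list", "create", "delete", "deletecollection"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: spark-configmap-access   # assumed name
  namespace: spark-ns
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: spark-configmap-access
subjects:
  - kind: ServiceAccount
    name: spark-operator
    namespace: spark-ns
```

This grants the permissions the error message above complains about (listing and deleting ConfigMaps in spark-ns) without giving the service account edit rights on every resource in the cluster.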

gaurav-uptycs commented 1 year ago

I am facing a similar error, except that instead of ConfigMaps it's unable to create pods. Could this be an issue with the permissions given to the Spark role?

powerLambda commented 1 year ago

+1 for this issue.

doryer commented 11 months ago

+1 here also

wesleygoi-liftoff commented 10 months ago

+1 here

FeryET commented 1 month ago

Any updates on how to resolve this issue?