kubeflow / spark-operator

Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.
Apache License 2.0

Webhook is disappearing #910

Open michalzxc opened 4 years ago

michalzxc commented 4 years ago

Hi, I don't really know how to debug this, but our Spark Operator webhook is randomly disappearing. At first it is there:

kubectl get MutatingWebhookConfiguration                                                                                       
NAME                                          CREATED AT
istio-sidecar-injector                        2020-02-17T16:28:36Z
pod-identity-webhook                          2020-01-22T19:10:45Z
spark-operator-sparkoperator-webhook-config   2020-05-11T13:31:29Z
vault-secrets-webhook                         2020-02-05T14:49:34Z

Later we see pods crashing in the spark namespace, and when we check, the webhook is gone.
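One way to pin down when the webhook vanishes is to poll for it and log a timestamp on every state change. The sketch below is only a hypothetical helper: it assumes `kubectl` is on the PATH with access to the cluster, and the function names (`webhook_names`, `watch_webhook`) and the 10-second interval are illustrative choices, not part of the operator.

```python
import subprocess
import time
from datetime import datetime, timezone

# Name taken from the kubectl output above.
WEBHOOK = "spark-operator-sparkoperator-webhook-config"


def webhook_names(kubectl_output: str) -> list[str]:
    """Return the NAME column of `kubectl get MutatingWebhookConfiguration` table output."""
    lines = [ln for ln in kubectl_output.splitlines() if ln.strip()]
    # Skip the header row (NAME / CREATED AT) if present.
    if lines and lines[0].split()[0] == "NAME":
        lines = lines[1:]
    return [ln.split()[0] for ln in lines]


def watch_webhook(name: str = WEBHOOK, interval_s: float = 10.0) -> None:
    """Poll kubectl and print a UTC timestamp whenever the webhook appears or disappears."""
    previous = None
    while True:
        result = subprocess.run(
            ["kubectl", "get", "MutatingWebhookConfiguration"],
            capture_output=True, text=True, check=False,
        )
        if result.returncode != 0:
            # Do not mistake a failed kubectl call for a missing webhook.
            print(f"kubectl failed: {result.stderr.strip()}")
        else:
            present = name in webhook_names(result.stdout)
            if present != previous:
                stamp = datetime.now(timezone.utc).isoformat()
                print(f"{stamp} {name} present={present}")
                previous = present
        time.sleep(interval_s)
```

Running `watch_webhook()` in a terminal and comparing the timestamp of the `present=False` line with the operator logs should narrow down which event precedes the deletion.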

Today it happened after one hour. I am not really sure how to debug it. Our Helm values.yaml:

  sparkJobNamespace: "spark"
  sparknamespaceselector: "sparkinjector=enabled"
  replicas: 4
  enableLeaderElection: true

michalzxc commented 4 years ago

If it really is spark-operator misbehaving, I am considering taking away its RBAC permission to touch the webhook.

michalzxc commented 4 years ago

It didn't like not being able to touch the webhook:

F0512 15:06:30.565758       9 main.go:199] mutatingwebhookconfigurations.admissionregistration.k8s.io "spark-operator-sparkoperator-webhook-config" is forbidden: User "system:serviceaccount:spark-operator:spark-operator-sparkoperator" cannot get resource "mutatingwebhookconfigurations" in API group "admissionregistration.k8s.io" at the cluster scope

michalzxc commented 4 years ago

It was gone again. It seems to happen mostly during or after a new release. We have a Helm chart with ~20 Spark jobs; some new pods got mutated and others didn't, because the webhook was already gone.


    May 12 15:04:09.417 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    I0512 14:04:09.417490 9 controller.go:113] Stopping the ScheduledSparkApplication controller
    May 12 15:04:09.417 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    I0512 14:04:09.417476 9 controller.go:171] Stopping the SparkApplication controller
    May 12 15:04:09.417 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    I0512 14:04:09.417449 9 main.go:225] Shutting down the Spark Operator
    May 12 15:04:04.177 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    I0512 14:04:04.177502 9 controller.go:164] Syncing ScheduledSparkApplication spark/dam-user-library-batch
    May 12 15:04:04.135 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    I0512 14:04:04.135143 9 submission.go:63] spark-submit arguments: [/opt/spark/bin/spark-submit --master k8s://https://172.20.0.1:443 --deploy-mode cluster --conf spark.kubernetes.namespace=spark --conf spark.app.name=dam-arm-outcomes --conf spark.kubernetes.driver.pod.name=dam-arm-outcomes-driver --jars /usr/external_jars/spark_streaming_kafka_assembly.jar --conf spark.kubernetes.container.image=122558522240.dkr.ecr.eu-west-1.amazonaws.com/dam:18854a72f89d69f258426f484b4c200dff553470 --conf spark.kubernetes.container.image.pullPolicy=Always --conf spark.kubernetes.pyspark.pythonVersion=3 --conf spark.kubernetes.submission.waitAppCompletion=false --conf spark.streaming.driver.writeAheadLog.batchingTimeout=15000 --conf spark.executor.heartbeatInterval=60s --conf spark.network.timeout=900s --conf spark.streaming.receiver.writeAheadLog.closeFileAfterWrite=true --conf spark.streaming.backpressure.enabled=true --conf spark.kubernetes.driver.secrets.dam=/opt/spark/conf/envs --conf spark
    May 12 15:04:04.135 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    I0512 14:04:04.135041 9 controller.go:258] Starting processing key: "spark/dam-arm-outcomes"
    May 12 15:04:04.134 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    I0512 14:04:04.134998 9 controller.go:218] SparkApplication spark/dam-arm-outcomes was updated, enqueueing it
    May 12 15:03:41.318 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    } map[] 0 1}]
    May 12 15:03:41.318 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
    May 12 15:03:41.318 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    log4j:WARN Please initialize the log4j system properly.
    May 12 15:03:41.318 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    log4j:WARN No appenders could be found for logger (io.fabric8.kubernetes.client.Config).
    May 12 15:03:41.318 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    20/05/12 14:03:19 INFO ShutdownHookManager: Deleting directory /tmp/spark-be44c4fc-516b-499b-8397-5daa0b46b0b6
    May 12 15:03:41.318 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    20/05/12 14:03:19 INFO ShutdownHookManager: Shutdown hook called
    May 12 15:03:41.318 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    ... 50 more
    May 12 15:03:41.318 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at okio.AsyncTimeout$2.read(AsyncTimeout.java:237)
    May 12 15:03:41.318 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at okio.Okio$2.read(Okio.java:139)
    May 12 15:03:41.318 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at sun.security.ssl.AppInputStream.read(AppInputStream.java:105)
    May 12 15:03:41.318 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:933)
    May 12 15:03:41.318 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:975)
    May 12 15:03:41.318 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at sun.security.ssl.InputRecord.read(InputRecord.java:503)
    May 12 15:03:41.318 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
    May 12 15:03:41.318 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at java.net.SocketInputStream.read(SocketInputStream.java:141)
    May 12 15:03:41.318 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at java.net.SocketInputStream.read(SocketInputStream.java:171)
    May 12 15:03:41.318 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
    May 12 15:03:41.318 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at java.net.SocketInputStream.socketRead0(Native Method)
    May 12 15:03:41.318 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    Caused by: java.net.SocketTimeoutException: Read timed out
    May 12 15:03:41.318 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    ... 17 more
    May 12 15:03:41.318 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:326)
    May 12 15:03:41.318 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:796)
    May 12 15:03:41.318 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleCreate(OperationSupport.java:234)
    May 12 15:03:41.318 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:365)
    May 12 15:03:41.318 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:404)
    May 12 15:03:41.318 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at okhttp3.RealCall.execute(RealCall.java:69)
    May 12 15:03:41.318 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:185)
    May 12 15:03:41.318 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
    May 12 15:03:41.318 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    May 12 15:03:41.318 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at io.fabric8.kubernetes.client.utils.HttpClientUtils.lambda$createHttpClient$3(HttpClientUtils.java:110)
    May 12 15:03:41.318 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
    May 12 15:03:41.318 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    May 12 15:03:41.250 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at io.fabric8.kubernetes.client.utils.ImpersonatorInterceptor.intercept(ImpersonatorInterceptor.java:68)
    May 12 15:03:41.250 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
    May 12 15:03:41.250 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    May 12 15:03:41.250 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at io.fabric8.kubernetes.client.utils.BackwardsCompatibilityInterceptor.intercept(BackwardsCompatibilityInterceptor.java:119)
    May 12 15:03:41.250 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
    May 12 15:03:41.250 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    May 12 15:03:41.250 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:120)
    May 12 15:03:41.250 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    May 12 15:03:41.250 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93)
    May 12 15:03:41.250 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
    May 12 15:03:41.250 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    May 12 15:03:41.250 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
    May 12 15:03:41.250 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
    May 12 15:03:41.250 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    May 12 15:03:41.250 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:45)
    May 12 15:03:41.250 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    May 12 15:03:41.250 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at okhttp3.internal.http.CallServerInterceptor.intercept(CallServerInterceptor.java:75)
    May 12 15:03:41.250 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at okhttp3.internal.http1.Http1Codec.readResponseHeaders(Http1Codec.java:189)
    May 12 15:03:41.250 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.java:211)
    May 12 15:03:41.250 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.java:217)
    May 12 15:03:41.250 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at okio.RealBufferedSource.indexOf(RealBufferedSource.java:345)
    May 12 15:03:41.250 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at okio.AsyncTimeout$2.read(AsyncTimeout.java:241)
    May 12 15:03:41.250 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at okio.AsyncTimeout.exit(AsyncTimeout.java:285)
    May 12 15:03:41.250 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at okio.Okio$4.newTimeoutException(Okio.java:230)
    May 12 15:03:41.250 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    Caused by: java.net.SocketTimeoutException: timeout
    May 12 15:03:41.250 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
    May 12 15:03:41.250 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
    May 12 15:03:41.250 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
    May 12 15:03:41.250 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
    May 12 15:03:41.250 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
    May 12 15:03:41.250 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
    May 12 15:03:41.250 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
    May 12 15:03:41.250 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:204)
    May 12 15:03:41.250 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:241)
    May 12 15:03:41.250 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2543)
    May 12 15:03:41.250 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication$$anonfun$run$5.apply(KubernetesClientApplication.scala:241)
    May 12 15:03:41.250 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication$$anonfun$run$5.apply(KubernetesClientApplication.scala:250)
    May 12 15:03:41.250 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at org.apache.spark.deploy.k8s.submit.Client.run(KubernetesClientApplication.scala:140)
    May 12 15:03:41.250 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2543)
    May 12 15:03:41.250 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at org.apache.spark.deploy.k8s.submit.Client$$anonfun$run$2.apply(KubernetesClientApplication.scala:140)
    May 12 15:03:41.250 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at org.apache.spark.deploy.k8s.submit.Client$$anonfun$run$2.apply(KubernetesClientApplication.scala:141)
    May 12 15:03:41.250 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:322)
    May 12 15:03:41.250 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:329)
    May 12 15:03:41.250 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:72)
    May 12 15:03:41.250 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64)
    May 12 15:03:41.250 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    Exception in thread "main" io.fabric8.kubernetes.client.KubernetesClientException: Operation: [create] for kind: [Pod] with name: [null] in namespace: [spark] failed.
    May 12 15:03:41.250 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
    May 12 15:03:41.250 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
    May 12 15:03:41.250 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    log4j:WARN Please initialize the log4j system properly.
    May 12 15:03:41.250 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    log4j:WARN No appenders could be found for logger (io.fabric8.kubernetes.client.Config).
    May 12 15:03:41.250 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
    May 12 15:03:41.250 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    log4j:WARN Please initialize the log4j system properly.
    May 12 15:03:41.250 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    log4j:WARN No appenders could be found for logger (io.fabric8.kubernetes.client.Config).
    May 12 15:03:41.250 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
    May 12 15:03:41.250 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    log4j:WARN Please initialize the log4j system properly.
    May 12 15:03:41.250 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    log4j:WARN No appenders could be found for logger (io.fabric8.kubernetes.client.Config).
    May 12 15:03:41.183 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    } map[] 0 1}]
    May 12 15:03:41.183 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    20/05/12 14:03:19 INFO ShutdownHookManager: Deleting directory /tmp/spark-be44c4fc-516b-499b-8397-5daa0b46b0b6
    May 12 15:03:41.183 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    20/05/12 14:03:19 INFO ShutdownHookManager: Shutdown hook called
    May 12 15:03:41.183 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    ... 50 more
    May 12 15:03:41.183 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at okio.AsyncTimeout$2.read(AsyncTimeout.java:237)
    May 12 15:03:41.183 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at okio.Okio$2.read(Okio.java:139)
    May 12 15:03:41.183 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at sun.security.ssl.AppInputStream.read(AppInputStream.java:105)
    May 12 15:03:41.183 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:933)
    May 12 15:03:41.183 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:975)
    May 12 15:03:41.183 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at sun.security.ssl.InputRecord.read(InputRecord.java:503)
    May 12 15:03:41.183 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
    May 12 15:03:41.183 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at java.net.SocketInputStream.read(SocketInputStream.java:141)
    May 12 15:03:41.183 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at java.net.SocketInputStream.read(SocketInputStream.java:171)
    May 12 15:03:41.183 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
    May 12 15:03:41.183 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at java.net.SocketInputStream.socketRead0(Native Method)
    May 12 15:03:41.183 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    Caused by: java.net.SocketTimeoutException: Read timed out
    May 12 15:03:41.183 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    ... 17 more
    May 12 15:03:41.183 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:326)
    May 12 15:03:41.183 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:796)
    May 12 15:03:41.183 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleCreate(OperationSupport.java:234)
    May 12 15:03:41.183 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:365)
    May 12 15:03:41.183 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:404)
    May 12 15:03:41.183 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at okhttp3.RealCall.execute(RealCall.java:69)
    May 12 15:03:41.183 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:185)
    May 12 15:03:41.183 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
    May 12 15:03:41.183 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    May 12 15:03:41.183 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at io.fabric8.kubernetes.client.utils.HttpClientUtils.lambda$createHttpClient$3(HttpClientUtils.java:110)
    May 12 15:03:41.183 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
    May 12 15:03:41.183 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    May 12 15:03:41.183 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at io.fabric8.kubernetes.client.utils.ImpersonatorInterceptor.intercept(ImpersonatorInterceptor.java:68)
    May 12 15:03:41.183 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
    May 12 15:03:41.183 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    May 12 15:03:41.183 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at io.fabric8.kubernetes.client.utils.BackwardsCompatibilityInterceptor.intercept(BackwardsCompatibilityInterceptor.java:119)
    May 12 15:03:41.183 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
    May 12 15:03:41.183 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    May 12 15:03:41.183 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:120)
    May 12 15:03:41.183 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    May 12 15:03:41.183 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93)
    May 12 15:03:41.183 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
    May 12 15:03:41.183 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    May 12 15:03:41.183 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
    May 12 15:03:41.183 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
    May 12 15:03:41.183 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    May 12 15:03:41.183 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:45)
    May 12 15:03:41.183 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    May 12 15:03:41.183 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at okhttp3.internal.http.CallServerInterceptor.intercept(CallServerInterceptor.java:75)
    May 12 15:03:41.183 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at okhttp3.internal.http1.Http1Codec.readResponseHeaders(Http1Codec.java:189)
    May 12 15:03:41.183 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.java:211)
    May 12 15:03:41.183 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.java:217)
    May 12 15:03:41.183 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at okio.RealBufferedSource.indexOf(RealBufferedSource.java:345)
    May 12 15:03:41.183 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at okio.AsyncTimeout$2.read(AsyncTimeout.java:241)
    May 12 15:03:41.183 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at okio.AsyncTimeout.exit(AsyncTimeout.java:285)
    May 12 15:03:41.183 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at okio.Okio$4.newTimeoutException(Okio.java:230)
    May 12 15:03:41.183 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    Caused by: java.net.SocketTimeoutException: timeout
    May 12 15:03:41.183 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
    May 12 15:03:41.183 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
    May 12 15:03:41.183 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
    May 12 15:03:41.183 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
    May 12 15:03:41.183 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
    May 12 15:03:41.183 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
    May 12 15:03:41.183 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
    May 12 15:03:41.183 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:204)
    May 12 15:03:41.183 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:241)
    May 12 15:03:41.183 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2543)
    May 12 15:03:41.183 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication$$anonfun$run$5.apply(KubernetesClientApplication.scala:241)
    May 12 15:03:41.183 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication$$anonfun$run$5.apply(KubernetesClientApplication.scala:250)
    May 12 15:03:41.183 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at org.apache.spark.deploy.k8s.submit.Client.run(KubernetesClientApplication.scala:140)
    May 12 15:03:41.183 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2543)
    May 12 15:03:41.183 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at org.apache.spark.deploy.k8s.submit.Client$$anonfun$run$2.apply(KubernetesClientApplication.scala:140)
    May 12 15:03:41.183 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at org.apache.spark.deploy.k8s.submit.Client$$anonfun$run$2.apply(KubernetesClientApplication.scala:141)
    May 12 15:03:41.183 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:322)
    May 12 15:03:41.183 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:329)
    May 12 15:03:41.183 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:72)
    May 12 15:03:41.183 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64)
    May 12 15:03:40.553 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    I0512 14:03:40.553998 9 spark_pod_eventhandler.go:77] Pod dam-lp-responses-driver deleted in namespace spark.
    May 12 15:03:40.553 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    I0512 14:03:40.553992 9 spark_pod_eventhandler.go:58] Pod dam-lp-responses-driver updated in namespace spark.
    May 12 15:03:40.553 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    I0512 14:03:40.553977 9 spark_pod_eventhandler.go:58] Pod dam-lp-responses-driver updated in namespace spark.
    May 12 15:03:40.553 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    I0512 14:03:40.553970 9 spark_pod_eventhandler.go:58] Pod dam-lp-responses-driver updated in namespace spark.
    May 12 15:03:40.553 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    I0512 14:03:40.553948 9 spark_pod_eventhandler.go:47] Pod dam-lp-responses-driver added in namespace spark.
    May 12 15:03:40.553 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    I0512 14:03:40.553792 9 spark_pod_eventhandler.go:58] Pod dam-redshift-sink-driver updated in namespace spark.
    May 12 15:03:40.533 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    I0512 14:03:40.533659 9 controller.go:218] SparkApplication spark/dam-event-tag-views was updated, enqueueing it
    May 12 15:03:40.532 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    I0512 14:03:40.532800 9 spark_pod_eventhandler.go:58] Pod dam-redshift-sink-driver updated in namespace spark.
    May 12 15:03:40.532 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    I0512 14:03:40.532506 9 controller.go:218] SparkApplication spark/dam-event-tag-views was updated, enqueueing it
    May 12 15:03:40.532 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    I0512 14:03:40.532281 9 controller.go:218] SparkApplication spark/dam-event-tag-views was updated, enqueueing it
    May 12 15:03:40.459 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    I0512 14:03:40.459016 9 controller.go:218] SparkApplication spark/dam-lp-responses was updated, enqueueing it
    May 12 15:03:40.458 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    I0512 14:03:40.458784 9 controller.go:218] SparkApplication spark/dam-lp-responses was updated, enqueueing it
    May 12 15:03:40.457 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    I0512 14:03:40.457869 9 controller.go:218] SparkApplication spark/dam-lp-responses was updated, enqueueing it
    May 12 15:03:40.456 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    I0512 14:03:40.456446 9 controller.go:218] SparkApplication spark/dam-lp-responses was updated, enqueueing it
    May 12 15:03:40.456 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    I0512 14:03:40.456125 9 controller.go:218] SparkApplication spark/dam-lp-responses was updated, enqueueing it
    May 12 15:03:40.454 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    I0512 14:03:40.454159 9 controller.go:218] SparkApplication spark/dam-redshift-sink was updated, enqueueing it
    May 12 15:03:40.453 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    I0512 14:03:40.453321 9 controller.go:218] SparkApplication spark/dam-event-tag-views was updated, enqueueing it
    May 12 15:03:40.452 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    I0512 14:03:40.452434 9 controller.go:218] SparkApplication spark/dam-event-views was updated, enqueueing it
    May 12 15:03:40.451 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    I0512 14:03:40.451493 9 controller.go:218] SparkApplication spark/dam-event-views was updated, enqueueing it
    May 12 15:03:40.445 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    I0512 14:03:40.445782 9 spark_pod_eventhandler.go:47] Pod dam-redshift-sink-driver added in namespace spark.
    May 12 15:03:40.445 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    I0512 14:03:40.445706 9 spark_pod_eventhandler.go:58] Pod dam-analytics-driver updated in namespace spark.
    May 12 15:03:40.259 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    I0512 14:03:40.259722 9 spark_pod_eventhandler.go:58] Pod dam-analytics-driver updated in namespace spark.
    May 12 15:03:40.259 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    I0512 14:03:40.259714 9 spark_pod_eventhandler.go:47] Pod dam-analytics-driver added in namespace spark.
    May 12 15:03:40.259 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    I0512 14:03:40.259707 9 spark_pod_eventhandler.go:95] Enqueuing SparkApplication spark/dam-sessions for app update processing.
    May 12 15:03:40.259 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    I0512 14:03:40.259700 9 spark_pod_eventhandler.go:58] Pod dam-sessions-driver updated in namespace spark.
    May 12 15:03:40.259 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    I0512 14:03:40.259684 9 spark_pod_eventhandler.go:95] Enqueuing SparkApplication spark/dam-sessions for app update processing.
    May 12 15:03:40.259 prod001static01-313411262324660.ad.dice.fm  spark-operator-sparkoperator    I0512 14:03:40.259661 9 spark_pod_eventhandler.go:58] Pod dam-sessions-driver updated in namespace spark.
michalzxc commented 4 years ago

I removed the RBAC permission to delete the webhook from the operator. It "solves" the practical side of the problem, but not the core issue.
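For anyone wanting to try the same workaround, the rule would look roughly like the sketch below. This is an assumption about the shape of the rule, not the chart's actual template, and the role name is hypothetical. The other verbs are kept because the operator exits with a fatal error when it cannot get the webhook configuration (see the "forbidden" log above).

```yaml
# Sketch only: the ClusterRole name is hypothetical and the chart's real
# template may list more resources and verbs than shown here.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: spark-operator-webhook-no-delete   # hypothetical name
rules:
  - apiGroups: ["admissionregistration.k8s.io"]
    resources: ["mutatingwebhookconfigurations"]
    # "delete" is intentionally left out; "get" must stay, otherwise the
    # operator fails at startup with the forbidden error quoted earlier.
    verbs: ["get", "create", "update"]
```

With a rule like this the operator can still register and refresh the webhook, but a shutdown can no longer remove it.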

mirajgodha commented 4 years ago

@michalzxc Thanks, we also see the same behaviour in our cluster.

ahuret commented 2 years ago

Hello :wave: We also hit this from time to time, but we don't know how to reproduce it. We saw DELETE mutatingwebhookconfigurations requests in the kube-apiserver logs. As far as we understand, the only delete calls in spark-operator happen when the application shuts down. Sometimes, when it shuts down, the pod still shows as "Running" but stays stuck in the shutdown process without restarting (an infinite loop in hook.Stop()?). We have no clue why it shuts down in the first place...

artur-bolt commented 1 year ago

Hi,

as a temporary fix I've added a livenessProbe to the Deployment. It checks whether the caBundle in the mutating webhook matches the operator's CA certificate and, on a mismatch, restarts the container so the certificates are refreshed and match again. It seems to be working for now.

livenessProbe:
  # Aggressive settings: probe every second and restart on the first failure.
  initialDelaySeconds: 1
  periodSeconds: 1
  failureThreshold: 1
  exec:
    command:
      - sh
      - -c
      - |
        set -e
        # Fetch the caBundle currently registered in the MutatingWebhookConfiguration.
        curl -iks -H "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" \
          https://kubernetes.default.svc/apis/admissionregistration.k8s.io/v1/mutatingwebhookconfigurations/{{ include "spark-operator.fullname" . }}-webhook-config \
          | grep -o '"caBundle": "[^"]*"' \
          | awk -F'"' '{print $4}' \
          | base64 -d > /tmp/registered_ca_bundle.crt
        # Compare it with the CA certificate mounted in the operator pod.
        # If the webhook configuration is missing, the registered bundle is
        # empty, so the comparison fails as well.
        operator_ca=$(cat /etc/webhook-certs/ca-cert.pem)
        registered_ca=$(cat /tmp/registered_ca_bundle.crt)
        if [ "$operator_ca" != "$registered_ca" ]; then
          exit 1
        fi
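One caveat with the probe above: the grep pattern expects exactly one space after `"caBundle":`, so it depends on how the API server formats the JSON. Below is a small sketch of the same comparison as two shell functions that tolerate different spacing. The function names `extract_ca_bundle` and `ca_bundle_matches` are made up for illustration, and only the first `caBundle` entry in the response is checked.

```shell
#!/bin/sh
# Hypothetical helpers, not part of spark-operator.

# Reads a MutatingWebhookConfiguration JSON document on stdin and prints
# the decoded caBundle of the first webhook entry found.
extract_ca_bundle() {
  tr -d '\n' \
    | grep -o '"caBundle"[[:space:]]*:[[:space:]]*"[^"]*"' \
    | head -n 1 \
    | sed 's/.*"\([^"]*\)"$/\1/' \
    | base64 -d
}

# Reads the same JSON on stdin and succeeds (exit status 0) only when the
# decoded bundle equals the contents of the PEM file given as $1.
ca_bundle_matches() {
  [ "$(extract_ca_bundle)" = "$(cat "$1")" ]
}
```

Inside the probe this could replace the grep/awk/base64 pipeline and the final `if`, for example by piping the `curl` output into `ca_bundle_matches /etc/webhook-certs/ca-cert.pem`. A missing webhook configuration produces an empty bundle, so the check fails in that case too.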
github-actions[bot] commented 5 days ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.