kubeflow / spark-operator

Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.
Apache License 2.0

Sparkapp stuck in PENDING_RERUN with driver pod not started #1191

Open FloraZhang opened 3 years ago

FloraZhang commented 3 years ago

Hello experts,

I'm hitting this issue when I start multiple Spark applications with the spark-operator: sometimes one or more SparkApplications won't start. Running 'kubectl describe sparkapplication' on the stuck application shows that the driver pod was not started and that the application is stuck in the PENDING_RERUN status (full output below). My current workaround is to delete the active spark-operator pod and force the newly scheduled operator pod to reschedule the driver pod.

Name:         detector-app
Namespace:    default
Labels:       app.kubernetes.io/instance=baseline-app
              app.kubernetes.io/managed-by=Tiller
              app.kubernetes.io/name=baseline-app
              helm.sh/chart=baseline-app-21.3.0-rel.88
Annotations:  helm.sh/hook: post-install, post-upgrade
API Version:  sparkoperator.k8s.io/v1beta2
Kind:         SparkApplication
Metadata:
  Creation Timestamp:  2021-03-04T14:19:13Z
  Generation:          1
  Resource Version:    24847
  Self Link:           /apis/sparkoperator.k8s.io/v1beta2/namespaces/default/sparkapplications/detector-app
  UID:                 ca47d4f4-5fdf-44f0-9e7f-055722090444
Spec:
  Deps:
    Files:
      local:///opt/rt/spark-app/application.conf
    Jars:
      local:///opt/rt/spark/extra-jars/spark-sql-kafka-0-10_2.11-2.4.4.jar
      local:///opt/rt/spark/extra-jars/spark-streaming-kafka-0-10_2.11-2.4.4.jar
      local:///opt/rt/spark/extra-jars/kafka-clients-2.6.0.jar
      local:///opt/rt/spark/extra-jars/jmx_prometheus_javaagent-0.12.0.jar
      local:///opt/rt/spark/extra-jars/postgresql-42.1.4.jar
      local:///opt/rt/spark/jars/vertica-jdbc-9.3.0-0.jar
  Driver:
    Core Limit:  1000m
    Cores:       1
    Env:
      Name:   TEST
      Value:  /opt/rt/spark/conf
    Init Containers:
      Env:
        Name:  POD_NAME
        Value From:
          Field Ref:
            Field Path:  metadata.name
        Name:            POD_IP
        Value From:
          Field Ref:
            Field Path:  status.podIP
        Name:            SAN_HOSTS
        Value:           135.228.130.234,$(POD_IP),anomaly-detector-app-ui-svc.default.svc.cluster.local
        Name:            EXTERNAL_KEYSTORE_PASSWORD
        Value:           rt1!
        Name:            EXTERNAL_TRUSTSTORE_PASSWORD
        Value:           rt1!
        Name:            INTERNAL_KEYSTORE_PASSWORD
        Value:           rt1!
        Name:            INTERNAL_TRUSTSTORE_PASSWORD
        Value:           rt1!
      Image:             orbw-artifactory.ca.alcatel-lucent.com:8081/rt-docker-virtual/rt-pki-container:21.3.0-rel.11
      Name:              rt-pki
      Pull Policy:       IfNotPresent
      Volume Mounts:
        Mount Path:  /opt/rt/os/ssl
        Name:        shared-tls-volume
        Mount Path:  /rtTls
        Name:        rt-tls-zip
      Command:
        sh
        -c
        grep "standby" /opt/checkpoint-data/drStatus.log
if [ $? == 0 ]; then
  find /opt/checkpoint-data -type f -delete
fi

      Image:        orbw-artifactory.ca.alcatel-lucent.com:8081/rt-docker-virtual/tools/kubectl:v1.14.10-nano
      Name:         cleanup-checkpoint
      Pull Policy:  IfNotPresent
      Volume Mounts:
        Mount Path:  /opt/checkpoint-data
        Name:        checkpoint-volume
        Sub Path:    baseline-app/rt-detector/checkpoint-data
    Java Options:    -Dlog4j.configuration=file:///opt/rt/spark/conf/detector_driver-log4j.properties -Drt.application.conf.dir=/opt/rt/spark-app -XX:+PrintClassHistogramBeforeFullGC  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintPromotionFailure -XX:+PrintAdaptiveSizePolicy -XX:+PrintReferenceGC -Xloggc:/opt/spark/logs/rtosgc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=20 -XX:GCLogFileSize=5M -XX:+UseCompressedOops -XX:+UseG1GC -XX:+UseContainerSupport -XX:InitialRAMPercentage=50.0 -XX:MinRAMPercentage=20.0 -XX:MaxRAMPercentage=85.0  -XX:InitiatingHeapOccupancyPercent=25
    Labels:
      Sparkapp:  detector
    Lifecycle:
      Pre Stop:
        Exec:
          Command:
            /bin/sh
            -c
            response=$(curl -s -k \
-X GET \
-H "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" \
-H "Accept: application/json" \
-H "Content-Type: application/json" \
https://kubernetes.default.svc/apis/apps/v1/namespaces/default/deployments/spark-operator)
response=`echo $response | grep -o -E "\"rt/role\"[^}]*" | awk -v FS="\"rt/role\": " 'NF>1{print $2}' | awk -F '"' '{print $2}'`
echo "$(date), site status is $response" > /opt/checkpoint-data/drStatus.log

    Memory:           1G
    Memory Overhead:  1G
    Pod Security Context:
      Run As User:  1000
    Security Context:
      Fs Group:         1000
      Run As Group:     1000
      Run As Non Root:  true
      Run As User:      1000
    Service Account:    rt
    Volume Mounts:
      Mount Path:  /opt/rt/spark-app/conf/ignite.xml
      Name:        ignite-configs-vol
      Sub Path:    ignite.xml
      Mount Path:  /opt/rt/spark/conf
      Name:        log4j-configs-vol
      Mount Path:  /opt/rt/spark-app/application.conf
      Name:        application-conf-configs-vol
      Sub Path:    detector-application.conf
      Mount Path:  /opt/rt/spark-app/logs
      Name:        sparklog
      Mount Path:  /opt/checkpoint-data
      Name:        checkpoint-volume
      Sub Path:    baseline-app/detector/checkpoint-data
      Mount Path:  /opt/rt/os/ssl
      Name:        shared-tls-volume
      Mount Path:  /rtTls
      Name:        rt-tls-zip
  Executor:
    Affinity:
      Pod Anti Affinity:
        Preferred During Scheduling Ignored During Execution:
          Pod Affinity Term:
            Label Selector:
              Match Expressions:
                Key:       sparkapp
                Operator:  In
                Values:
                  detector-executor
            Topology Key:  kubernetes.io/hostname
          Weight:          100
    Core Limit:            1000m
    Cores:                 1
    Init Containers:
      Env:
        Name:  POD_NAME
        Value From:
          Field Ref:
            Field Path:  metadata.name
        Name:            POD_IP
        Value From:
          Field Ref:
            Field Path:  status.podIP
        Name:            SAN_HOSTS
        Value:           135.228.130.234,$(POD_IP),anomaly-detector-app-ui-svc.default.svc.cluster.local
        Name:            EXTERNAL_KEYSTORE_PASSWORD
        Value:           rt1!
        Name:            EXTERNAL_TRUSTSTORE_PASSWORD
        Value:           rt1!
        Name:            INTERNAL_KEYSTORE_PASSWORD
        Value:           rt1!
        Name:            INTERNAL_TRUSTSTORE_PASSWORD
        Value:           rt1!
      Image:             orbw-artifactory.ca.alcatel-lucent.com:8081/rt-docker-virtual/rt-pki-container:21.3.0-rel.11
      Name:              rt-pki
      Pull Policy:       IfNotPresent
      Volume Mounts:
        Mount Path:  /opt/rt/os/ssl
        Name:        shared-tls-volume
        Mount Path:  /rtTls
        Name:        rt-tls-zip
    Instances:       3
    Java Options:    -Dlog4j.configuration=file:///opt/rt/spark/conf/detector_executor-log4j.properties -Drt.application.conf.dir=/opt/rt/spark-app -XX:+PrintClassHistogramBeforeFullGC  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintPromotionFailure -XX:+PrintAdaptiveSizePolicy -XX:+PrintReferenceGC -Xloggc:/opt/spark/logs/rtosgc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=20 -XX:GCLogFileSize=5M -XX:+UseCompressedOops -XX:+UseG1GC -XX:+UseContainerSupport -XX:InitialRAMPercentage=50.0 -XX:MinRAMPercentage=20.0 -XX:MaxRAMPercentage=85.0  -XX:InitiatingHeapOccupancyPercent=25
    Labels:
      Sparkapp:       detector-executor
    Memory:           1G
    Memory Overhead:  3G
    Pod Security Context:
      Run As User:        1000
    Priority Class Name:  rt-high-priority
    Security Context:
      Fs Group:         1000
      Run As Group:     1000
      Run As Non Root:  true
      Run As User:      1000
    Volume Mounts:
      Mount Path:         /opt/rt/spark-app/conf/ignite.xml
      Name:               ignite-configs-vol
      Sub Path:           ignite.xml
      Mount Path:         /opt/rt/spark/conf
      Name:               log4j-configs-vol
      Mount Path:         /opt/rt/spark-app/application.conf
      Name:               application-conf-configs-vol
      Sub Path:           detector-application.conf
      Mount Path:         /opt/rt/spark-app/logs
      Name:               sparklog
      Mount Path:         /opt/checkpoint-data
      Name:               checkpoint-volume
      Sub Path:           baseline-app/detector/checkpoint-data
      Mount Path:         /opt/rt/os/ssl
      Name:               shared-tls-volume
      Mount Path:         /rtTls
      Name:               rt-tls-zip
  Image:                  orbw-artifactory.ca.alcatel-lucent.com:8081/rt-docker-virtual/baseline-app:21.3.0-rel.88
  Image Pull Policy:      IfNotPresent
  Main Application File:  local:///opt/rt/spark-app/baseline-spark-all.jar
  Main Class:             com.rt.baseline.spark.detector.AnomalyDetectorApp
  Mode:                   cluster
  Monitoring:
    Expose Driver Metrics:    true
    Expose Executor Metrics:  true
    Prometheus:
      Jmx Exporter Jar:  /opt/rt/spark/extra-jars/jmx_prometheus_javaagent-0.12.0.jar
  Restart Policy:
    On Failure Retry Interval:             2
    On Submission Failure Retry Interval:  3
    Type:                                  Always
  Spark Conf:
    spark.cleaner.periodicGC.interval:         30s
    spark.kryo.unsafe:                         true
    spark.kryoserializer.buffer.max:           32m
    spark.kubernetes.driver.memoryOverhead:    0.4
    spark.kubernetes.executor.memoryOverhead:  0.4
    spark.memory.fraction:                     0.85
    spark.memory.storageFraction:              0.1
    spark.serializer:                          org.apache.spark.serializer.KryoSerializer
    spark.sql.streaming.checkpointLocation:    file:///opt/checkpoint-data
    spark.ui.port:                             40452
  Spark Version:                               2.4.4
  Type:                                        Scala
  Volumes:
    Config Map:
      Name:  detector-ignite-configmap
    Name:    ignite-configs-vol
    Config Map:
      Name:  detector-log4j-configmap
    Name:    log4j-configs-vol
    Config Map:
      Name:  detector-application-conf-configmap
    Name:    application-conf-configs-vol
    Empty Dir:
    Name:  sparklog
    Name:  checkpoint-volume
    Persistent Volume Claim:
      Claim Name:  checkpoint-detector-glusterfs-data
    Empty Dir:
    Name:  shared-tls-volume
    Name:  rt-tls-zip
    Secret:
      Secret Name:  rt-tls
Status:
  Application State:
    Error Message:  driver pod not found
    State:          PENDING_RERUN
  Driver Info:
    Pod Name:          anomaly-detector-app-driver
  Execution Attempts:  1
  Executor State:
    baseline-anomaly-detector-1614868827262-exec-1:  FAILED
    baseline-anomaly-detector-1614868827845-exec-2:  FAILED
    baseline-anomaly-detector-1614868828061-exec-3:  FAILED
  Last Submission Attempt Time:                      2021-03-04T14:40:01Z
  Spark Application Id:                              spark-06341824ebb6479bab006225f21347e9
  Submission Attempts:                               4
  Submission ID:                                     b8ae3c89-cbf3-4dd9-9e25-e1ab797695e4
  Termination Time:                                  2021-03-04T15:19:23Z
Events: 

I didn't see any errors in the operator log, so I'm confused about what is happening under the hood.
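
For reference, the workaround above just bounces the operator pod so its Deployment recreates it and the fresh pod re-reconciles the stuck application. Roughly something like this (the operator namespace and label selector are assumptions, adjust them for your install):

# list the running operator pod(s) (namespace and label selector are assumptions)
kubectl -n spark-operator get pods -l app.kubernetes.io/name=spark-operator
# delete it; the operator Deployment recreates it and the new pod re-submits the stuck app
kubectl -n spark-operator delete pod -l app.kubernetes.io/name=spark-operator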

Thanks, Wenjing

djdillon commented 3 years ago

We are hitting the same issue. Any progress on fixing this or insight into what was happening?

djdillon commented 3 years ago

We are on spark-operator 3.1.1 if that helps. @FloraZhang any luck?

FloraZhang commented 3 years ago

@djdillon Unfortunately I'm still having this issue from time to time.

KlausTele commented 2 years ago

Having the issue as well. @FloraZhang @djdillon have you found a resolution?

KlausTele commented 2 years ago

For us, onFailureRetries was accidentally set to 0. Setting onFailureRetries to a large value helped in our case. In the SparkApplication in this ticket it is not set, and as an int32 it defaults to 0.
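
For anyone who wants to try the same fix, the retry settings live under spec.restartPolicy. A hedged sketch of setting them on the application from this ticket with kubectl patch (the retry values are arbitrary placeholders, and whether onFailureRetries is honoured for restartPolicy type Always is worth double-checking):

# set explicit retry counts and intervals so the int32 zero default is not used
kubectl patch sparkapplication detector-app --type merge -p \
  '{"spec":{"restartPolicy":{"type":"Always","onFailureRetries":10,"onFailureRetryInterval":10,"onSubmissionFailureRetries":5,"onSubmissionFailureRetryInterval":20}}}'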

josecsotomorales commented 2 years ago

I'm also getting this issue; it seems like an invalid state when replacing an existing Spark app. Here are my operator logs:

I0328 22:07:16.115394      10 controller.go:223] SparkApplication default/develop-app-spark was updated, enqueuing it
I0328 22:07:16.115432      10 controller.go:263] Starting processing key: "default/develop-app-spark"
I0328 22:07:16.115510      10 event.go:282] Event(v1.ObjectReference{Kind:"SparkApplication", Namespace:"default", Name:"develop-app-spark", UID:"cdbb7f3f-9aa7-4b91-aac0-3ba12e4d6845", APIVersion:"sparkoperator.k8s.io/v1beta2", ResourceVersion:"40925388", FieldPath:""}): type: 'Normal' reason: 'SparkApplicationSpecUpdateProcessed' Successfully processed spec update for SparkApplication develop-app-spark
I0328 22:07:16.115631      10 controller.go:590] SparkApplication default/develop-app-spark is pending rerun
I0328 22:07:16.117242      10 controller.go:223] SparkApplication default/develop-app-spark was updated, enqueuing it
I0328 22:07:16.124532      10 controller.go:270] Ending processing key: "default/develop-app-spark"
I0328 22:07:16.124569      10 controller.go:263] Starting processing key: "default/develop-app-spark"
I0328 22:07:16.124610      10 controller.go:858] Deleting pod develop-app-spark-driver in namespace default
I0328 22:07:16.136951      10 controller.go:822] Update the status of SparkApplication default/develop-app-spark from:
{
  "submissionID": "eff438ec-dc5c-4052-9654-2757ef2bab09",
  "lastSubmissionAttemptTime": null,
  "terminationTime": null,
  "driverInfo": {},
  "applicationState": {
    "state": "INVALIDATING"
  }
}
to:
{
  "submissionID": "eff438ec-dc5c-4052-9654-2757ef2bab09",
  "lastSubmissionAttemptTime": null,
  "terminationTime": null,
  "driverInfo": {},
  "applicationState": {
    "state": "PENDING_RERUN"
  }
}
I0328 22:07:16.138947      10 spark_pod_eventhandler.go:58] Pod develop-app-spark-driver updated in namespace default.
I0328 22:07:16.155989      10 controller.go:223] SparkApplication default/develop-app-spark was updated, enqueuing it
I0328 22:07:16.160913      10 controller.go:270] Ending processing key: "default/develop-app-spark"
I0328 22:07:16.160968      10 controller.go:263] Starting processing key: "default/develop-app-spark"
I0328 22:07:16.161039      10 controller.go:590] SparkApplication default/develop-app-spark is pending rerun
I0328 22:07:16.167172      10 controller.go:270] Ending processing key: "default/develop-app-spark"

It gets stuck in the PENDING_RERUN state:

kubectl get sparkapplications
NAME                STATUS          ATTEMPTS   START        FINISH       AGE
develop-app-spark   PENDING_RERUN              <no value>   <no value>   6d

anry007 commented 2 years ago

We are also hitting this issue in one of our environments. It seems suspicious that "driverInfo": {} is not populated and that executorState is not set at all. Is that the root cause?

Are there any plans to fix this issue any time soon?

RamakrishnaHande commented 1 year ago

Hi, does anyone know if there is a Prometheus metric for the PENDING_RERUN state?

jacobsalway commented 1 year ago

For anyone else having this issue, I believe it's due to resource deletion not being finalised after all onFailureRetries have been exhausted. Once the retries have been exhausted, the app transitions to PENDING_RERUN and the controller checks whether the application's resources have been cleaned up. However, I think there's a silent failure on the line I've linked below: if validateSparkResourceDeletion returns false, the application is not re-submitted, its state is not transitioned, and no error is logged or surfaced either.

This issue has been transient and I haven't had much luck reliably reproducing it so far, but I suspect you could simply increase onFailureRetries and onFailureRetryInterval so the controller retries for longer before checking for resource deletion.

Thinking about how to fix this properly: I would expect errors from inside the validateSparkResourceDeletion function to be surfaced in the logs, and the app to transition to a FAILED state if this occurs.

https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/5f2efd4ff97e7c0bfdb726a066118d3401576730/pkg/controller/sparkapplication/controller.go#L605
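
Until that's fixed, a manual way to unstick an application along these lines is to check whether the old driver resources still exist and delete them so the deletion check can pass on the next reconcile. A rough sketch (the pod name is taken from the driverInfo in the describe output earlier in this thread; yours will differ):

# look for leftover driver resources that could keep the app in PENDING_RERUN
kubectl -n default get pod anomaly-detector-app-driver
kubectl -n default get svc | grep anomaly-detector
# remove any leftovers so the controller can re-submit on its next pass
kubectl -n default delete pod anomaly-detector-app-driver --ignore-not-found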

fahmiduldul commented 2 months ago

Hi, has anyone found a workaround for this? I'm thinking of using a cron job to check for stuck applications and automatically restart them.
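
A minimal sketch of that kind of cron check, assuming that bouncing the operator pod is enough to recover (the operator namespace and label selector are assumptions, and you would probably only want to act on apps that have been stuck for a while, since PENDING_RERUN is also a normal transient state during spec updates):

#!/bin/sh
# list SparkApplications currently in PENDING_RERUN across all namespaces
stuck=$(kubectl get sparkapplications -A -o jsonpath='{range .items[?(@.status.applicationState.state=="PENDING_RERUN")]}{.metadata.namespace}/{.metadata.name}{"\n"}{end}')
if [ -n "$stuck" ]; then
  echo "stuck applications:"
  echo "$stuck"
  # restart the operator pod so the stuck apps get re-reconciled
  kubectl -n spark-operator delete pod -l app.kubernetes.io/name=spark-operator
fi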