kubeflow / spark-operator

Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.

Executor is FAILED even when it completes successfully #1290

Open · varunbpatil opened this issue 3 years ago

varunbpatil commented 3 years ago

This is unrelated to #671, because I don't see any log line of the following type in the spark-operator logs:

"Executor pod xxx not found, assuming it was deleted."


VERSIONS

Operator = v1beta2-1.2.3-3.1.1
Spark = 3.1.1
Python = 3.8

APPLICATION

examples/spark-py-pi.yaml

Name:         pyspark-pi
Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  sparkoperator.k8s.io/v1beta2
Kind:         SparkApplication
...
Spec:
  Driver:
    Core Limit:  1200m
    Cores:       1
    Labels:
      Version:        3.1.1
    Memory:           512m
    Service Account:  spark-spark
  Executor:
    Cores:      1
    Instances:  1
    Labels:
      Version:            3.1.1
    Memory:               512m
  Image:                  gcr.io/spark-operator/spark-py:v3.1.1
  Image Pull Policy:      IfNotPresent
  Main Application File:  local:///opt/spark/examples/src/main/python/pi.py
  Mode:                   cluster
  Python Version:         3
  Restart Policy:
    On Failure Retries:                    3
    On Failure Retry Interval:             10
    On Submission Failure Retries:         5
    On Submission Failure Retry Interval:  20
    Type:                                  OnFailure
  Spark Version:                           3.1.1
  Type:                                    Python
Status:
  Application State:
    State:  COMPLETED
  Driver Info:
    Pod Name:             pyspark-pi-driver
    Web UI Address:       10.152.183.114:4040
    Web UI Port:          4040
    Web UI Service Name:  pyspark-pi-ui-svc
  Execution Attempts:     1
  Executor State:
    pythonpi-6b62bd7a605f132a-exec-1:  FAILED
  Last Submission Attempt Time:        2021-07-01T04:41:08Z
  Spark Application Id:                spark-05cf08f5cdcd436d8ac76efe96a9f660
  Submission Attempts:                 1
  Submission ID:                       189085d8-f160-402b-afbd-161cc824f1b0
  Termination Time:                    2021-07-01T04:41:44Z
Events:
  Type     Reason                     Age                    From            Message
  ----     ------                     ----                   ----            -------
  Normal   SparkApplicationAdded      2m55s                  spark-operator  SparkApplication pyspark-pi was added, enqueuing it for submission
  Normal   SparkApplicationSubmitted  2m51s                  spark-operator  SparkApplication pyspark-pi was submitted successfully
  Normal   SparkDriverRunning         2m46s                  spark-operator  Driver pyspark-pi-driver is running
  Normal   SparkExecutorPending       2m32s (x2 over 2m32s)  spark-operator  Executor pythonpi-6b62bd7a605f132a-exec-1 is pending
  Normal   SparkExecutorRunning       2m28s                  spark-operator  Executor pythonpi-6b62bd7a605f132a-exec-1 is running
  Warning  SparkExecutorFailed        2m15s                  spark-operator  Executor pythonpi-6b62bd7a605f132a-exec-1 failed
  Normal   SparkDriverCompleted       2m15s                  spark-operator  Driver pyspark-pi-driver completed
  Normal   SparkApplicationCompleted  2m14s                  spark-operator  SparkApplication pyspark-pi completed
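
Note the contradiction in the status block above: applicationState is COMPLETED while the only executor is recorded as FAILED. Until this is fixed, anything gating on this CR should key off applicationState rather than executorState. A minimal Go sketch that flags the mismatch (helper names are mine; the field shapes are copied from the status JSON the operator logs below):

```go
// Sketch only: decode the status JSON the operator writes and flag the
// COMPLETED-app / FAILED-executor contradiction shown above.
package main

import (
	"encoding/json"
	"fmt"
)

type appStatus struct {
	ApplicationState struct {
		State string `json:"state"`
	} `json:"applicationState"`
	ExecutorState map[string]string `json:"executorState"`
}

func main() {
	raw := []byte(`{
	  "applicationState": {"state": "COMPLETED"},
	  "executorState": {"pythonpi-6b62bd7a605f132a-exec-1": "FAILED"}
	}`)

	var st appStatus
	if err := json.Unmarshal(raw, &st); err != nil {
		panic(err)
	}
	for name, state := range st.ExecutorState {
		if st.ApplicationState.State == "COMPLETED" && state == "FAILED" {
			fmt.Printf("inconsistent status: app COMPLETED but executor %s FAILED\n", name)
		}
	}
}
```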


DRIVER LOGS

...
Pi is roughly 3.138680
21/07/01 04:10:34 INFO SparkUI: Stopped Spark web UI at http://delta-test-45ea187a6041573c-driver-svc.default.svc:4040
21/07/01 04:10:34 INFO KubernetesClusterSchedulerBackend: Shutting down all executors
21/07/01 04:10:34 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking each executor to shut down
21/07/01 04:10:34 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed (this is expected if the application is shutting down.)
21/07/01 04:10:35 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
21/07/01 04:10:35 INFO MemoryStore: MemoryStore cleared
21/07/01 04:10:35 INFO BlockManager: BlockManager stopped
21/07/01 04:10:35 INFO BlockManagerMaster: BlockManagerMaster stopped
21/07/01 04:10:35 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
21/07/01 04:10:35 INFO SparkContext: Successfully stopped SparkContext
21/07/01 04:10:36 INFO ShutdownHookManager: Shutdown hook called
21/07/01 04:10:36 INFO ShutdownHookManager: Deleting directory /tmp/spark-2cbc37fa-2899-47d1-a600-a9ddbe08f803
21/07/01 04:10:36 INFO ShutdownHookManager: Deleting directory /var/data/spark-b8478489-3024-4eb9-94d5-e3cf3bebc10b/spark-ed1e6fe4-eafe-4488-beee-546b1dd4017d/pyspark-560efbb6-33b0-4697-9c58-155a1c33c362
21/07/01 04:10:36 INFO ShutdownHookManager: Deleting directory /var/data/spark-b8478489-3024-4eb9-94d5-e3cf3bebc10b/spark-ed1e6fe4-eafe-4488-beee-546b1dd4017d


EXECUTOR LOGS

...
21/07/01 04:41:33 INFO DiskBlockManager: Created local directory at /var/data/spark-27b5d96f-1d93-4418-bfdb-bd30ee3b4192/blockmgr-9119b6c2-2d64-4dd3-91ab-87e26ab8fb5f
21/07/01 04:41:34 INFO MemoryStore: MemoryStore started with capacity 117.0 MiB
21/07/01 04:41:34 INFO CoarseGrainedExecutorBackend: Connecting to driver: spark://CoarseGrainedScheduler@pyspark-pi-0597467a605ed832-driver-svc.default.svc:7078
21/07/01 04:41:34 INFO ResourceUtils: ==============================================================
21/07/01 04:41:34 INFO ResourceUtils: No custom resources configured for spark.executor.
21/07/01 04:41:34 INFO ResourceUtils: ==============================================================
21/07/01 04:41:34 INFO CoarseGrainedExecutorBackend: Successfully registered with driver
21/07/01 04:41:34 INFO Executor: Starting executor ID 1 on host 10.1.30.238
21/07/01 04:41:34 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 45005.
21/07/01 04:41:34 INFO NettyBlockTransferService: Server created on 10.1.30.238:45005
21/07/01 04:41:34 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
21/07/01 04:41:34 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(1, 10.1.30.238, 45005, None)
21/07/01 04:41:34 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(1, 10.1.30.238, 45005, None)
21/07/01 04:41:34 INFO BlockManager: Initialized BlockManager: BlockManagerId(1, 10.1.30.238, 45005, None)
21/07/01 04:41:39 INFO CoarseGrainedExecutorBackend: Got assigned task 0
21/07/01 04:41:39 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
21/07/01 04:41:40 INFO TorrentBroadcast: Started reading broadcast variable 0 with 1 pieces (estimated total size 4.0 MiB)
21/07/01 04:41:40 INFO TransportClientFactory: Successfully created connection to pyspark-pi-0597467a605ed832-driver-svc.default.svc/10.1.30.204:7079 after 3 ms (0 ms spent in bootstraps)
21/07/01 04:41:40 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 7.9 KiB, free 117.0 MiB)
21/07/01 04:41:40 INFO TorrentBroadcast: Reading broadcast variable 0 took 150 ms
21/07/01 04:41:40 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 10.2 KiB, free 116.9 MiB)
21/07/01 04:41:40 INFO PythonRunner: Times: total = 437, boot = 337, init = 20, finish = 80
21/07/01 04:41:40 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 1357 bytes result sent to driver
21/07/01 04:41:40 INFO CoarseGrainedExecutorBackend: Got assigned task 1
21/07/01 04:41:40 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
21/07/01 04:41:41 INFO PythonRunner: Times: total = 92, boot = -48, init = 50, finish = 90
21/07/01 04:41:41 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 1314 bytes result sent to driver
21/07/01 04:41:41 INFO CoarseGrainedExecutorBackend: Driver commanded a shutdown
21/07/01 04:41:41 INFO MemoryStore: MemoryStore cleared
21/07/01 04:41:41 INFO BlockManager: BlockManager stopped
21/07/01 04:41:41 INFO ShutdownHookManager: Shutdown hook called
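
The executor's last lines show a clean, driver-commanded shutdown, yet the operator records FAILED. One way to see what the operator saw is to read the executor pod's terminated container status before the pod is deleted (it is gone about two seconds after completion, per the operator logs below). A client-go sketch, assuming a reachable kubeconfig; the pod name and namespace are taken from this report:

```go
package main

import (
	"context"
	"fmt"
	"os"
	"path/filepath"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the default kubeconfig location.
	kubeconfig := filepath.Join(os.Getenv("HOME"), ".kube", "config")
	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	pod, err := client.CoreV1().Pods("default").Get(context.TODO(),
		"pythonpi-6b62bd7a605f132a-exec-1", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// The pod phase is what the operator reacts to; the container's exit
	// code is what distinguishes a clean shutdown from a real failure.
	fmt.Println("phase:", pod.Status.Phase)
	for _, s := range pod.Status.ContainerStatuses {
		if t := s.State.Terminated; t != nil {
			fmt.Printf("container %s: exitCode=%d reason=%s\n",
				s.Name, t.ExitCode, t.Reason)
		}
	}
}
```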


OPERATOR LOGS

...
I0701 04:41:44.512024       9 controller.go:795] Update the status of SparkApplication default/pyspark-pi from:
{
  "sparkApplicationId": "spark-05cf08f5cdcd436d8ac76efe96a9f660",
  "submissionID": "189085d8-f160-402b-afbd-161cc824f1b0",
  "lastSubmissionAttemptTime": "2021-07-01T04:41:08Z",
  "terminationTime": null,
  "driverInfo": {
    "webUIServiceName": "pyspark-pi-ui-svc",
    "webUIPort": 4040,
    "webUIAddress": "10.152.183.114:4040",
    "podName": "pyspark-pi-driver"
  },
  "applicationState": {
    "state": "RUNNING"
  },
  "executorState": {
    "pythonpi-6b62bd7a605f132a-exec-1": "RUNNING"
  },
  "executionAttempts": 1,
  "submissionAttempts": 1
}
to:
{
  "sparkApplicationId": "spark-05cf08f5cdcd436d8ac76efe96a9f660",
  "submissionID": "189085d8-f160-402b-afbd-161cc824f1b0",
  "lastSubmissionAttemptTime": "2021-07-01T04:41:08Z",
  "terminationTime": null,
  "driverInfo": {
    "webUIServiceName": "pyspark-pi-ui-svc",
    "webUIPort": 4040,
    "webUIAddress": "10.152.183.114:4040",
    "podName": "pyspark-pi-driver"
  },
  "applicationState": {
    "state": "RUNNING"
  },
  "executorState": {
    "pythonpi-6b62bd7a605f132a-exec-1": "FAILED"
  },
  "executionAttempts": 1,
  "submissionAttempts": 1
}
I0701 04:41:44.513454       9 event.go:282] Event(v1.ObjectReference{Kind:"SparkApplication", Namespace:"default", Name:"pyspark-pi", UID:"4f11bd7a-33c6-49c2-ac80-9eae17a69423", APIVersion:"sparkoperator.k8s.io/v1beta2", ResourceVersion:"300534", FieldPath:""}): type: 'Warning' reason: 'SparkExecutorFailed' Executor pythonpi-6b62bd7a605f132a-exec-1 failed
I0701 04:41:44.864026       9 sparkapp_metrics.go:273] Exporting Metrics for Executor pythonpi-6b62bd7a605f132a-exec-1. OldState: RUNNING NewState: FAILED
I0701 04:41:44.864116       9 metrics.go:125] Decrementing spark_app_executor_running_count with labels map[app_type:Unknown] metricVal to 0
I0701 04:41:44.864178       9 controller.go:270] Ending processing key: "default/pyspark-pi"
I0701 04:41:44.870072       9 controller.go:223] SparkApplication default/pyspark-pi was updated, enqueuing it
I0701 04:41:44.870149       9 controller.go:263] Starting processing key: "default/pyspark-pi"
I0701 04:41:44.870295       9 controller.go:270] Ending processing key: "default/pyspark-pi"
I0701 04:41:44.949853       9 spark_pod_eventhandler.go:58] Pod pyspark-pi-driver updated in namespace default.
I0701 04:41:44.949922       9 spark_pod_eventhandler.go:95] Enqueuing SparkApplication default/pyspark-pi for app update processing.
I0701 04:41:44.949971       9 controller.go:263] Starting processing key: "default/pyspark-pi"
I0701 04:41:44.950448       9 controller.go:795] Update the status of SparkApplication default/pyspark-pi from:
{
  "sparkApplicationId": "spark-05cf08f5cdcd436d8ac76efe96a9f660",
  "submissionID": "189085d8-f160-402b-afbd-161cc824f1b0",
  "lastSubmissionAttemptTime": "2021-07-01T04:41:08Z",
  "terminationTime": null,
  "driverInfo": {
    "webUIServiceName": "pyspark-pi-ui-svc",
    "webUIPort": 4040,
    "webUIAddress": "10.152.183.114:4040",
    "podName": "pyspark-pi-driver"
  },
  "applicationState": {
    "state": "RUNNING"
  },
  "executorState": {
    "pythonpi-6b62bd7a605f132a-exec-1": "FAILED"
  },
  "executionAttempts": 1,
  "submissionAttempts": 1
}
to:
{
  "sparkApplicationId": "spark-05cf08f5cdcd436d8ac76efe96a9f660",
  "submissionID": "189085d8-f160-402b-afbd-161cc824f1b0",
  "lastSubmissionAttemptTime": "2021-07-01T04:41:08Z",
  "terminationTime": "2021-07-01T04:41:44Z",
  "driverInfo": {
    "webUIServiceName": "pyspark-pi-ui-svc",
    "webUIPort": 4040,
    "webUIAddress": "10.152.183.114:4040",
    "podName": "pyspark-pi-driver"
  },
  "applicationState": {
    "state": "SUCCEEDING"
  },
  "executorState": {
    "pythonpi-6b62bd7a605f132a-exec-1": "FAILED"
  },
  "executionAttempts": 1,
  "submissionAttempts": 1
}
I0701 04:41:44.951980       9 event.go:282] Event(v1.ObjectReference{Kind:"SparkApplication", Namespace:"default", Name:"pyspark-pi", UID:"4f11bd7a-33c6-49c2-ac80-9eae17a69423", APIVersion:"sparkoperator.k8s.io/v1beta2", ResourceVersion:"300557", FieldPath:""}): type: 'Normal' reason: 'SparkDriverCompleted' Driver pyspark-pi-driver completed
I0701 04:41:45.500135       9 metrics.go:125] Decrementing spark_app_running_count with labels map[app_type:Unknown] metricVal to 0
I0701 04:41:45.500230       9 controller.go:270] Ending processing key: "default/pyspark-pi"
I0701 04:41:45.503792       9 controller.go:223] SparkApplication default/pyspark-pi was updated, enqueuing it
I0701 04:41:45.503861       9 controller.go:263] Starting processing key: "default/pyspark-pi"
I0701 04:41:45.504126       9 controller.go:795] Update the status of SparkApplication default/pyspark-pi from:
{
  "sparkApplicationId": "spark-05cf08f5cdcd436d8ac76efe96a9f660",
  "submissionID": "189085d8-f160-402b-afbd-161cc824f1b0",
  "lastSubmissionAttemptTime": "2021-07-01T04:41:08Z",
  "terminationTime": "2021-07-01T04:41:44Z",
  "driverInfo": {
    "webUIServiceName": "pyspark-pi-ui-svc",
    "webUIPort": 4040,
    "webUIAddress": "10.152.183.114:4040",
    "podName": "pyspark-pi-driver"
  },
  "applicationState": {
    "state": "SUCCEEDING"
  },
  "executorState": {
    "pythonpi-6b62bd7a605f132a-exec-1": "FAILED"
  },
  "executionAttempts": 1,
  "submissionAttempts": 1
}
to:
{
  "sparkApplicationId": "spark-05cf08f5cdcd436d8ac76efe96a9f660",
  "submissionID": "189085d8-f160-402b-afbd-161cc824f1b0",
  "lastSubmissionAttemptTime": "2021-07-01T04:41:08Z",
  "terminationTime": "2021-07-01T04:41:44Z",
  "driverInfo": {
    "webUIServiceName": "pyspark-pi-ui-svc",
    "webUIPort": 4040,
    "webUIAddress": "10.152.183.114:4040",
    "podName": "pyspark-pi-driver"
  },
  "applicationState": {
    "state": "COMPLETED"
  },
  "executorState": {
    "pythonpi-6b62bd7a605f132a-exec-1": "FAILED"
  },
  "executionAttempts": 1,
  "submissionAttempts": 1
}
I0701 04:41:45.505600       9 event.go:282] Event(v1.ObjectReference{Kind:"SparkApplication", Namespace:"default", Name:"pyspark-pi", UID:"4f11bd7a-33c6-49c2-ac80-9eae17a69423", APIVersion:"sparkoperator.k8s.io/v1beta2", ResourceVersion:"300567", FieldPath:""}): type: 'Normal' reason: 'SparkApplicationCompleted' SparkApplication pyspark-pi completed
I0701 04:41:45.678375       9 controller.go:270] Ending processing key: "default/pyspark-pi"
I0701 04:41:45.685721       9 controller.go:223] SparkApplication default/pyspark-pi was updated, enqueuing it
I0701 04:41:45.685825       9 controller.go:263] Starting processing key: "default/pyspark-pi"
I0701 04:41:45.686589       9 controller.go:270] Ending processing key: "default/pyspark-pi"
I0701 04:41:46.658398       9 spark_pod_eventhandler.go:58] Pod pythonpi-6b62bd7a605f132a-exec-1 updated in namespace default.
I0701 04:41:46.658426       9 spark_pod_eventhandler.go:95] Enqueuing SparkApplication default/pyspark-pi for app update processing.
I0701 04:41:46.658447       9 controller.go:263] Starting processing key: "default/pyspark-pi"
I0701 04:41:46.658540       9 controller.go:270] Ending processing key: "default/pyspark-pi"
I0701 04:41:46.710940       9 spark_pod_eventhandler.go:77] Pod pythonpi-6b62bd7a605f132a-exec-1 deleted in namespace default.
I0701 04:41:46.710996       9 spark_pod_eventhandler.go:95] Enqueuing SparkApplication default/pyspark-pi for app update processing.
I0701 04:41:46.711042       9 controller.go:263] Starting processing key: "default/pyspark-pi"
I0701 04:41:46.711172       9 controller.go:270] Ending processing key: "default/pyspark-pi"
oleksiilopasov commented 3 years ago

I'm hitting the same issue. As far as I can tell, there was a fix in https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/pull/716/files, but it was later undone by https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/pull/1111/files.

AFAIK the operator derives the executor status from indirect signals rather than from the executor's exit code. That logic seems questionable, but maybe the maintainers know something we don't.
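
To make the "indirect signals" point concrete, here is a simplified sketch of the two approaches (illustrative names only, not the operator's actual code): mapping the executor state straight from the pod phase, versus also consulting the container's exit code. When the driver tears executors down at the end of a run, the pod can surface as Failed even though the container exited cleanly, and phase-only logic reports FAILED:

```go
package sketch

import corev1 "k8s.io/api/core/v1"

// Phase-only mapping: whatever the pod phase says, the executor gets.
func stateFromPhase(pod *corev1.Pod) string {
	switch pod.Status.Phase {
	case corev1.PodSucceeded:
		return "COMPLETED"
	case corev1.PodFailed:
		// A driver-initiated teardown can land here as well.
		return "FAILED"
	case corev1.PodRunning:
		return "RUNNING"
	default:
		return "PENDING"
	}
}

// Exit-code-aware mapping: before trusting PodFailed, check whether any
// container actually terminated with a non-zero exit code.
func stateFromExitCodes(pod *corev1.Pod) string {
	if pod.Status.Phase != corev1.PodFailed {
		return stateFromPhase(pod)
	}
	for _, s := range pod.Status.ContainerStatuses {
		if t := s.State.Terminated; t != nil && t.ExitCode != 0 {
			return "FAILED"
		}
	}
	// All containers exited 0: treat the teardown as a normal completion.
	return "COMPLETED"
}
```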

varunbpatil commented 3 years ago

Adding @ImpSy, the author of https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/pull/1111/files, for comment. This seems like a pretty big blocker, since the executor state ends up FAILED pretty much every time.

snowsky commented 1 year ago

I may be running into the same issue. The job keeps running even after the SparkApplication is complete. Can someone please revert the changes? :)

github-actions[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] commented 1 month ago

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

ChenYi015 commented 3 weeks ago

/reopen

google-oss-prow[bot] commented 3 weeks ago

@ChenYi015: Reopened this issue.

In response to [this](https://github.com/kubeflow/spark-operator/issues/1290#issuecomment-2434409550):

> /reopen

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.