kubeflow / spark-operator

Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.
Apache License 2.0

[BUG] SparkApplication in FAILING state has finish time #2118

Open BalaMahesh opened 2 months ago

BalaMahesh commented 2 months ago

Description

We are using Spark operator v1beta2-1.6.2-3.5.0 in production. We have a SparkApplication with the following restart policy:

  restartPolicy:
    onFailureRetryInterval: 10
    onSubmissionFailureRetries: 5
    onSubmissionFailureRetryInterval: 20
    type: Always

When the driver pod failed for some reason, the operator logged the following:

UID:"263d8e51-4a32-4832-94b5-f73043e4bd69", APIVersion:"sparkoperator.k8s.io/v1beta2", ResourceVersion:"1343465049", FieldPath:""}): type: 'Warning' reason: 'SparkDriverFailed' Driver **-regular-driver failed
I0810 13:25:06.814383      10 controller.go:860] Update the status of SparkApplication namespace/**-regular from:
{
  "sparkApplicationId": "spark-5a8712b706984da5bd90ac3fa7f35b86",
  "submissionID": "9b80c355-fc31-4680-b3ce-7ce34f7c31ec",
  "lastSubmissionAttemptTime": "2024-08-10T10:26:46Z",
  "terminationTime": null,
  "driverInfo": {
    "webUIServiceName": "**-regular-ui-svc",
    "webUIPort": 4040,
    "webUIAddress": "10.204.49.228:0",
    "podName": "**-regular-driver"
  },
  "applicationState": {
    "state": "RUNNING"
  },
  "executorState": {
    "**-8f53b6913bd57dcf-exec-7": "UNKNOWN",
    "**-8f53b6913bd57dcf-exec-8": "FAILED",
    "**-8f53b6913bd57dcf-exec-9": "COMPLETED"
  },
  "executionAttempts": 63,
  "submissionAttempts": 1
}
to:
{
  "sparkApplicationId": "spark-5a8712b706984da5bd90ac3fa7f35b86",
  "submissionID": "9b80c355-fc31-4680-b3ce-7ce34f7c31ec",
  "lastSubmissionAttemptTime": "2024-08-10T10:26:46Z",
  "terminationTime": "2024-08-10T13:25:06Z",
  "driverInfo": {
    "webUIServiceName": "**-regular-ui-svc",
    "webUIPort": 4040,
    "webUIAddress": "10.204.49.228:0",
    "podName": "**-regular-driver"
  },
  "applicationState": {
    "state": "FAILING",
    "errorMessage": "driver container failed with ExitCode: 1, Reason: Error"
  },
  "executorState": {
    "**-8f53b6913bd57dcf-exec-8": "FAILED",
    "**-8f53b6913bd57dcf-exec-9": "COMPLETED"
  },
  "executionAttempts": 63,
  "submissionAttempts": 1
}
I0810 13:25:06.827513      10 metrics.go:125] Decrementing spark_app_running_count with labels map[app_type:Unknown] metricVal to 4
I0810 13:25:06.827552      10 sparkapp_metrics.go:287] Exporting Metrics for Executor **-8f53b6913bd57dcf-exec-4. OldState: UNKNOWN NewState: FAILED
I0810 13:25:06.827564      10 metrics.go:125] Decrementing spark_app_executor_running_count with labels map[app_type:Unknown] metricVal to 29
I0810 13:25:06.827571      10 sparkapp_metrics.go:287] Exporting Metrics for Executor **-8f53b6913bd57dcf-exec-18. OldState: UNKNOWN NewState: FAILED
I0810 13:25:06.827575      10 metrics.go:125] Decrementing spark_app_executor_running_count with labels map[app_type:Unknown] metricVal to 28
I0810 13:25:06.827579      10 sparkapp_metrics.go:287] Exporting Metrics for Executor **-8f53b6913bd57dcf-exec-19. OldState: UNKNOWN NewState: FAILED
I0810 13:25:06.827586      10 metrics.go:125] Decrementing spark_app_executor_running_count with labels map[app_type:Unknown] metricVal to 27
I0810 13:25:06.827593      10 sparkapp_metrics.go:287] Exporting Metrics for Executor **-8f53b6913bd57dcf-exec-3. OldState: UNKNOWN NewState: FAILED
I0810 13:25:06.827607      10 metrics.go:125] Decrementing spark_app_executor_running_count with labels map[app_type:Unknown] metricVal to 26
I0810 13:25:06.827613      10 sparkapp_metrics.go:287] Exporting Metrics for Executor **-8f53b6913bd57dcf-exec-35. OldState: UNKNOWN NewState: FAILED
I0810 13:25:06.827622      10 metrics.go:125] Decrementing spark_app_executor_running_count with labels map[app_type:Unknown] metricVal to 25
I0810 13:25:06.827631      10 sparkapp_metrics.go:287] Exporting Metrics for Executor **-8f53b6913bd57dcf-exec-7. OldState: UNKNOWN NewState: FAILED
I0810 13:25:06.827640      10 metrics.go:125] Decrementing spark_app_executor_running_count with labels map[app_type:Unknown] metricVal to 24
I0810 13:25:06.827647      10 sparkapp_metrics.go:287] Exporting Metrics for Executor **-8f53b6913bd57dcf-exec-1. OldState: UNKNOWN NewState: FAILED
I0810 13:25:06.827654      10 metrics.go:125] Decrementing spark_app_executor_running_count with labels map[app_type:Unknown] metricVal to 23
I0810 13:25:06.827662      10 sparkapp_metrics.go:287] Exporting Metrics for Executor **-8f53b6913bd57dcf-exec-24. OldState: UNKNOWN NewState: FAILED
I0810 13:25:06.827669      10 metrics.go:125] Decrementing spark_app_executor_running_count with labels map[app_type:Unknown] metricVal to 22
I0810 13:25:06.827675      10 sparkapp_metrics.go:287] Exporting Metrics for Executor **-8f53b6913bd57dcf-exec-40. OldState: UNKNOWN NewState: FAILED
I0810 13:25:06.827679      10 metrics.go:125] Decrementing spark_app_executor_running_count with labels map[app_type:Unknown] metricVal to 21
I0810 13:25:06.827683      10 sparkapp_metrics.go:287] Exporting Metrics for Executor **-8f53b6913bd57dcf-exec-5. OldState: UNKNOWN NewState: FAILED
I0810 13:25:06.827689      10 metrics.go:125] Decrementing spark_app_executor_running_count with labels map[app_type:Unknown] metricVal to 20
I0810 13:25:06.827693      10 sparkapp_metrics.go:287] Exporting Metrics for Executor **-8f53b6913bd57dcf-exec-2. OldState: UNKNOWN NewState: FAILED
I0810 13:25:06.827697      10 metrics.go:125] Decrementing spark_app_executor_running_count with labels map[app_type:Unknown] metricVal to 19
I0810 13:25:06.827701      10 sparkapp_metrics.go:287] Exporting Metrics for Executor **-8f53b6913bd57dcf-exec-32. OldState: UNKNOWN NewState: FAILED
I0810 13:25:06.827707      10 metrics.go:125] Decrementing spark_app_executor_running_count with labels map[app_type:Unknown] metricVal to 18
I0810 13:25:06.827711      10 sparkapp_metrics.go:287] Exporting Metrics for Executor **-8f53b6913bd57dcf-exec-38. OldState: UNKNOWN NewState: FAILED
I0810 13:25:06.827717      10 metrics.go:125] Decrementing spark_app_executor_running_count with labels map[app_type:Unknown] metricVal to 17
I0810 13:25:06.827722      10 controller.go:274] Ending processing key: "namespace/**-regular"
I0810 13:25:06.827792      10 controller.go:227] SparkApplication namespace/**-regular was updated, enqueuing it
I0810 13:25:06.827814      10 controller.go:267] Starting processing key: "namespace/**-regular"
I0810 13:25:06.827855      10 controller.go:274] Ending processing key: "namespace/**-regular"
I0810 13:25:07.452988      10 controller.go:227] SparkApplication namespace/**-regular was updated, enqueuing it
I0810 13:25:07.453037      10 controller.go:267] Starting processing key: "namespace/**-regular"
I0810 13:25:07.453109      10 controller.go:274] Ending processing key: "namespace/**-regular"
I0810 13:25:07.909475      10 spark_pod_eventhandler.go:58] Pod **-regular-driver updated in namespace namespace.
I0810 13:25:07.909518      10 spark_pod_eventhandler.go:95] Enqueuing SparkApplication namespace/**-regular for app update processing.
I0810 13:25:07.909544      10 controller.go:267] Starting processing key: "namespace/**-regular"
I0810 13:25:07.909634      10 controller.go:274] Ending processing key: "namespace/**-regular"
I0810 13:25:07.969839      10 spark_pod_eventhandler.go:58] Pod **-regular-driver updated in namespace namespace.
I0810 13:25:07.969874      10 spark_pod_eventhandler.go:95] Enqueuing SparkApplication namespace/**-regular for app update processing.
I0810 13:25:07.969899      10 controller.go:267] Starting processing key: "namespace/**-regular"
I0810 13:25:07.969972      10 controller.go:274] Ending processing key: "namespace/**-regular"
I0810 13:25:08.822351      10 spark_pod_eventhandler.go:58] Pod **-regular-driver updated in namespace namespace.
I0810 13:25:08.822380      10 spark_pod_eventhandler.go:95] Enqueuing SparkApplication namespace/**-regular for app update processing.
I0810 13:25:08.822402      10 controller.go:267] Starting processing key: "namespace/**-regular"
I0810 13:25:08.822479      10 controller.go:274] Ending processing key: "namespace/**-regular"
I0810 13:25:37.453315      10 controller.go:227] SparkApplication namespace/**-regular was updated, enqueuing it
I0810 13:25:37.453383      10 controller.go:267] Starting processing key: "namespace/**-regular"
I0810 13:25:37.453478      10 controller.go:274] Ending processing key: "namespace/**-regular"

I0810 13:26:07.453988      10 controller.go:227] SparkApplication namespace/**-regular was updated, enqueuing it
I0810 13:26:07.454063      10 controller.go:267] Starting processing key: "namespace/**-regular"
I0810 13:26:07.454133      10 controller.go:274] Ending processing key: "namespace/**-regular"
I0810 13:26:37.454371      10 controller.go:227] SparkApplication namespace/**-regular was updated, enqueuing it
I0810 13:26:37.454424      10 controller.go:267] Starting processing key: "namespace/**-regular"
I0810 13:26:37.454507      10 controller.go:274] Ending processing key: "namespace/**-regular"
I0810 13:27:07.455311      10 controller.go:227] SparkApplication namespace/**-regular was updated, enqueuing it
I0810 13:27:07.455388      10 controller.go:267] Starting processing key: "namespace/**-regular"
I0810 13:27:07.455463      10 controller.go:274] Ending processing key: "namespace/**-regular"
I0810 13:27:37.455431      10 controller.go:227] SparkApplication namespace/**-regular was updated, enqueuing it
I0810 13:27:37.455488      10 controller.go:267] Starting processing key: "namespace/**-regular"
I0810 13:27:37.455562      10 controller.go:274] Ending processing key: "namespace/**-regular"
I0810 13:27:50.436345      10 spark_pod_eventhandler.go:58] Pod **-regular-driver updated in namespace namespace.
I0810 13:27:50.436379      10 spark_pod_eventhandler.go:95] Enqueuing SparkApplication namespace/**-regular for app update processing.
I0810 13:27:50.436405      10 controller.go:267] Starting processing key: "namespace/**-regular"
I0810 13:27:50.436492      10 controller.go:274] Ending processing key: "namespace/**-regular"
I0810 13:27:50.440231      10 spark_pod_eventhandler.go:77] Pod **-regular-driver deleted in namespace namespace.
I0810 13:27:50.440254      10 spark_pod_eventhandler.go:95] Enqueuing SparkApplication namespace/**-regular for app update processing.
I0810 13:27:50.440271      10 controller.go:267] Starting processing key: "namespace/**-regular"
I0810 13:27:50.440324      10 controller.go:274] Ending processing key: "namespace/**-regular"
I0810 13:28:07.456297      10 controller.go:227] SparkApplication namespace/**-regular was updated, enqueuing it
I0810 13:28:07.456357      10 controller.go:267] Starting processing key: "namespace/**-regular"
I0810 13:28:07.456440      10 controller.go:274] Ending processing key: "namespace/**-regular"
I0810 13:28:37.456984      10 controller.go:227] SparkApplication namespace/**-regular was updated, enqueuing it
I0810 13:28:37.457038      10 controller.go:267] Starting processing key: "namespace/**-regular"
I0810 13:28:37.457124      10 controller.go:274] Ending processing key: "namespace/**-regular"
I0810 13:29:07.457237      10 controller.go:227] SparkApplication namespace/**-regular was updated, enqueuing it
I0810 13:29:07.457300      10 controller.go:267] Starting processing key: "namespace/**-regular"
I0810 13:29:07.457400      10 controller.go:274] Ending processing key: "namespace/**-regular"

I0810 13:29:37.458045      10 controller.go:227] SparkApplication namespace/**-regular was updated, enqueuing it
I0810 13:29:37.458105      10 controller.go:267] Starting processing key: "namespace/**-regular"
I0810 13:29:37.458285      10 controller.go:274] Ending processing key: "namespace/**-regular"

I0810 13:30:07.458243      10 controller.go:227] SparkApplication namespace/**-regular was updated, enqueuing it
I0810 13:30:07.458322      10 controller.go:267] Starting processing key: "namespace/**-regular"
I0810 13:30:07.458431      10 controller.go:274] Ending processing key: "namespace/**-regular"
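To make the status transition in the log above easier to read, here is a minimal sketch that diffs the "from" and "to" status objects. The helper name `diff_status` is hypothetical, and the two objects are trimmed copies of the ones logged by the operator (driverInfo and executorState are omitted for brevity).

```python
import json


def diff_status(old: dict, new: dict, prefix: str = "") -> list[str]:
    """Return readable lines describing fields that differ between two status dicts."""
    changes = []
    for key in sorted(set(old) | set(new)):
        path = f"{prefix}{key}"
        if key not in new:
            changes.append(f"- {path}: {old[key]!r} (removed)")
        elif key not in old:
            changes.append(f"+ {path}: {new[key]!r} (added)")
        elif isinstance(old[key], dict) and isinstance(new[key], dict):
            # Recurse into nested objects such as applicationState.
            changes.extend(diff_status(old[key], new[key], prefix=f"{path}."))
        elif old[key] != new[key]:
            changes.append(f"~ {path}: {old[key]!r} -> {new[key]!r}")
    return changes


# Trimmed copies of the status objects from the operator log.
before = json.loads("""{
  "terminationTime": null,
  "applicationState": {"state": "RUNNING"},
  "executionAttempts": 63,
  "submissionAttempts": 1
}""")
after = json.loads("""{
  "terminationTime": "2024-08-10T13:25:06Z",
  "applicationState": {
    "state": "FAILING",
    "errorMessage": "driver container failed with ExitCode: 1, Reason: Error"
  },
  "executionAttempts": 63,
  "submissionAttempts": 1
}""")

changes = diff_status(before, after)
for line in changes:
    print(line)
```

On this input the only differences are the new errorMessage, the state change from RUNNING to FAILING, and terminationTime being set, which is the finish time the issue title refers to.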

The driver pod is in an error state, and the SparkApplication status is:

NAME         STATUS    ATTEMPTS   START                  FINISH                 AGE
**-regular   FAILING   63         2024-08-10T10:26:46Z   2024-08-10T13:25:06Z   4d5h
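One thing that may be worth checking with the values above is when a retry would actually be due. The sketch below ASSUMES the operator waits a linear back-off of executionAttempts × onFailureRetryInterval seconds after the finish time before resubmitting; that is my reading of the controller's behavior and is not confirmed anywhere in this thread. The function name `next_retry_due` is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def next_retry_due(termination_time: str, retry_interval_s: int, attempts_done: int) -> datetime:
    """Earliest retry time under an ASSUMED linear back-off of
    attempts_done * retry_interval_s seconds after termination."""
    finished = datetime.strptime(termination_time, "%Y-%m-%dT%H:%M:%SZ").replace(
        tzinfo=timezone.utc
    )
    return finished + timedelta(seconds=retry_interval_s * attempts_done)


# FINISH time and ATTEMPTS from the status above, onFailureRetryInterval from the spec.
due = next_retry_due("2024-08-10T13:25:06Z", retry_interval_s=10, attempts_done=63)
print(due.isoformat())
```

If that assumption holds, 63 attempts at a 10-second interval give a 630-second wait, so the retry would not be due until 13:35:36Z. Whether this explains the behavior reported here is not established; it only shows that a high attempt count can stretch the effective interval well past the configured 10 seconds.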

Only after I deleted the SparkApplication manually did the operator log the following and start the Spark application again:

I0810 13:30:31.202106      10 controller.go:896] Deleting pod **-regular-driver in namespace namespace
I0810 13:30:31.205664      10 controller.go:904] Deleting Spark UI Service **-regular-ui-svc in namespace namespace
I0810 13:30:31.224970      10 event.go:364] Event(v1.ObjectReference{Kind:"SparkApplication", Namespace:"namespace", Name:"**-regular", UID:"263d8e51-4a32-4832-94b5-f73043e4bd69", APIVersion:"sparkoperator.k8s.io/v1beta2", ResourceVersion:"1343488188", FieldPath:""}): type: 'Normal' reason: 'SparkApplicationDeleted' SparkApplication **-regular was deleted
I0810 13:30:35.985819      10 controller.go:188] SparkApplication namespace/**-regular was added, enqueuing it for submission
I0810 13:30:35.985869      10 controller.go:267] Starting processing key: "namespace/**-regular"
I0810 13:30:35.985929      10 driveringress.go:287] Creating a service **-regular-ui-svc for the Driver Ingress for application **-regular
I0

How can I make sure that my SparkApplication gets restarted when the driver fails? This is happening regularly.

Reproduction Code [Required]

Submit the SparkApplication to the Spark operator with:

  restartPolicy:
    onFailureRetryInterval: 10
    onSubmissionFailureRetries: 5
    onSubmissionFailureRetryInterval: 20
    type: Always

When the driver pod fails, the SparkApplication fails and is not submitted again.

Expected behavior

The Spark application should be restarted.

Actual behavior

The SparkApplication has a finish time and stays in the FAILING state.

Terminal Output Screenshot(s)

Environment & Versions

Additional context

TheDevilDan commented 1 week ago

I set the restartPolicy to type: Always, but the application never restarts on error or OOM kill.

And when I edit the driver configuration in Kubernetes, I see that the policy applied is: NEVER

Whereas in the SparkApplication, it's Always!

Tip: to restart the application, it sometimes works to open a shell in the pod and run exit 1. If it remains in error, delete the pod and it restarts. Otherwise, delete the SparkApplication and upgrade the chart if necessary, or re-apply the YAML.

On the SparkApplication: restartPolicy type: Always

On the driver config: restartPolicy type: Never (set automatically by the Spark operator; a bug?)
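The two values compared above live in different objects: the SparkApplication carries `spec.restartPolicy.type`, while the driver pod carries the standard Kubernetes pod field `spec.restartPolicy`. As far as I know, Spark on Kubernetes creates driver pods with restartPolicy Never, so the mismatch by itself may not indicate a bug; treat that as an assumption rather than something confirmed in this thread. A minimal sketch, with a hypothetical helper `restart_policies` and trimmed, illustrative manifests:

```python
def restart_policies(spark_app: dict, driver_pod: dict) -> dict:
    """Read the restart policy from each object so they can be compared side by side."""
    return {
        # Operator-level policy: governs whether the operator resubmits the application.
        "sparkApplication": spark_app.get("spec", {}).get("restartPolicy", {}).get("type"),
        # Pod-level policy: governs whether the kubelet restarts containers in place.
        "driverPod": driver_pod.get("spec", {}).get("restartPolicy"),
    }


# Trimmed, illustrative manifests matching the values reported above.
spark_app = {
    "kind": "SparkApplication",
    "spec": {"restartPolicy": {"type": "Always", "onFailureRetryInterval": 10}},
}
driver_pod = {"kind": "Pod", "spec": {"restartPolicy": "Never"}}

policies = restart_policies(spark_app, driver_pod)
print(policies)
```

Seeing Always on one side and Never on the other is therefore expected if restarts are implemented by the operator resubmitting the application rather than by the kubelet restarting the driver container.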