kubeflow / spark-operator

Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.

[BUG] SparkApplication fails to resubmit after entering the PENDING_RERUN state #2283

Closed: josecsotomorales closed this issue 1 month ago

josecsotomorales commented 1 month ago

Description

I’m encountering an issue with the Spark Operator where the SparkApplication fails to resubmit after entering the PENDING_RERUN state. The operator logs an error stating “failed to run spark-submit: driver pod already exist”, even though the driver pod was deleted. This issue prevents the application from restarting correctly.

•   ✋ I have searched the open/closed issues and my issue is not listed.

Reproduction Code [Required]

Steps to reproduce the behavior:

1.  Submit a SparkApplication to the Kubernetes cluster.
2.  Allow the application to reach the RUNNING state.
3.  Trigger an event that causes the application to enter the INVALIDATING state, e.g. by updating the application spec or deleting a pod (an illustrative snippet for this step follows the list).
4.  Observe that the application transitions to the PENDING_RERUN state.
5.  The operator attempts to resubmit the application but fails with the error “driver pod already exist”.
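
As a concrete illustration of step 3, the snippet below patches the SparkApplication's spec through a controller-runtime client, which should push it into the INVALIDATING state. This is an assumed helper, not from the original report: the name, namespace, and the patched `arguments` value are placeholders; only the `sparkoperator.k8s.io/v1beta2` GVK is taken from the CRD itself.

```go
// Hypothetical reproduction helper for step 3: merge-patch the SparkApplication's
// spec so the operator marks it INVALIDATING and then PENDING_RERUN.
// Name, namespace, and the patched value are placeholders.
package main

import (
	"context"
	"log"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/client/config"
)

func main() {
	c, err := client.New(config.GetConfigOrDie(), client.Options{})
	if err != nil {
		log.Fatalf("creating client: %v", err)
	}

	// Identify the SparkApplication from the reproduction steps.
	app := &unstructured.Unstructured{}
	app.SetGroupVersionKind(schema.GroupVersionKind{
		Group:   "sparkoperator.k8s.io",
		Version: "v1beta2",
		Kind:    "SparkApplication",
	})
	app.SetNamespace("default")
	app.SetName("sample-app-sample-spark")

	// Any spec change works; here we merge-patch the arguments list.
	patch := []byte(`{"spec":{"arguments":["rerun-trigger"]}}`)
	if err := c.Patch(context.Background(), app, client.RawPatch(types.MergePatchType, patch)); err != nil {
		log.Fatalf("patching SparkApplication: %v", err)
	}
	log.Println("patched; watch the app transition through INVALIDATING to PENDING_RERUN")
}
```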

Expected behavior

The Spark Operator should successfully resubmit the SparkApplication when it is in the PENDING_RERUN state, creating a new driver pod and continuing the execution of the application.

Actual behavior

The Spark Operator fails to resubmit the SparkApplication, logging an error:

Failed to run spark-submit: driver pod already exist

As a result, the application does not restart, and the driver pod remains in a failed state.

Terminal Output Screenshot(s)

2024-10-23T23:03:47.193Z ERROR sparkapplication/controller.go:260 Failed to submit SparkApplication {"name": "sample-app-sample-spark", "namespace": "default", "error": "failed to run spark-submit: driver pod already exist"} ...

2024-10-24T00:04:55.662Z ERROR sparkapplication/controller.go:409 Failed to run spark-submit {"name": "sample-app-sample-spark", "namespace": "default", "state": "PENDING_RERUN", "error": "failed to run spark-submit: driver pod already exist"} ...

Full Logs:

2024-10-23T23:03:47.159Z INFO sparkapplication/event_handler.go:188 SparkApplication updated {"name": "sample-app-sample-spark", "namespace": "default", "oldState": "", "newState": "SUBMITTED"}
2024-10-23T23:03:47.175Z INFO sparkapplication/event_handler.go:188 SparkApplication updated {"name": "sample-app-sample-spark", "namespace": "default", "oldState": "SUBMITTED", "newState": "SUBMITTED"}
2024-10-23T23:03:47.193Z ERROR sparkapplication/controller.go:260 Failed to submit SparkApplication {"name": "sample-app-sample-spark", "namespace": "default", "error": "failed to run spark-submit: driver pod already exist"}
github.com/kubeflow/spark-operator/internal/controller/sparkapplication.(*Reconciler).reconcileNewSparkApplication.func1
    /workspace/internal/controller/sparkapplication/controller.go:260
k8s.io/client-go/util/retry.OnError.func1
    /go/pkg/mod/k8s.io/client-go@v0.29.3/util/retry/util.go:51
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtection
    /go/pkg/mod/k8s.io/apimachinery@v0.29.3/pkg/util/wait/wait.go:145
k8s.io/apimachinery/pkg/util/wait.ExponentialBackoff
    /go/pkg/mod/k8s.io/apimachinery@v0.29.3/pkg/util/wait/backoff.go:461
k8s.io/client-go/util/retry.OnError
    /go/pkg/mod/k8s.io/client-go@v0.29.3/util/retry/util.go:50
k8s.io/client-go/util/retry.RetryOnConflict
    /go/pkg/mod/k8s.io/client-go@v0.29.3/util/retry/util.go:104
github.com/kubeflow/spark-operator/internal/controller/sparkapplication.(*Reconciler).reconcileNewSparkApplication
    /workspace/internal/controller/sparkapplication/controller.go:247
github.com/kubeflow/spark-operator/internal/controller/sparkapplication.(*Reconciler).Reconcile
    /workspace/internal/controller/sparkapplication/controller.go:179
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.5/pkg/internal/controller/controller.go:119
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.5/pkg/internal/controller/controller.go:316
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.5/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.5/pkg/internal/controller/controller.go:227
2024-10-23T23:03:47.215Z INFO sparkapplication/controller.go:171 Reconciling SparkApplication {"name": "sample-app-sample-spark", "namespace": "default", "state": "SUBMITTED"}
2024-10-23T23:03:48.006Z INFO sparkapplication/event_handler.go:84 Spark pod updated {"name": "sample-app-sample-spark-driver", "namespace": "default", "oldPhase": "Pending", "newPhase": "Running"}
2024-10-23T23:03:48.012Z INFO sparkapplication/controller.go:171 Reconciling SparkApplication {"name": "sample-app-sample-spark", "namespace": "default", "state": "SUBMITTED"}
2024-10-23T23:03:48.021Z INFO sparkapplication/event_handler.go:188 SparkApplication updated {"name": "sample-app-sample-spark", "namespace": "default", "oldState": "SUBMITTED", "newState": "RUNNING"}
2024-10-23T23:03:48.042Z INFO sparkapplication/controller.go:171 Reconciling SparkApplication {"name": "sample-app-sample-spark", "namespace": "default", "state": "RUNNING"}
2024-10-23T23:03:51.900Z INFO sparkapplication/event_handler.go:60 Spark pod created {"name": "sample-21c07c92bb9f476b-exec-1", "namespace": "default", "phase": "Pending"} ...

Environment & Versions

•   Spark Operator App version: v2.0.2
•   Helm Chart Version: 2.0.2
•   Kubernetes Version: v1.30
•   Apache Spark version: 3.5.3

Additional context

•   The issue occurs consistently under the given reproduction steps.
•   It appears that the operator does not properly clean up or recognize the state of the driver pod during a rerun.
•   Manually deleting the driver pod does not resolve the issue; the operator continues to report that the driver pod already exists.
•   This issue impacts our ability to automatically restart Spark applications upon failure.

This issue might be fixed in https://github.com/kubeflow/spark-operator/pull/2241

josecsotomorales commented 1 month ago

@ChenYi015 Happy to double-check if https://github.com/kubeflow/spark-operator/pull/2241 fixes this issue if we can get a new release in place 🚀

Tom-Newton commented 1 month ago

This does sound familiar, and I think my retry PR might have helped, because I haven't noticed this since I rolled it out. Regardless of whether it's still a problem, I was contemplating some ways to make deleting extra resources (driver pod, service, and ingress) more robust.

I don't really know what I'm talking about, but I guess there is no harm in sharing my thoughts:

  1.  Delete resources according to the name that the new SparkApplication wants to use, not just what is recorded in app.Status.DriverInfo.
  2.  When deleting extra resources, always check some ID in addition to the name. I think this would protect against race conditions where a SparkApplication name is reused shortly after a previous application with the same name completed. A rough sketch of both ideas follows below.
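
To make these ideas concrete, here is a rough Go sketch assuming a controller-runtime client. It is not the operator's actual code: the helper name is made up, and the `sparkoperator.k8s.io/app-name` label key is an assumption about the labels the operator applies to driver pods, so it should be verified against the running version.

```go
// Hypothetical sketch, not the operator's actual code. Deletes the driver pod
// that the *next* submission would create (expectedName), and only if it
// belongs to this SparkApplication, so a pod from an unrelated application
// that happens to reuse the name is left alone.
package cleanup

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func deleteDriverPodForRerun(ctx context.Context, c client.Client, namespace, expectedName, appName string) error {
	var pod corev1.Pod
	err := c.Get(ctx, types.NamespacedName{Namespace: namespace, Name: expectedName}, &pod)
	if apierrors.IsNotFound(err) {
		return nil // nothing to clean up, safe to resubmit
	}
	if err != nil {
		return fmt.Errorf("getting driver pod %s/%s: %w", namespace, expectedName, err)
	}

	// Check an identifying label before deleting (assumed label key).
	if pod.Labels["sparkoperator.k8s.io/app-name"] != appName {
		return fmt.Errorf("pod %s/%s exists but does not belong to SparkApplication %q; refusing to delete", namespace, expectedName, appName)
	}

	if err := c.Delete(ctx, &pod); err != nil && !apierrors.IsNotFound(err) {
		return fmt.Errorf("deleting driver pod %s/%s: %w", namespace, expectedName, err)
	}
	return nil
}
```

The same guard could be applied to the driver service and ingress before resubmitting; if the check fails, surfacing an error is safer than deleting a resource that might belong to another application.
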
ChenYi015 commented 1 month ago

> @ChenYi015 Happy to double-check if #2241 fixes this issue if we can get a new release in place 🚀

I have released v2.1.0-rc.0.

Actually, I cannot reproduce this issue with version v2.0.2. I can see that the Spark resources (driver pod, service) are deleted as expected when the app is in the INVALIDATING state.

josecsotomorales commented 1 month ago

Hey @Tom-Newton @ChenYi015, I did several tests on v2.1.0-rc.0, and I can confirm that this issue is resolved! Excellent work guys!! 🚀