Closed josecsotomorales closed 1 month ago
@ChenYi015 Happy to double-check if https://github.com/kubeflow/spark-operator/pull/2241 fixes this issue if we can get a new release in place 🚀
This does sound familiar, and I think my retry PR might have helped, because I haven't noticed this since I rolled that out. Regardless of whether it's still a problem, I was contemplating some ways to make deleting the extra resources (driver pod, service, and ingress) more robust.
I don't really know what I'm talking about but I guess there is no harm in sharing my thoughts:
app.Status.DriverInfo.
@ChenYi015 Happy to double-check if #2241 fixes this issue if we can get a new release in place 🚀
Have released v2.1.0-rc.0.
Actually, I cannot reproduce this issue with version v2.0.2. I can see that the Spark resources (driver pod, service) are deleted as expected when the app is in the invalidating state.
Hey @Tom-Newton @ChenYi015, I did several tests on v2.1.0-rc.0, and I can confirm that this issue is resolved! Excellent work guys!! 🚀
Description
I’m encountering an issue with the Spark Operator where the SparkApplication fails to resubmit after entering the PENDING_RERUN state. The operator logs an error stating “failed to run spark-submit: driver pod already exist”, even though the driver pod was deleted. This issue prevents the application from restarting correctly.
Reproduction Code [Required]
Steps to reproduce the behavior:
Expected behavior
The Spark Operator should successfully resubmit the SparkApplication when it is in the PENDING_RERUN state, creating a new driver pod and continuing the execution of the application.
Actual behavior
The Spark Operator fails to resubmit the SparkApplication, logging an error:
Failed to run spark-submit: driver pod already exist
As a result, the application does not restart, and the driver pod remains in a failed state.
Terminal Output Screenshot(s)
2024-10-23T23:03:47.193Z ERROR sparkapplication/controller.go:260 Failed to submit SparkApplication {"name": "sample-app-sample-spark", "namespace": "default", "error": "failed to run spark-submit: driver pod already exist"} ...
2024-10-24T00:04:55.662Z ERROR sparkapplication/controller.go:409 Failed to run spark-submit {"name": "sample-app-sample-spark", "namespace": "default", "state": "PENDING_RERUN", "error": "failed to run spark-submit: driver pod already exist"} ...
Full Logs:
2024-10-23T23:03:47.159Z INFO sparkapplication/event_handler.go:188 SparkApplication updated {"name": "sample-app-sample-spark", "namespace": "default", "oldState": "", "newState": "SUBMITTED"}
2024-10-23T23:03:47.175Z INFO sparkapplication/event_handler.go:188 SparkApplication updated {"name": "sample-app-sample-spark", "namespace": "default", "oldState": "SUBMITTED", "newState": "SUBMITTED"}
2024-10-23T23:03:47.193Z ERROR sparkapplication/controller.go:260 Failed to submit SparkApplication {"name": "sample-app-sample-spark", "namespace": "default", "error": "failed to run spark-submit: driver pod already exist"}
github.com/kubeflow/spark-operator/internal/controller/sparkapplication.(*Reconciler).reconcileNewSparkApplication.func1
    /workspace/internal/controller/sparkapplication/controller.go:260
k8s.io/client-go/util/retry.OnError.func1
    /go/pkg/mod/k8s.io/client-go@v0.29.3/util/retry/util.go:51
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtection
    /go/pkg/mod/k8s.io/apimachinery@v0.29.3/pkg/util/wait/wait.go:145
k8s.io/apimachinery/pkg/util/wait.ExponentialBackoff
    /go/pkg/mod/k8s.io/apimachinery@v0.29.3/pkg/util/wait/backoff.go:461
k8s.io/client-go/util/retry.OnError
    /go/pkg/mod/k8s.io/client-go@v0.29.3/util/retry/util.go:50
k8s.io/client-go/util/retry.RetryOnConflict
    /go/pkg/mod/k8s.io/client-go@v0.29.3/util/retry/util.go:104
github.com/kubeflow/spark-operator/internal/controller/sparkapplication.(*Reconciler).reconcileNewSparkApplication
    /workspace/internal/controller/sparkapplication/controller.go:247
github.com/kubeflow/spark-operator/internal/controller/sparkapplication.(*Reconciler).Reconcile
    /workspace/internal/controller/sparkapplication/controller.go:179
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.5/pkg/internal/controller/controller.go:119
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.5/pkg/internal/controller/controller.go:316
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.5/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.5/pkg/internal/controller/controller.go:227
2024-10-23T23:03:47.215Z INFO sparkapplication/controller.go:171 Reconciling SparkApplication {"name": "sample-app-sample-spark", "namespace": "default", "state": "SUBMITTED"}
2024-10-23T23:03:48.006Z INFO sparkapplication/event_handler.go:84 Spark pod updated {"name": "sample-app-sample-spark-driver", "namespace": "default", "oldPhase": "Pending", "newPhase": "Running"}
2024-10-23T23:03:48.012Z INFO sparkapplication/controller.go:171 Reconciling SparkApplication {"name": "sample-app-sample-spark", "namespace": "default", "state": "SUBMITTED"}
2024-10-23T23:03:48.021Z INFO sparkapplication/event_handler.go:188 SparkApplication updated {"name": "sample-app-sample-spark", "namespace": "default", "oldState": "SUBMITTED", "newState": "RUNNING"}
2024-10-23T23:03:48.042Z INFO sparkapplication/controller.go:171 Reconciling SparkApplication {"name": "sample-app-sample-spark", "namespace": "default", "state": "RUNNING"}
2024-10-23T23:03:51.900Z INFO sparkapplication/event_handler.go:60 Spark pod created {"name": "sample-21c07c92bb9f476b-exec-1", "namespace": "default", "phase": "Pending"}
...
Environment & Versions
Additional context
This issue might be fixed in https://github.com/kubeflow/spark-operator/pull/2241