FloraZhang opened this issue 3 years ago
We are hitting the same issue. Any progress on fixing this or insight into what was happening?
We are on spark-operator 3.1.1 if that helps. @FloraZhang any luck?
@djdillon Unfortunately I'm still having this issue from time to time.
Having the issue as well. @FloraZhang @djdillon have you found a resolution?
For us, OnFailureRetries was accidentally set to 0, and setting it to a large value helped in our case. In the SparkApplication of this ticket it is not set at all, and as an int32 it defaults to 0.
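To make that default concrete, here is a minimal Go sketch (a trimmed-down stand-in type, not the operator's actual API) showing why an omitted onFailureRetries field ends up as 0:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Trimmed-down stand-in for a restart policy, NOT the operator's real types:
// with a plain int32 field, omitting onFailureRetries in the manifest leaves
// the value at Go's zero default, i.e. 0 retries.
type restartPolicy struct {
	Type             string `json:"type,omitempty"`
	OnFailureRetries int32  `json:"onFailureRetries,omitempty"`
}

func main() {
	// Simulates a SparkApplication spec whose restartPolicy only sets the type.
	spec := []byte(`{"type": "OnFailure"}`)

	var p restartPolicy
	if err := json.Unmarshal(spec, &p); err != nil {
		panic(err)
	}
	fmt.Println(p.OnFailureRetries) // prints 0: the app is never retried on failure
}
```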
I'm also getting this issue; it seems like an invalid state when replacing an existing Spark app. Here are my operator logs:
I0328 22:07:16.115394 10 controller.go:223] SparkApplication default/develop-app-spark was updated, enqueuing it
I0328 22:07:16.115432 10 controller.go:263] Starting processing key: "default/develop-app-spark"
I0328 22:07:16.115510 10 event.go:282] Event(v1.ObjectReference{Kind:"SparkApplication", Namespace:"default", Name:"develop-app-spark", UID:"cdbb7f3f-9aa7-4b91-aac0-3ba12e4d6845", APIVersion:"sparkoperator.k8s.io/v1beta2", ResourceVersion:"40925388", FieldPath:""}): type: 'Normal' reason: 'SparkApplicationSpecUpdateProcessed' Successfully processed spec update for SparkApplication develop-app-spark
I0328 22:07:16.115631 10 controller.go:590] SparkApplication default/develop-app-spark is pending rerun
I0328 22:07:16.117242 10 controller.go:223] SparkApplication default/develop-app-spark was updated, enqueuing it
I0328 22:07:16.124532 10 controller.go:270] Ending processing key: "default/develop-app-spark"
I0328 22:07:16.124569 10 controller.go:263] Starting processing key: "default/develop-app-spark"
I0328 22:07:16.124610 10 controller.go:858] Deleting pod develop-app-spark-driver in namespace default
I0328 22:07:16.136951 10 controller.go:822] Update the status of SparkApplication default/develop-app-spark from:
{
"submissionID": "eff438ec-dc5c-4052-9654-2757ef2bab09",
"lastSubmissionAttemptTime": null,
"terminationTime": null,
"driverInfo": {},
"applicationState": {
"state": "INVALIDATING"
}
}
to:
{
"submissionID": "eff438ec-dc5c-4052-9654-2757ef2bab09",
"lastSubmissionAttemptTime": null,
"terminationTime": null,
"driverInfo": {},
"applicationState": {
"state": "PENDING_RERUN"
}
}
I0328 22:07:16.138947 10 spark_pod_eventhandler.go:58] Pod develop-app-spark-driver updated in namespace default.
I0328 22:07:16.155989 10 controller.go:223] SparkApplication default/develop-app-spark was updated, enqueuing it
I0328 22:07:16.160913 10 controller.go:270] Ending processing key: "default/develop-app-spark"
I0328 22:07:16.160968 10 controller.go:263] Starting processing key: "default/develop-app-spark"
I0328 22:07:16.161039 10 controller.go:590] SparkApplication default/develop-app-spark is pending rerun
I0328 22:07:16.167172 10 controller.go:270] Ending processing key: "default/develop-app-spark"
It's getting stuck in the PENDING_RERUN state:
kubectl get sparkapplications
NAME STATUS ATTEMPTS START FINISH AGE
develop-app-spark PENDING_RERUN <no value> <no value> 6d
We are also hitting this issue in one of our environments. It seems suspicious that "driverInfo": {} is not populated, and that executorState is not set at all. Is that the root cause?
Are there any plans to fix this issue any time soon?
Hi, does anyone know if there is a Prometheus metric for the PENDING_RERUN state?
For anyone else having this issue, I believe it's due to resource deletion not being finalised after all onFailureRetries have been exhausted. Once the retries have been exhausted, the app transitions to PENDING_RERUN and the controller checks whether the application resources have been cleaned up. However, I think there's a silent failure on the line I've linked below: if validateSparkResourceDeletion returns false, the application is not re-submitted, the state is not transitioned, and no error is logged or surfaced either.
This issue has been transient and I haven't had much luck reliably reproducing it so far, but I suspect you could simply increase onFailureRetries and onFailureRetriesInterval so the controller retries for longer before checking for resource deletion.
Thinking about how to fix this properly: I would expect errors from inside the validateSparkResourceDeletion function to be surfaced in the logs, and the app to transition to a FAILED state if this occurs.
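Roughly what I have in mind, as a hypothetical sketch with made-up types (this is not the controller's actual code; the callback merely stands in for validateSparkResourceDeletion):

```go
package main

import "log"

// appStatus and sparkApp are illustrative stand-ins, not the operator's API types.
type appStatus struct {
	State        string
	ErrorMessage string
}

type sparkApp struct {
	Namespace, Name string
	Status          appStatus
}

// handlePendingRerun mimics the PENDING_RERUN branch of a sync loop:
// if cleanup cannot be confirmed, log the problem and fail the app
// instead of silently leaving it stuck.
func handlePendingRerun(app *sparkApp, resourcesDeleted func(*sparkApp) bool) {
	if resourcesDeleted(app) {
		// Cleanup confirmed: the controller would now resubmit the app.
		app.Status.State = "SUBMITTED"
		return
	}
	// Proposed change: surface the failure instead of returning without a trace.
	log.Printf("SparkApplication %s/%s: stale resources were not cleaned up; transitioning to FAILED",
		app.Namespace, app.Name)
	app.Status.State = "FAILED"
	app.Status.ErrorMessage = "driver resources could not be deleted before rerun"
}

func main() {
	app := &sparkApp{Namespace: "default", Name: "develop-app-spark"}
	handlePendingRerun(app, func(*sparkApp) bool { return false })
	log.Printf("final state: %s", app.Status.State)
}
```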
Hi, has anyone found a workaround to handle this? I'm thinking of using a cron job to check for stuck apps and restart them automatically.
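In case it helps, a rough sketch of such a check using client-go's dynamic client (the namespace and the in-cluster setup are assumptions, and the actual remediation step is left out):

```go
package main

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/rest"
)

// GVR matching the apiVersion seen in the logs (sparkoperator.k8s.io/v1beta2).
var sparkAppGVR = schema.GroupVersionResource{
	Group:    "sparkoperator.k8s.io",
	Version:  "v1beta2",
	Resource: "sparkapplications",
}

func main() {
	// Assumes the job runs inside the cluster with a service account that can
	// list SparkApplications; the "default" namespace is a placeholder.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	apps, err := client.Resource(sparkAppGVR).Namespace("default").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}
	for _, app := range apps.Items {
		state, _, _ := unstructured.NestedString(app.Object, "status", "applicationState", "state")
		if state != "PENDING_RERUN" {
			continue
		}
		// Flag the stuck app; the remediation (deleting and re-applying the
		// SparkApplication, or restarting the operator pod) is left to the job.
		log.Printf("SparkApplication %s/%s is stuck in PENDING_RERUN", app.GetNamespace(), app.GetName())
	}
}
```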
Hello expert,
I ran into this issue when starting multiple Spark applications with the spark-operator: sometimes one or more SparkApplications wouldn't start. Running 'kubectl describe sparkapplication' shows that the driver pod was never started and the application stays stuck in the PENDING_RERUN status. My current workaround is to delete the active spark-operator pod and force the new pod to reschedule the driver pod.
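For reference, the workaround as a small client-go sketch; the operator namespace and label selector here are assumptions for illustration and depend on how the operator was installed:

```go
package main

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// Assumes in-cluster credentials with permission to delete pods in the
	// operator namespace; "spark-operator" and the label selector are placeholders.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Deleting the operator pod makes its Deployment schedule a replacement,
	// which re-reconciles the stuck SparkApplications.
	err = clientset.CoreV1().Pods("spark-operator").DeleteCollection(
		context.TODO(),
		metav1.DeleteOptions{},
		metav1.ListOptions{LabelSelector: "app.kubernetes.io/name=spark-operator"},
	)
	if err != nil {
		log.Fatal(err)
	}
	log.Println("spark-operator pod deleted; a new pod will be scheduled")
}
```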
I didn't see any errors in the operator log, so I'm confused about what is happening under the hood.
Thanks, Wenjing