Tom-Newton opened this issue 4 weeks ago
We should also look at the cases where the state gets stuck in PENDING_RERUN, usually caused by a cleanup -> submission race condition.

I have a nice find on tracking the PENDING_RERUN issue I was getting when restarting failed SparkApplications. I had my retry intervals set to 5 seconds, and I think there's a race condition where 5 seconds isn't enough, so I increased them to 10 seconds and that seems to fix the issue:
restartPolicy:
  type: Always
  onFailureRetries: 10
  onFailureRetryInterval: 10            # interval in seconds
  onSubmissionFailureRetries: 10
  onSubmissionFailureRetryInterval: 10  # interval in seconds
Increasing the retry interval helps, but the SparkApplication still gets into the PENDING_RERUN state from time to time.
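To make the race described above more concrete, here is a toy sketch in Go (not the operator's actual code; all names and timings are invented for illustration): if the retry fires before the cleanup of the previous run's resources has finished, re-submission is skipped and the application sits in PENDING_RERUN until a later reconcile.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	cleanupDone := make(chan struct{})

	// Pretend deleting the previous run's driver pod and services takes ~700ms
	// (timings scaled down from seconds just for this example).
	go func() {
		time.Sleep(700 * time.Millisecond)
		close(cleanupDone)
	}()

	retryInterval := 500 * time.Millisecond // stands in for "5 seconds": too short, loses the race
	time.Sleep(retryInterval)

	select {
	case <-cleanupDone:
		fmt.Println("old resources gone: safe to re-submit")
	default:
		fmt.Println("old resources still present: stays in PENDING_RERUN until a later reconcile")
	}
}
```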
What is the outcome that you are trying to reach?
Improve performance of retries after submission failure.
Describe the solution you would like
If there are resources that need to be deleted, delete them and then immediately re-submit within the same reconcile. Currently the operator does not re-submit until the next reconcile.
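A minimal sketch of the proposed flow, using hypothetical stand-in types and helpers (app, deleteLeftovers, submit) rather than the spark-operator's real code: clean up any leftover resources and, if that succeeds, re-submit in the same reconcile instead of waiting for the next one.

```go
package main

import "fmt"

type appState string

const (
	statePendingRerun appState = "PENDING_RERUN"
	stateSubmitted    appState = "SUBMITTED"
)

// app is a stand-in for the SparkApplication object handled by the operator.
type app struct {
	name              string
	state             appState
	leftoverResources bool // e.g. the previous run's driver pod still exists
}

// Hypothetical helpers; the real operator has its own cleanup and submission code.
func deleteLeftovers(a *app) error { a.leftoverResources = false; return nil }
func submit(a *app) error          { a.state = stateSubmitted; return nil }

// reconcilePendingRerun deletes any leftover resources and, instead of waiting
// for the next reconcile, re-submits immediately once the cleanup has succeeded.
func reconcilePendingRerun(a *app) error {
	if a.leftoverResources {
		if err := deleteLeftovers(a); err != nil {
			return err // stay in PENDING_RERUN; requeue and retry later
		}
	}
	// Proposed change: submit right away in the same reconcile.
	return submit(a)
}

func main() {
	a := &app{name: "spark-pi", state: statePendingRerun, leftoverResources: true}
	if err := reconcilePendingRerun(a); err != nil {
		fmt.Println("requeue after error:", err)
		return
	}
	fmt.Println(a.name, "->", a.state) // spark-pi -> SUBMITTED
}
```

The trade-off is that a single reconcile does slightly more work, but the application no longer sits in PENDING_RERUN waiting for the next reconcile before it can be re-submitted.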
Describe alternatives you have considered
Keep it as is. This will likely perform worse.
Additional context
This idea came from a discussion during PR review: https://github.com/kubeflow/spark-operator/pull/2241#discussion_r1810694623