kubeflow / spark-operator

Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.
Apache License 2.0

[FEATURE] Delete old resources and re-submit in a single reconcile when state is `FAILED_SUBMISSION` #2285

Open Tom-Newton opened 4 weeks ago

Tom-Newton commented 4 weeks ago

What is the outcome that you are trying to reach?

Improve performance of retries after submission failure.

Describe the solution you would like

If there are resources that need to be deleted, delete them and then immediately re-submit in the same reconcile (see the sketch below). Currently the operator does not re-submit until the next reconcile.
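
A minimal Go sketch of the proposed flow, not the operator's actual code: `Reconciler`, `SparkApplication`, `deleteSparkResources`, and `submitSparkApplication` are hypothetical stand-ins for whatever the controller really uses to clean up driver resources and run spark-submit.

```go
// Sketch only: these types and helpers are placeholders, not the operator's real API.
package controller

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
)

type SparkApplication struct{ Name, Namespace string } // placeholder for the real v1beta2 type

type Reconciler struct{}

// Placeholder helpers: delete leftover driver resources / run spark-submit.
func (r *Reconciler) deleteSparkResources(ctx context.Context, app *SparkApplication) error  { return nil }
func (r *Reconciler) submitSparkApplication(ctx context.Context, app *SparkApplication) error { return nil }

// reconcileFailedSubmission cleans up and re-submits in a single pass instead of
// stopping after the cleanup and waiting for the next reconcile to re-submit.
func (r *Reconciler) reconcileFailedSubmission(ctx context.Context, app *SparkApplication) (ctrl.Result, error) {
	// Delete anything left over from the failed submission attempt.
	if err := r.deleteSparkResources(ctx, app); err != nil {
		return ctrl.Result{}, err
	}
	// Re-submit immediately rather than waiting for another requeue.
	if err := r.submitSparkApplication(ctx, app); err != nil {
		return ctrl.Result{}, err
	}
	return ctrl.Result{}, nil
}
```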

Describe alternatives you have considered

Keep the current behaviour. This will likely perform worse.

Additional context

This idea came out of a discussion during review of PR https://github.com/kubeflow/spark-operator/pull/2241#discussion_r1810694623

josecsotomorales commented 3 weeks ago

We should also look at cases where the state gets stuck in `PENDING_RERUN`, which is usually caused by a cleanup -> submission race condition.
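
One hedged way to guard against that race, continuing the hypothetical `Reconciler` sketch above (`oldResourcesDeleted` is an assumed helper, not an existing operator function): only re-submit once the old resources are observably gone, and requeue otherwise.

```go
// Continues the hypothetical sketch above; requires "time" in the imports.
// oldResourcesDeleted is an assumed helper that reports whether the previous
// driver pod and related resources have actually been removed.
func (r *Reconciler) reconcilePendingRerun(ctx context.Context, app *SparkApplication) (ctrl.Result, error) {
	gone, err := r.oldResourcesDeleted(ctx, app)
	if err != nil {
		return ctrl.Result{}, err
	}
	if !gone {
		// Cleanup has not finished yet: check again shortly instead of racing it
		// by submitting while the old resources are still terminating.
		return ctrl.Result{RequeueAfter: 5 * time.Second}, nil
	}
	// Safe to re-submit now that the old resources are confirmed deleted.
	return r.reconcileFailedSubmission(ctx, app)
}
```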

josecsotomorales commented 3 weeks ago

I have a nice find from tracking down the `PENDING_RERUN` issue I was getting when restarting failed SparkApplications. I had my retry interval set to 5 seconds, and I think there is a race condition where 5 seconds isn't enough, so I increased the intervals to 10 seconds and that seems to fix the issue:

```yaml
  restartPolicy:
    type: Always
    onFailureRetries: 10
    onFailureRetryInterval: 10
    onSubmissionFailureRetries: 10
    onSubmissionFailureRetryInterval: 10
```

josecsotomorales commented 2 weeks ago

Increasing the retry interval helps, but the SparkApplication still gets into the `PENDING_RERUN` state from time to time.