kubeflow / spark-operator

Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.
Apache License 2.0

[FEATURE] Delete old resources and re-submit in a single reconcile when state is `FAILED_SUBMISSION` #2285

Open Tom-Newton opened 4 weeks ago

Tom-Newton commented 4 weeks ago

What is the outcome that you are trying to reach?

Improve performance of retries after submission failure.

Describe the solution you would like

If there are resources that need to be deleted, delete them and then immediately re-submit in the same reconcile (see the sketch below). Currently the operator does not re-submit until the next reconcile.
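
A minimal Go sketch of the proposed flow, not the operator's actual code: `Reconciler`, `SparkApplication`, `deleteSparkResources`, and `submitSparkApplication` are hypothetical stand-ins for whatever the controller really uses to clean up driver resources and run spark-submit.

```go
// Sketch only: these types and helpers are placeholders, not the operator's real API.
package controller

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
)

type SparkApplication struct{ Name, Namespace string } // placeholder for the real v1beta2 type

type Reconciler struct{}

// Placeholder helpers: delete leftover driver resources / run spark-submit.
func (r *Reconciler) deleteSparkResources(ctx context.Context, app *SparkApplication) error  { return nil }
func (r *Reconciler) submitSparkApplication(ctx context.Context, app *SparkApplication) error { return nil }

// reconcileFailedSubmission cleans up and re-submits in a single pass instead of
// stopping after the cleanup and waiting for the next reconcile to re-submit.
func (r *Reconciler) reconcileFailedSubmission(ctx context.Context, app *SparkApplication) (ctrl.Result, error) {
	// Delete anything left over from the failed submission attempt.
	if err := r.deleteSparkResources(ctx, app); err != nil {
		return ctrl.Result{}, err
	}
	// Re-submit immediately rather than waiting for another requeue.
	if err := r.submitSparkApplication(ctx, app); err != nil {
		return ctrl.Result{}, err
	}
	return ctrl.Result{}, nil
}
```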

Describe alternatives you have considered

Keep the current behaviour. This will likely perform worse.

Additional context

This idea came out of a discussion during review of PR https://github.com/kubeflow/spark-operator/pull/2241#discussion_r1810694623

josecsotomorales commented 3 weeks ago

We should also look at cases where the state gets stuck in `PENDING_RERUN`, which is usually caused by a cleanup -> submission race condition.
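
One hedged way to guard against that race, continuing the hypothetical `Reconciler` sketch above (`oldResourcesDeleted` is an assumed helper, not an existing operator function): only re-submit once the old resources are observably gone, and requeue otherwise.

```go
// Continues the hypothetical sketch above; requires "time" in the imports.
// oldResourcesDeleted is an assumed helper that reports whether the previous
// driver pod and related resources have actually been removed.
func (r *Reconciler) reconcilePendingRerun(ctx context.Context, app *SparkApplication) (ctrl.Result, error) {
	gone, err := r.oldResourcesDeleted(ctx, app)
	if err != nil {
		return ctrl.Result{}, err
	}
	if !gone {
		// Cleanup has not finished yet: check again shortly instead of racing it
		// by submitting while the old resources are still terminating.
		return ctrl.Result{RequeueAfter: 5 * time.Second}, nil
	}
	// Safe to re-submit now that the old resources are confirmed deleted.
	return r.reconcileFailedSubmission(ctx, app)
}
```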

josecsotomorales commented 3 weeks ago

I have a nice find from tracking down the `PENDING_RERUN` issue I was getting when restarting failed SparkApplications. I had my retry interval set to 5 seconds, and I think there is a race condition where 5 seconds isn't enough, so I increased the intervals to 10 seconds and that seems to fix the issue:

```yaml
  restartPolicy:
    type: Always
    onFailureRetries: 10
    onFailureRetryInterval: 10
    onSubmissionFailureRetries: 10
    onSubmissionFailureRetryInterval: 10
```

josecsotomorales commented 2 weeks ago

Increasing the retry interval helps, but the SparkApplication still gets into the `PENDING_RERUN` state from time to time.