Closed sethsaperstein-lyft closed 1 year ago
/PTAL @premsantosh @leoluoInSea @maghamravi
The original scope of this change was to handle 2 cases:
The original solution involved an extra state in the state graph for submitting without state. This state would be reached in the above two cases on upgrade as such:
After further review, it appears that case 2 can be solved by enabling the option allowNonRestoredState
. This leaves only the first case to be handled in which the savepoint fails and no externalized checkpoints are available. In this case, we do not need a different state and instead can reuse the state submittingJob
which will proceed without an empty savepointPath
. This solves the issue while simplifying the solution. Consulted @maghamravi who agrees.
The logic looks good. Wondering if the duplication of lines can be avoided. Something like
failDeploy := false if err !=nil { failDeploy = true } else path =="" { ... failDeploy = true } if failDeploy { if FallbackWithoutState { failDeploy = false } } if failDeploy { s.deplpyFailed } submitjob
agreed. refactored
overview
Add CRD parameter
fallbackWithoutState
to be enabled to allow deploys to occur without a savepoint or checkpoint in the case that the savepoint fails and there are no usable externalized checkpoints.In the case that the savepoint and checkpoints are failing, this has a similar effect as savepointDisabled. The benefit of this configuration option is that it only exhibits this behavior when there is no other path forward and thus acts as a fallback without potential intervention. This is disabled by default.
The state graph does not change, but rather, the recovering state that attempts to find an externalized checkpoint forwards to submittingJob in the case that this is enabled and there is no externalized checkpoint