lyft / flinkk8soperator

Kubernetes operator that provides control plane for managing Apache Flink applications
Apache License 2.0
569 stars 159 forks source link

[STRMCMP-1658] Enable Deploys to Fallback without State #287

Closed sethsaperstein-lyft closed 1 year ago

sethsaperstein-lyft commented 1 year ago

overview

Add CRD parameter fallbackWithoutState to be enabled to allow deploys to occur without a savepoint or checkpoint in the case that the savepoint fails and there are no usable externalized checkpoints.

In the case that the savepoint and checkpoints are failing, this has a similar effect as savepointDisabled. The benefit of this configuration option is that it only exhibits this behavior when there is no other path forward and thus acts as a fallback without potential intervention. This is disabled by default.

The state graph does not change, but rather, the recovering state that attempts to find an externalized checkpoint forwards to submittingJob in the case that this is enabled and there is no externalized checkpoint

sethsaperstein-lyft commented 1 year ago

/PTAL @premsantosh @leoluoInSea @maghamravi

sethsaperstein-lyft commented 1 year ago

additional info

The original scope of this change was to handle 2 cases:

  1. Savepoints and checkpoints are failing and we would like to start without state
  2. Submitting job is failing and would otherwise succeed if not using a savepoint (ex: job graph change)

The original solution involved an extra state in the state graph for submitting without state. This state would be reached in the above two cases on upgrade as such:

  1. savepointing -> recovering -> submittingJobWithoutState
  2. savepointing -> submittingJob -> submittingJobWithoutState

After further review, it appears that case 2 can be solved by enabling the option allowNonRestoredState. This leaves only the first case to be handled in which the savepoint fails and no externalized checkpoints are available. In this case, we do not need a different state and instead can reuse the state submittingJob which will proceed without an empty savepointPath. This solves the issue while simplifying the solution. Consulted @maghamravi who agrees.

sethsaperstein-lyft commented 1 year ago

The logic looks good. Wondering if the duplication of lines can be avoided. Something like

failDeploy := false

if err !=nil  {
  failDeploy = true

} else path =="" {
  ...
  failDeploy = true
}

if failDeploy {
  if FallbackWithoutState {
  failDeploy = false
}
}

if failDeploy {
  s.deplpyFailed
}

submitjob

agreed. refactored