davidbirdsong opened 4 years ago
@davidbirdsong Thanks for posting this idea!
I think this would be a cool feature. We support part of this behavior in the Recovering state, except that in the current state machine the Recovering state is only ever reached when a savepoint fails. We could re-use some of that behavior to explicitly support a checkpoint-based update.
cc/ @mwylde
Our Flink job deploys rely heavily on checkpoints, since our savepoints take around 30–45 minutes to write and to read back in on the new job.
It appears that enabling `savepointDisabled` gets us part of the way there, and that mechanisms already exist for relying on checkpoints to recover a failing job. We set `ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION` from within our job, and we'd really like to be able to rely only on checkpoints to update jobs.

The way I envision still supporting savepoints, say for when we need to change parallelism, would be to submit a new job with `savepointDisabled` turned off, so that the next job update would use savepoints.

I'm happy to work on this if a PR would be accepted.
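For reference, this is roughly how we retain externalized checkpoints from inside the job. It's a minimal sketch: the class name, checkpoint interval, placeholder pipeline, and job name are illustrative, not our actual configuration.

```java
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointOnlyJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint periodically; the 60s interval here is illustrative.
        env.enableCheckpointing(60_000L);

        // Retain the externalized checkpoint when the job is cancelled, so the
        // next deploy can restore from the most recent checkpoint instead of a savepoint.
        env.getCheckpointConfig().enableExternalizedCheckpoints(
                CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

        // Placeholder pipeline; the real job graph goes here.
        env.fromElements(1, 2, 3).print();

        env.execute("checkpoint-only-deploy-example");
    }
}
```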