lyft / flinkk8soperator

Kubernetes operator that provides control plane for managing Apache Flink applications
Apache License 2.0

support for checkpoint-based updates #197

Open davidbirdsong opened 4 years ago

davidbirdsong commented 4 years ago

Our Flink job deploys rely heavily on checkpoints, since our savepoints take around 30-45 minutes to write and then to read back in when the new job starts.

It appears that enabling savepointDisabled gets us part of the way there, and that mechanisms already exist for relying on checkpoints to recover a failing job.

We set ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION from within our job, and we'd really like to be able to rely only on checkpoints when updating jobs.
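For context, this is roughly how we retain checkpoints inside the job. A minimal sketch; the class name and checkpoint interval are placeholders, not our actual settings:

```java
import org.apache.flink.streaming.api.environment.CheckpointConfig.ExternalizedCheckpointCleanup;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointRetentionExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint every 60s (placeholder interval).
        env.enableCheckpointing(60_000L);

        // Keep externalized checkpoints around when the job is cancelled,
        // so a redeploy can restore from the latest retained checkpoint.
        env.getCheckpointConfig().enableExternalizedCheckpoints(
                ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

        // ... job graph definition and env.execute(...) go here.
    }
}
```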

The way I envision still supporting savepoints, say for when we need to change parallelism, would be to submit a new job spec with savepointDisabled turned off, so that the next job update goes through a savepoint.
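To make that toggle concrete, here is a rough sketch of what I mean at the spec level. The apiVersion, kind, metadata, and comments are illustrative placeholders; savepointDisabled is the only field this issue is about:

```yaml
apiVersion: flink.k8s.io/v1beta1   # illustrative; whichever CRD version is in use
kind: FlinkApplication
metadata:
  name: my-streaming-job           # placeholder name
spec:
  # Day-to-day: updates restore from the latest retained checkpoint.
  savepointDisabled: true
  # Before an update that changes parallelism, flip this to false
  # (or drop the field) so that update goes through a savepoint:
  # savepointDisabled: false
  # ...rest of the spec unchanged
```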

I'm happy to work on this if a PR would be accepted.

glaksh100 commented 4 years ago

@davidbirdsong Thanks for posting this idea!

I think this would be a cool feature. We already support part of this behavior in the Recovering state, except that in the current state machine the Recovering state is only ever reached when a savepoint fails. We could reuse some of that behavior to explicitly support a checkpoint-based update.

cc/ @mwylde