lyft / flinkk8soperator

Kubernetes operator that provides control plane for managing Apache Flink applications
Apache License 2.0

support for checkpoint-based updates #197

Open davidbirdsong opened 4 years ago

davidbirdsong commented 4 years ago

Our Flink job deploys rely heavily on checkpoints, since our savepoints take around 30-45 minutes to write and then to read back in when the new job starts.

It appears that enabling savepointDisabled gets us part of the way there, and that mechanisms already exist for relying on checkpoints to recover a failing job.

We set ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION from within our job, and we'd really like to be able to rely only on checkpoints when updating jobs.
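For context, this is roughly how we retain checkpoints inside the job. A minimal sketch; the class name and checkpoint interval are placeholders, not our actual settings:

```java
import org.apache.flink.streaming.api.environment.CheckpointConfig.ExternalizedCheckpointCleanup;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointRetentionExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint every 60s (placeholder interval).
        env.enableCheckpointing(60_000L);

        // Keep externalized checkpoints around when the job is cancelled,
        // so a redeploy can restore from the latest retained checkpoint.
        env.getCheckpointConfig().enableExternalizedCheckpoints(
                ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

        // ... job graph definition and env.execute(...) go here.
    }
}
```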

The way I envision still supporting savepoints, say for when we need to change parallelism, would be to submit a new job spec with savepointDisabled turned off, so that the next job update goes through a savepoint.
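To make that toggle concrete, here is a rough sketch of what I mean at the spec level. The apiVersion, kind, metadata, and comments are illustrative placeholders; savepointDisabled is the only field this issue is about:

```yaml
apiVersion: flink.k8s.io/v1beta1   # illustrative; whichever CRD version is in use
kind: FlinkApplication
metadata:
  name: my-streaming-job           # placeholder name
spec:
  # Day-to-day: updates restore from the latest retained checkpoint.
  savepointDisabled: true
  # Before an update that changes parallelism, flip this to false
  # (or drop the field) so that update goes through a savepoint:
  # savepointDisabled: false
  # ...rest of the spec unchanged
```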

I'm happy to work on this if a PR would be accepted.

glaksh100 commented 4 years ago

@davidbirdsong Thanks for posting this idea!

I think this would be a cool feature. We already support part of this behavior in the Recovering state, except that in the current state machine the Recovering state is only ever reached when a savepoint fails. We could reuse some of that behavior to explicitly support a checkpoint-based update.

cc/ @mwylde