lyft / flinkk8soperator

Kubernetes operator that provides control plane for managing Apache Flink applications
Apache License 2.0
569 stars 159 forks source link

Non-streaming jobs, Beam and Checkpointing #127

Open prkadalb opened 4 years ago

prkadalb commented 4 years ago

Hello,

I'm trying to use the flinkk8soperator with Beam (with Flink being the runner).

The operator is able to launch the Job Manager and Task manager pods and can submit the job as well. It works fine for streaming applications.

However, when I try to run a batch application, it turns out that Beam does not enable checkpointing in Flink.

The k8s operator, however, assumes that checkpointing is turned on, and throws an error as the checkpoint API returns a HTTP 404. https://github.com/lyft/flinkk8soperator/blob/f499e7f2ff5c2f7b2e84e458d08ffdb1df2d22b9/pkg/controller/flink/flink.go#L535

https://github.com/apache/flink/blob/7aafb248770070f0fc1bb2bd49d7bbffbb873699/flink-runtime/src/main/java/org/apache/flink/runtime/rest/handler/job/checkpoints/CheckpointingStatisticsHandler.java#L94

https://github.com/apache/beam/blob/7b3a3fa6c9291692b56dbc358dfc075724b993b6/runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkExecutionEnvironments.java#L223

Is it possible to let the operator know somehow that checkpoints are not enabled, and that a 404 error on the checkpoint API is not fatal?

Thanks!

anandswaminathan commented 4 years ago

cc @tweise

tweise commented 4 years ago

@anandswaminathan we should support jobs that don't enable checkpointing. There are also streaming use cases where it makes sense to not enable checkpointing.

anandswaminathan commented 4 years ago

@tweise @prkadalb

We can definitely find a way to indicate that. Also I believe there is a small bug with respect to deletion of Finished jobs as well.

What do you think is the best way for the operator to identify that - a job is batch job and that checkpointing is disabled. Also if you have ideas - feel free to submit a PR. @mwylde and I would be happy to review.

tweise commented 4 years ago

https://github.com/lyft/flinkk8soperator/issues/138

tweise commented 4 years ago

Is this still an issue? We are using the operator with a Beam streaming job that does not have checkpointing enabled and also recently added the option to skip savepoint during upgrade: https://github.com/lyft/flinkk8soperator/pull/184