Flink Native HA on Kubernetes is not supported

lyft / flinkk8soperator

Kubernetes operator that provides control plane for managing Apache Flink applications

Apache License 2.0


Flink Native HA on Kubernetes is not supported #243

Open borah-hemanga opened 2 years ago

borah-hemanga commented 2 years ago

I tried out the native HA on Kubernetes using the operator.

Here is the general synopsis:

If I start a fresh deployment, it succeeds and the application comes up without problems.
If I try to update an existing application, the deployment fails.

The deployment (update) of an existing application goes through the following steps:

1. New job manager and task manager pods for the new deployment are created.
2. The old job is canceled on the old pods.
3. The new job manager tries to start the job and prints "Submitting Job with JobId=<>", but fails repeatedly with "The connection was unexpectedly closed by the client."
4. The old job manager eventually starts a job with a new job ID.
5. The new pods are destroyed and the old cluster continues running with the old code.

Has anyone successfully used Flink's native Kubernetes HA with this FlinkK8sOperator?

nikolasten commented 2 years ago

You will need to change the kubernetes.cluster-id config (increment it or use the current timestamp) every time you deploy a Flink app, that is, on any FlinkApplication config change. That way, when the operator starts the upgrade and the new cluster comes up, the new cluster won't behave as a failover of the cluster you are already running.
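A rough sketch of how the per-deploy ID could be generated automatically rather than bumped by hand. This is not operator code. Instead of a counter or timestamp it hashes the application spec, so the ID changes whenever the config changes and stays the same otherwise. The function name, the spec layout (a dict with a flinkConfig map), and the 45-character cap are assumptions made for illustration.

```python
import hashlib
import json
import re

CLUSTER_ID_KEY = "kubernetes.cluster-id"
# Assumption: a conservative length cap for the cluster ID; check the limit of your Flink version.
MAX_CLUSTER_ID_LEN = 45


def with_fresh_cluster_id(app_name: str, spec: dict) -> dict:
    """Return a copy of spec whose flinkConfig carries a cluster ID derived from the spec.

    Hypothetical helper: `spec` stands in for the FlinkApplication spec as a plain dict.
    """
    # Drop any previous cluster ID so it does not feed back into the hash.
    flink_config = {
        key: value
        for key, value in spec.get("flinkConfig", {}).items()
        if key != CLUSTER_ID_KEY
    }
    hashed_view = dict(spec, flinkConfig=flink_config)
    digest = hashlib.sha256(
        json.dumps(hashed_view, sort_keys=True).encode("utf-8")
    ).hexdigest()[:8]

    # Kubernetes resource names allow lowercase alphanumerics and '-'.
    base = re.sub(r"[^a-z0-9-]", "-", app_name.lower()).strip("-") or "flink"
    base = base[: MAX_CLUSTER_ID_LEN - len(digest) - 1].rstrip("-")

    new_config = dict(flink_config)
    new_config[CLUSTER_ID_KEY] = f"{base}-{digest}"
    return dict(spec, flinkConfig=new_config)
```

Calling with_fresh_cluster_id("my-app", spec) before applying the FlinkApplication would give every changed spec its own kubernetes.cluster-id, while re-applying an unchanged spec keeps the same ID. Using a timestamp instead would also force a new ID on deploys that change nothing.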

I think that for the operator to support reusing the same kubernetes.cluster-id, it would first need to shut down the running job and stop the old cluster, and only then start the new cluster and deploy the app. Currently it tries to minimize downtime by keeping both clusters running during the upgrade. It would be nice to have that mode too.

anandswaminathan commented 2 years ago

@nikolasten Is it only kubernetes.cluster-id?

anandswaminathan commented 2 years ago

It's here https://github.com/lyft/flinkk8soperator/blob/6264b5a2badba62500a5a7e7f1366493a62fa618/pkg/controller/flink/container_utils.go#L213

nikolasten commented 2 years ago

This is a config option for ZooKeeper only, not for Kubernetes HA. We did this in our fork to enable it and to make sure it's different every time we deploy or upgrade the app: https://github.com/bluelabs-eu/flinkk8soperator/commit/fa64278343aab41a6815343665a342944ccc9510#diff-0e21f32f488d8c4a8aeb58de476274825e4004216515b5bcbcbe0045efe08b00R215-R218

This PR, https://github.com/lyft/flinkk8soperator/pull/170, addresses changing the cluster ID on every deploy, but it does not add a config option for Kubernetes-based HA mode.
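To illustrate the difference between the two HA modes, here is a minimal sketch of picking which cluster-ID key to set based on the configured HA backend. It is not the fork's actual change, and the function names are hypothetical. It assumes that high-availability.cluster-id is the ZooKeeper-side option being referred to, and that Kubernetes HA is selected either with the value kubernetes or with the KubernetesHaServicesFactory class name.

```python
from typing import Optional

# Assumption: the two ways Kubernetes HA is commonly selected in flink-conf.
K8S_HA_FACTORY = (
    "org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory"
)


def cluster_id_key(flink_config: dict) -> Optional[str]:
    """Return the config key that separates clusters for the configured HA mode."""
    mode = str(flink_config.get("high-availability", "")).strip()
    if mode.lower() == "zookeeper":
        return "high-availability.cluster-id"
    if mode.lower() == "kubernetes" or mode == K8S_HA_FACTORY:
        return "kubernetes.cluster-id"
    return None  # HA not configured (or an unknown backend): nothing to set


def render_with_cluster_id(flink_config: dict, cluster_id: str) -> dict:
    """Return a copy of flink_config with the per-deploy cluster ID under the right key."""
    rendered = dict(flink_config)
    key = cluster_id_key(flink_config)
    if key is not None:
        rendered[key] = cluster_id
    return rendered
```

With something like this, a ZooKeeper setup keeps getting its ID under the ZooKeeper-side key, while a Kubernetes HA setup gets a fresh kubernetes.cluster-id on each deploy.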