lyft / flinkk8soperator

Kubernetes operator that provides control plane for managing Apache Flink applications
Apache License 2.0

JobManager and TaskManager pods are duplicated after deletion and later addition of FlinkApplication custom resource #160

Closed anekdoti closed 4 years ago

anekdoti commented 4 years ago

I am using v0.4.0 of the Flink operator in a Kubernetes 1.15 cluster with different namespaces. As a first step, I removed a FlinkApplication resource in the namespace team1, which led (correctly) to the deletion of the respective jobmanager and taskmanager pods.

After recreating the FlinkApplication resource in the same namespace, the Flink operator logs:

{"json":{"app_name":"flinkApp","ns":"team1","phase":""},"level":"info","msg":"Handling state for application","ts":"2020-01-15T13:53:25Z"}
{"json":{"app_name":"flinkApp","ns":"team1","phase":""},"level":"error","msg":"K8s object creation failed deployments.apps \"flinkApp-job-0d9a57e8-jm\" already exists","ts":"2020-01-15T13:53:25Z"}
{"json":{"app_name":"flinkApp","ns":"team1","phase":""},"level":"info","msg":"Jobmanager deployment already exists","ts":"2020-01-15T13:53:25Z"}
{"json":{"app_name":"flinkApp","ns":"team1","phase":""},"level":"error","msg":"K8s object creation failed services \"flinkApp\" already exists","ts":"2020-01-15T13:53:25Z"}
{"json":{"app_name":"flinkApp","ns":"team1","phase":""},"level":"info","msg":"Jobmanager service already exists","ts":"2020-01-15T13:53:25Z"}
{"json":{"app_name":"flinkApp","ns":"team1","phase":""},"level":"error","msg":"K8s object creation failed services \"flinkApp-0d9a57e8\" already exists","ts":"2020-01-15T13:53:25Z"}
{"json":{"app_name":"flinkApp","ns":"team1","phase":""},"level":"info","msg":"Versioned Jobmanager service already exists","ts":"2020-01-15T13:53:25Z"}
{"json":{"app_name":"flinkApp","ns":"team1","phase":""},"level":"error","msg":"K8s object creation failed ingresses.extensions \"flinkApp\" already exists","ts":"2020-01-15T13:53:25Z"}
{"json":{"app_name":"flinkApp","ns":"team1","phase":""},"level":"info","msg":"Jobmanager ingress already exists","ts":"2020-01-15T13:53:25Z"}
{"json":{"app_name":"flinkApp","ns":"team1","phase":""},"level":"error","msg":"K8s object creation failed deployments.apps \"flinkApp-0d9a57e8-tm\" already exists","ts":"2020-01-15T13:53:25Z"}
{"json":{"app_name":"flinkApp","ns":"team1","phase":""},"level":"info","msg":"Taskmanager deployment already exists","ts":"2020-01-15T13:53:25Z"}
{"json":{"app_name":"flinkApp","ns":"team1","phase":""},"level":"error","msg":"K8s object update failed flinkapplications.flink.k8s.io \"flinkApp\" is forbidden: User \"system:serviceaccount:flink-operator:flinkoperator\" cannot update resource \"flinkapplications/status\" in API group \"flink.k8s.io\" in the namespace \"team1\"","ts":"2020-01-15T13:53:25Z"}
{"json":{"app_name":"flinkApp","ns":"team1","phase":""},"level":"warning","msg":"Failed to reconcile resource team1/flinkApp: flinkapplications.flink.k8s.io \"flinkApp\" is forbidden: User \"system:serviceaccount:flink-operator:flinkoperator\" cannot update resource \"flinkapplications/status\" in API group \"flink.k8s.io\" in the namespace \"team1\"","ts":"2020-01-15T13:53:25Z"}

Only one FlinkApplication custom resource is present in the namespace, but the jobmanagers and taskmanagers are duplicated as shown by kubectl get pods -n team1:

NAME                                               READY   STATUS    RESTARTS   AGE
flinkApp-0d9a57e8-jm-5857b767d-t5qss    1/1     Running   0          6m24s
flinkApp-0d9a57e8-tm-b755dc7df-pfrdn    1/1     Running   0          6m23s
flinkApp-0d9a57e8-tm-b755dc7df-qlbzd    1/1     Running   0          6m23s
flinkApp-da1ad18c-jm-6cb4d86958-7rhpk   1/1     Running   0          6m26s
flinkApp-da1ad18c-tm-85787c45cc-d9f5n   1/1     Running   0          6m26s
flinkApp-da1ad18c-tm-85787c45cc-gpbrc   1/1     Running   0          6m26s
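
The duplication is easier to see when the pods are grouped by the hash embedded in their names. The following Python sketch counts jobmanager and taskmanager pods per hash from the listing above. It assumes the pod name layout `<app>-<hash>-<jm|tm>-<replicaset>-<pod>`, which is inferred from this listing rather than from the operator's documentation; `pods_per_hash` is a hypothetical helper.

```python
from collections import Counter

# Output of `kubectl get pods -n team1`, copied from the listing above.
PODS_OUTPUT = """\
NAME                                               READY   STATUS    RESTARTS   AGE
flinkApp-0d9a57e8-jm-5857b767d-t5qss    1/1     Running   0          6m24s
flinkApp-0d9a57e8-tm-b755dc7df-pfrdn    1/1     Running   0          6m23s
flinkApp-0d9a57e8-tm-b755dc7df-qlbzd    1/1     Running   0          6m23s
flinkApp-da1ad18c-jm-6cb4d86958-7rhpk   1/1     Running   0          6m26s
flinkApp-da1ad18c-tm-85787c45cc-d9f5n   1/1     Running   0          6m26s
flinkApp-da1ad18c-tm-85787c45cc-gpbrc   1/1     Running   0          6m26s
"""


def pods_per_hash(output):
    """Count pods per role (jm/tm) for each hash found in the pod names."""
    counts = {}
    for row in output.splitlines()[1:]:  # skip the header row
        if not row.strip():
            continue
        name = row.split()[0]
        # Assumed layout: <app>-<hash>-<role>-<replicaset>-<pod>.
        # Splitting from the right keeps hyphens in the app name intact.
        _app, app_hash, role, _rs, _pod = name.rsplit("-", 4)
        counts.setdefault(app_hash, Counter())[role] += 1
    return counts


if __name__ == "__main__":
    for app_hash, roles in pods_per_hash(PODS_OUTPUT).items():
        print(app_hash, dict(roles))
```

More than one hash for a single FlinkApplication outside of an in-progress deployment suggests leftover resources from an earlier version of the application.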

I would have expected only one jobmanager and two taskmanagers to be instantiated.

anandswaminathan commented 4 years ago

@anekdoti The new release is not backward compatible. Please check release notes.

Please make sure to update the CRD. Also, this update should not be deployed to a cluster where active flinkapplication updates are occurring; that is, all flinkapplications should be in a Running or DeployFailed state.

anekdoti commented 4 years ago

Thank you, @anandswaminathan. It seems that this is actually what happened.

anandswaminathan commented 4 years ago

@anekdoti Cool. Closing it. Reopen if needed.