lyft / flinkk8soperator

Kubernetes operator that provides control plane for managing Apache Flink applications
Apache License 2.0

CRD schema required for object validation in newer Kubernetes releases #253

Open skidder opened 2 years ago

skidder commented 2 years ago

Newer versions of Kubernetes (e.g. 1.20+) perform type validation of data written to custom resources. The CRD for the Flink operator currently does not define a schema for the status object. As a result, Flink applications never advance in the state machine, because their updated state cannot be written.
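
A rough illustration of why a missing status schema could produce this behavior, assuming the CRD is served through apiextensions.k8s.io/v1, where fields not declared in the schema are pruned on write. The Python below is a simplified model of pruning, not the Kubernetes implementation; the two schemas, the `prune` helper, the image name, and the phase value are hypothetical and not taken from the operator.

```python
# Simplified model of structural-schema pruning (NOT the real Kubernetes code).
# All names and values here are hypothetical, for illustration only.

PRESERVE = "x-kubernetes-preserve-unknown-fields"


def prune(obj, schema):
    """Return a copy of obj that keeps only the fields the schema declares."""
    if not isinstance(obj, dict):
        return obj
    if schema.get(PRESERVE):
        # The schema explicitly opts out of pruning for this subtree.
        return obj
    props = schema.get("properties", {})
    return {k: prune(v, props[k]) for k, v in obj.items() if k in props}


# Hypothetical schema that declares spec but has no entry for status.
schema_without_status = {
    "type": "object",
    "properties": {
        "spec": {"type": "object", PRESERVE: True},
    },
}

# Same schema with a status entry added.
schema_with_status = {
    "type": "object",
    "properties": {
        "spec": {"type": "object", PRESERVE: True},
        "status": {"type": "object", PRESERVE: True},
    },
}

# An update in which the operator tries to record a new phase.
update = {
    "spec": {"image": "example/flink-app"},
    "status": {"phase": "Running"},
}

pruned_without = prune(update, schema_without_status)
pruned_with = prune(update, schema_with_status)

print(pruned_without)  # the status block is dropped, so the phase is never stored
print(pruned_with)     # the status block survives
```

In an actual CRD, the equivalent change would be declaring `status` under `openAPIV3Schema.properties`, either with explicit fields or with `x-kubernetes-preserve-unknown-fields: true`. Whether this is the whole cause here is an assumption.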

The operator then produces logs like the following, where the app never advances beyond the CreatingCluster state:

{"json":{"app_name":"flink-test-app","ns":"aws-us-east-1-dos1","phase":""},"level":"info","msg":"Handling state for application","ts":"2022-03-11T04:21:38Z"}
{"json":{"app_name":"flink-test-app","ns":"aws-us-east-1-dos1","phase":""},"level":"info","msg":"Jobmanager service already exists","ts":"2022-03-11T04:21:38Z"}
{"json":{"app_name":"flink-test-app","ns":"aws-us-east-1-dos1","phase":""},"level":"info","msg":"Logged Normal event: CreatingCluster: Creating Flink cluster for deploy b6178add","ts":"2022-03-11T04:21:38Z"}
{"json":{"app_name":"flink-test-app","ns":"aws-us-east-1-dos1","phase":""},"level":"info","msg":"Handling state for application","ts":"2022-03-11T04:21:38Z"}
{"json":{"app_name":"flink-test-app","ns":"aws-us-east-1-dos1","phase":""},"level":"info","msg":"Jobmanager deployment already exists","ts":"2022-03-11T04:21:38Z"}
{"json":{"app_name":"flink-test-app","ns":"aws-us-east-1-dos1","phase":""},"level":"info","msg":"Jobmanager service already exists","ts":"2022-03-11T04:21:38Z"}
{"json":{"app_name":"flink-test-app","ns":"aws-us-east-1-dos1","phase":""},"level":"info","msg":"Versioned Jobmanager service already exists","ts":"2022-03-11T04:21:38Z"}
{"json":{"app_name":"flink-test-app","ns":"aws-us-east-1-dos1","phase":""},"level":"info","msg":"Taskmanager deployment already exists","ts":"2022-03-11T04:21:38Z"}
{"json":{"app_name":"flink-test-app","ns":"aws-us-east-1-dos1","phase":""},"level":"info","msg":"Handling state for application","ts":"2022-03-11T04:22:07Z"}
{"json":{"app_name":"flink-test-app","ns":"aws-us-east-1-dos1","phase":""},"level":"info","msg":"Jobmanager deployment already exists","ts":"2022-03-11T04:22:07Z"}
{"json":{"app_name":"flink-test-app","ns":"aws-us-east-1-dos1","phase":""},"level":"info","msg":"Jobmanager service already exists","ts":"2022-03-11T04:22:07Z"}
{"json":{"app_name":"flink-test-app","ns":"aws-us-east-1-dos1","phase":""},"level":"info","msg":"Versioned Jobmanager service already exists","ts":"2022-03-11T04:22:07Z"}
{"json":{"app_name":"flink-test-app","ns":"aws-us-east-1-dos1","phase":""},"level":"info","msg":"Taskmanager deployment already exists","ts":"2022-03-11T04:22:07Z"}

lydian commented 1 year ago

I'm not sure what went wrong, but I am seeing the same issue, and applying the change from #254 doesn't seem to help.

Do you have any other ideas about what could be related to this issue? Thanks!