lightbend / cloudflow

Cloudflow enables users to quickly develop, orchestrate, and operate distributed streaming applications on Kubernetes.
https://cloudflow.io
Apache License 2.0

Improve CF operator error message on blueprint mis-configuration #743

Open tringuyen-yw opened 4 years ago

tringuyen-yw commented 4 years ago

Is your feature request related to a problem? Please describe. We are using CF 2.0.10 (both in the application and in the CF operator on AKS). Some developers attempted to configure the Kafka topics used by the Flink streamlets by adding Kafka properties to the blueprint. The Cloudflow runtime interprets those Kafka settings as a reference to an external Kafka cluster, which the deployment doesn't have.

As a result, the JobManager pod of the Flink streamlet failed to start. The Cloudflow operator then went into CrashLoopBackOff when it tried to heal the CF application by recreating the failing pod. The two failures fed each other, leaving the Cloudflow operator stuck in a permanent Run <-> CrashLoopBackOff cycle.

Is your feature request related to a specific runtime of cloudflow or applicable for all runtimes? Tested with a Flink streamlet on an AKS cluster.

Describe the solution you'd like The CF operator could not create any pod for the application, i.e. the deployed CF app could not even start. The only diagnostic information is the log from the pod running Cloudflow operator 2.0.10. This log contains warning and error messages that are ambiguous, which led to an incorrect interpretation. The developer didn't suspect an error in the blueprint config because the same app deploys fine on a cluster running Cloudflow operator 2.0.5.

Additional context The blueprint and CF operator logs are attached.

Blueprint_Crashing_CFOperator_2-0-10.zip

RayRoestenburg commented 3 years ago

Let's see if this still happens with the latest version.

RayRoestenburg commented 3 years ago

(And if we can prevent it with more validation)
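
For illustration only, a pre-deployment check along the following lines could flag the situation described in the report before any pod is created. This is a minimal sketch in Python, not Cloudflow's actual validation code: the dictionary layout, the use of the Kafka property `bootstrap.servers` as the marker for an external cluster, and the names `validate_topics` and `known_clusters` are all assumptions made for the example.

```python
# Hypothetical sketch: none of these names come from Cloudflow itself.
# A topic is modeled as a dict of settings taken from the blueprint.

# Assumed marker: a topic that sets this Kafka property is treated as
# pointing at an externally managed Kafka cluster.
EXTERNAL_MARKER = "bootstrap.servers"


def validate_topics(topics, known_clusters):
    """Return a list of human-readable problems, empty if none are found.

    topics: mapping of topic name -> dict of settings from the blueprint.
    known_clusters: set of bootstrap addresses the deployment can reach.
    """
    problems = []
    for name, settings in sorted(topics.items()):
        servers = settings.get(EXTERNAL_MARKER)
        if servers is None:
            continue  # topic relies on the default Kafka cluster
        if servers not in known_clusters:
            problems.append(
                f"topic '{name}': '{EXTERNAL_MARKER}' is set to '{servers}', "
                "which is read as an external Kafka cluster, but no such "
                "cluster is configured for this deployment"
            )
    return problems


if __name__ == "__main__":
    # Made-up example data: one topic with Kafka properties, one without.
    blueprint_topics = {
        "sensor-data": {"bootstrap.servers": "kafka.example:9092"},
        "metrics": {},
    }
    for problem in validate_topics(blueprint_topics, known_clusters=set()):
        print(problem)
```

Rejecting the blueprint at this stage, with the offending topic named in the message, would point at the misconfiguration directly instead of leaving it to be inferred from the operator's restart loop.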