Is your feature request related to a problem? Please describe.
Using CF 2.0.10 (both in the application and the CF operator on AKS). Some devs attempted to configure the Kafka topics used by the Flink streamlets by adding Kafka properties in the blueprint. The Cloudflow runtime interprets that Kafka config as an external Kafka cluster, which the deployment doesn't have.
As a result, the JobManager pod of the Flink streamlet failed to start. The Cloudflow operator then entered a CrashLoopBackOff when it tried to heal the CF application by recreating the failing pod. The two failures reinforced each other, leaving the Cloudflow operator in a permanent Run <-> CrashLoopBackOff cycle.
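For illustration only, the problematic pattern might look like the fragment below. This is a hypothetical sketch, not the attached blueprint: the streamlet and topic names are invented, and the exact keys that trigger the external-Kafka interpretation may differ.

```hocon
blueprint {
  streamlets {
    my-streamlet = com.example.MyFlinkStreamlet
  }
  topics {
    my-topic {
      producers = [my-streamlet.out]
      # Kafka properties placed here can be interpreted by the
      # Cloudflow runtime as pointing to an external Kafka cluster
      # that the deployment does not actually have.
      bootstrap.servers = "my-kafka:9092"
      partitions = 10
    }
  }
}
```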
Is your feature request related to a specific runtime of cloudflow or applicable for all runtimes?
Tested with a Flink streamlet on an AKS cluster.
Describe the solution you'd like
The CF operator could not create any pod for the application, i.e. the deployed CF app could not even start. The only diagnostic info is the logs from the pod running Cloudflow operator 2.0.10. These logs contain warning and error messages which are ambiguous, and this led to an incorrect interpretation: the developer didn't suspect an error in the blueprint config because the same app deploys fine on a cluster running Cloudflow operator 2.0.5.
The logs of the CF operator (see attached CF-Operator-PodLog_crashedByBadBlueprint.log) are not obvious. They should be more specific, hinting that the config error is located in the blueprint. There are many configuration sources in a CF application; an error message like "Scheduling retry to get resource myapp/mystreamlet1, reason: InvalidConfigurationException" seems to point at a Kubernetes config or some other configuration location.
Is the attached Blueprint_Crashing_CFOperator_2-0-10.conf a valid blueprint? If there is anything wrong in it, could this be detected at the runLocal stage or by kubectl-cloudflow deploy?
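A pre-deploy check of the kind asked about could, for example, scan the blueprint's topic sections for connection-level Kafka keys and warn before anything reaches the operator. The sketch below is a minimal illustration in Python; the key list, the blueprint-as-dict shape, and the function name are assumptions for illustration, not the actual kubectl-cloudflow implementation.

```python
# Sketch of a blueprint lint that flags Kafka connection properties in
# topic config, which the runtime would treat as an external Kafka cluster.
# EXTERNAL_KAFKA_KEYS and the blueprint shape are assumptions, not
# Cloudflow's API.

EXTERNAL_KAFKA_KEYS = {"bootstrap.servers", "connection-config"}

def find_external_kafka_config(topics: dict) -> list:
    """Return warnings for topics whose config contains connection-level keys."""
    warnings = []
    for topic, conf in topics.items():
        for key in conf:
            if key in EXTERNAL_KAFKA_KEYS:
                warnings.append(
                    f"topic '{topic}': key '{key}' in the blueprint will be "
                    "interpreted as an external Kafka cluster"
                )
    return warnings

# Example: a topic section that mixes Kafka connection properties in.
topics = {
    "my-topic": {
        "producers": ["my-streamlet.out"],
        "bootstrap.servers": "my-kafka:9092",
        "partitions": 10,
    }
}
for warning in find_external_kafka_config(topics):
    print(warning)
```

A check like this could run in runLocal and in kubectl-cloudflow deploy, so the error surfaces at deploy time with a blueprint-specific message instead of an operator crash loop.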
Additional context
blueprint and CF operator logs attached
Blueprint_Crashing_CFOperator_2-0-10.zip