DarthKrab opened this issue 4 years ago
Hi @DarthKrab, Cloudflow 2.0 has significantly better support for what you are trying to do: https://cloudflow.io/docs/current/develop/cloudflow-configuration.html
You can now pass resource requirements in a configuration file, using the --conf flag with kubectl cloudflow deploy and kubectl cloudflow configure. The configuration can apply to one specific streamlet or to all streamlets of a runtime (Flink, for example). Using the new configuration model you can configure both runtime-scoped and streamlet-scoped settings; a sketch follows below.
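For example, a configuration file along these lines sets memory requests and limits for all Flink streamlets at the runtime scope. This is a sketch based on the documentation linked above; the exact configuration paths and the file name pipeline.conf are illustrative, so double-check them against the docs for the Cloudflow version you run:

```hocon
// pipeline.conf — passed with --conf to kubectl cloudflow deploy or configure.
// Runtime scope: applies to every streamlet executed by the Flink runtime.
cloudflow.runtimes.flink.kubernetes.pods.pod.containers.container {
  resources {
    requests {
      memory = "2048M"
      cpu    = "0.5"
    }
    limits {
      memory = "2048M"
    }
  }
}
```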
Is it possible for you to try v2.0.5? There is a migration guide here: https://cloudflow.io/docs/current/project-info/migration-1_3-2_0.html
Hi @RayRoestenburg, thanks for your reply.
We have to make a critical decision about migrating to 2.0.5, including possible preliminary experiments and hypothesis checks on it, so it is extremely important for us to be sure that the new configuration mechanism works as designed, especially for Flink-related resources. Are you aware of anyone who has successfully tuned Flink's memory parameters using Cloudflow 2.0.5, or would it be better to check this thoroughly before the migration?
We have tested that changing the memory requirements works for Flink streamlets.
Hi @RayRoestenburg
Many thanks for the assistance!
You’re welcome!
Hi! Unfortunately, this problem still exists in the new version of Cloudflow (2.0.5). Pods of the new version of the pipeline are created, live for 5 minutes, and are then deleted. Log of the Flink operator: not_update.log (https://github.com/lightbend/cloudflow/files/4947705/not_update.log)
There is no information about these events in the cloudflow-operator log.
Immediately after installing the new version of the platform, pipeline updates work correctly. But as the number of pipelines in the cluster grows, the problem reappears.
There are no resource problems (CPU, RAM) in the cluster. I undeployed all the pipelines (using kubectl-cloudflow undeploy) and installed them again; the problem remains.
One more thing, though I'm not sure it's relevant: during installation of the pipeline we made a mistake in the name of a Kafka topic. After deploying the pipeline, the cloudflow-operator failed with this error: err_1.log (https://github.com/lightbend/cloudflow/files/4948017/err_1.log)
I successfully re-created the cloudflow-operator pod. After that we noticed the problem for the first time.
Sorry, maybe I should have been clearer in my response: did you use the new configuration feature in 2.0? Please see https://cloudflow.io/docs/current/develop/cloudflow-configuration.html#_configuring_streamlets_using_the_streamlet_scope
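With the streamlet scope, the same kind of settings can be targeted at a single streamlet. A minimal sketch, again based on the linked docs; the streamlet name my-flink-streamlet is purely illustrative, use the name from your blueprint:

```hocon
// Streamlet scope: applies only to the named streamlet.
cloudflow.streamlets.my-flink-streamlet {
  kubernetes.pods.pod.containers.container.resources {
    requests { memory = "1024M" }
    limits   { memory = "1024M" }
  }
}
```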
Thanks for the answer! As far as I understand, kubectl-cloudflow configure is used to update the pipeline configuration. But if I don't need to change the pipeline configuration, how do I correctly set up the process for updating the pipeline code?
I have a running pipeline (kubectl-cloudflow deploy was used for the installation). Let's say a developer has updated the streamlet code and my task is to update the running pipeline. I build the project and get the message:
[success] Use the following command to deploy the Cloudflow application:
[success] kubectl cloudflow deploy /opt/home/build-dir/JOB/target/pipeline.json
What is the right thing to do next?
You can use --conf with deploy as well.
Great. I'm executing the command: kubectl cloudflow deploy /opt/home/build-dir/JOB/target/pipeline.json --conf pipeline.conf
After that, new streamlets are created in the pipeline project's namespace. In the cloudflow-flink-operator log I see: "msg":"Application resource has changed. Moving to Updating"
After 5 minutes the new pods go to Terminating status and the cluster deletes them; the pods of the old pipeline version remain. In the cloudflow-flink-operator log I see: "msg":"Logged Warning event: ClusterCreationFailed: Flink cluster failed to become available: failed to make progress after 5m0s" "msg":"Logged Warning event: RolledBackDeploy: Successfully rolled back deploy f396fdc7"
This is the problem. Full log:
FYI, we have also seen this occur: for some reason the Flink operator cannot always make progress, and it then falls back to the previous Flink cluster. The 5 minutes can be explained by this setting: https://github.com/lyft/flinkk8soperator/blob/master/pkg/controller/config/config.go#L26
I have a problem changing the Flink application in my pipeline. The problem is this: for example, I want to change the amount of memory for the task manager in one of the job clusters, so I change the default value in the Flink application via spec.taskManagerConfig.resources.requests.memory. After that, new deployments are created in the namespace (screenshot: http://joxi.ru/GrqkRYvck3xEQ2) that duplicate the existing ones. Flink-operator logs: https://paste.ubuntu.com/p/Dfys67smSN/ As a result, the new streamlets are deleted and my changes are not applied.
However, for some job clusters no new deployment is created when the Flink application is updated; a new version of the current deployment is created instead (screenshot: http://joxi.ru/n2YowdZTZ3y88r), and in that case everything is updated correctly.
I use Cloudflow 1.3.3 on OpenShift 3.11.
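For reference, the change described above edits the flinkk8soperator FlinkApplication custom resource roughly as follows. This is a sketch: the metadata name is illustrative, most fields are elided, and the apiVersion may differ per operator version:

```yaml
# Fragment of a FlinkApplication custom resource (lyft/flinkk8soperator).
# Only the task manager memory request is shown.
apiVersion: flink.k8s.io/v1beta1
kind: FlinkApplication
metadata:
  name: my-pipeline-streamlet   # illustrative
spec:
  taskManagerConfig:
    resources:
      requests:
        memory: "2048Mi"
```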