spring-cloud / spring-cloud-dataflow

Microservices-based streaming and batch data processing in Cloud Foundry and Kubernetes
https://dataflow.spring.io
Apache License 2.0

Triggering a Spring Batch job (that has remote partition steps) using Spring Cloud Data Flow creates an incorrect data set in the Task Execution table. #5738

Closed: csumutaskin closed this issue 7 months ago

csumutaskin commented 7 months ago

Description:

We are implementing a multi-pod batch infrastructure using the Spring Batch framework. A sample batch application built on this architecture is triggered through Spring Cloud Data Flow; the Spring Batch application is packaged in a JRE-capable image and the batch execution runs in an OpenShift environment. The infrastructure also uses Spring Batch's remote partitioning mechanism.
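
For context, our worker-per-pod setup follows the general Spring Cloud Task partitioning approach. The configuration below is only a minimal sketch of that style of setup, not our actual code: the step names (`masterStep`, `workerStep`), the application name, the worker count, and the `workerResource` bean are illustrative assumptions, and the exact `DeployerPartitionHandler` constructor and setter signatures vary between spring-cloud-task versions.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.core.explore.JobExplorer;
import org.springframework.batch.core.partition.PartitionHandler;
import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;
import org.springframework.cloud.deployer.spi.task.TaskLauncher;
import org.springframework.cloud.task.batch.partition.DeployerPartitionHandler;
import org.springframework.cloud.task.batch.partition.PassThroughCommandLineArgsProvider;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.io.Resource;

@Configuration
public class PartitionedJobConfig {

    // Split the work into one partition per worker; a real job would key this
    // off data ranges, files, etc.
    @Bean
    public Partitioner partitioner() {
        return gridSize -> {
            Map<String, ExecutionContext> partitions = new HashMap<>();
            for (int i = 0; i < gridSize; i++) {
                ExecutionContext ctx = new ExecutionContext();
                ctx.putInt("partitionNumber", i);
                partitions.put("partition" + i, ctx);
            }
            return partitions;
        };
    }

    // The master (manager) step fans out "workerStep" partitions; the handler
    // launches each partition as a separate task, i.e. one pod per worker.
    @Bean
    public Step masterStep(StepBuilderFactory steps,
                           Partitioner partitioner,
                           PartitionHandler partitionHandler) {
        return steps.get("masterStep")
                .partitioner("workerStep", partitioner)
                .partitionHandler(partitionHandler)
                .build();
    }

    // workerResource must resolve to the worker application artifact
    // (e.g. a docker: or maven: resource handled by the deployer's resource loader).
    @Bean
    public PartitionHandler partitionHandler(TaskLauncher taskLauncher,
                                             JobExplorer jobExplorer,
                                             Resource workerResource) {
        DeployerPartitionHandler handler =
                new DeployerPartitionHandler(taskLauncher, jobExplorer, workerResource, "workerStep");
        handler.setCommandLineArgsProvider(new PassThroughCommandLineArgsProvider(
                Arrays.asList("--spring.profiles.active=worker")));
        handler.setMaxWorkers(2);                         // two worker pods, as in the example below
        handler.setApplicationName("partitionedJobTask"); // illustrative name
        return handler;
    }
}
```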

We extended the KubernetesTaskLauncher class so that a pod is created for each slave (worker) step, and each worker step is assigned its own task execution id by SCDF. For example, assume task execution id 820110 is assigned to the master pod (the pod that runs the master step, which creates two slave partitions, i.e. remote steps that run in different pods), and the worker pods get execution ids 820111 and 820112.

When everything succeeds, nothing looks wrong. However, if at least one of the worker pod executions fails, the overall task should be marked as ERROR. Instead, the SCDF dashboard appears to select, for a given task name, the latest task execution (the one whose start time is closest to now) and to derive the task status from that execution alone. So if a worker execution that is not the most recent one fails while the most recently started worker ends with success, the task is shown as COMPLETE even though it should be ERROR.
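
To illustrate the mismatch independently of SCDF internals, here is a small standalone example (the `Exec` records are hypothetical stand-ins for TASK_EXECUTION rows; ids, timestamps, and exit codes are made up, and Java 17 record syntax is assumed):

```java
import java.time.LocalDateTime;
import java.util.Comparator;
import java.util.List;

public class StatusMismatchExample {

    // Hypothetical stand-in for a TASK_EXECUTION row sharing one task name.
    record Exec(long id, LocalDateTime startTime, int exitCode) { }

    public static void main(String[] args) {
        List<Exec> sameNamedExecutions = List.of(
                new Exec(820110, LocalDateTime.parse("2023-01-01T10:00:00"), 0), // master
                new Exec(820111, LocalDateTime.parse("2023-01-01T10:00:05"), 1), // worker, failed
                new Exec(820112, LocalDateTime.parse("2023-01-01T10:00:10"), 0)  // worker, completed
        );

        // Dashboard-style view as described above: status of the execution
        // with the latest start time only.
        Exec latest = sameNamedExecutions.stream()
                .max(Comparator.comparing(Exec::startTime))
                .orElseThrow();
        System.out.println("Latest execution status: "
                + (latest.exitCode() == 0 ? "COMPLETE" : "ERROR"));

        // Expected aggregate view: ERROR if any execution under the task name failed.
        boolean anyFailed = sameNamedExecutions.stream().anyMatch(e -> e.exitCode() != 0);
        System.out.println("Aggregate status: " + (anyFailed ? "ERROR" : "COMPLETE"));
    }
}
```

Running this prints "Latest execution status: COMPLETE" but "Aggregate status: ERROR", which is exactly the discrepancy we see on the dashboard.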

Release versions:
Implementation: 2.9.0-SNAPSHOT (spring-cloud-dataflow-server)
Core: 2.9.0-SNAPSHOT (Spring Cloud Data Flow Core)
Dashboard: 3.2.0-SNAPSHOT (Spring Cloud Dataflow UI)
Shell: 2.9.0-SNAPSHOT (Spring Cloud Data Flow Shell)

Runtime Environment - Task Launcher

Steps to reproduce:

Screenshots:

Here is a screenshot of the expected result, with two different task executions for each worker pod of the same batch job application: Expected Result

Here is a screenshot of the unexpected result, with two different task executions for each worker pod of the same batch job application. One has already failed, but the task is marked as successful: Unexpected Result

cppwfs commented 7 months ago

A quick question: in this case, do the slave (worker) apps have the same task name?

csumutaskin commented 7 months ago

Yes, we named the master and slave tasks identically.

cppwfs commented 7 months ago

In this case, the status shown for a task is the status of the last task execution that ran. I think what you are looking for is the state of the job execution, which you can find under the Job Executions tab.
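
For completeness, that job-level state can also be read programmatically through Spring Batch's JobExplorer. The sketch below is only an illustration (the job name passed in is an assumed placeholder, and this is not SCDF dashboard code); a failed worker step marks the whole JobExecution as FAILED, which is what the Job Executions tab surfaces:

```java
import org.springframework.batch.core.BatchStatus;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobInstance;
import org.springframework.batch.core.explore.JobExplorer;

public class JobStatusCheck {

    private final JobExplorer jobExplorer;

    public JobStatusCheck(JobExplorer jobExplorer) {
        this.jobExplorer = jobExplorer;
    }

    /** Returns true if any execution of the most recent job instance failed. */
    public boolean lastRunFailed(String jobName) {
        for (JobInstance instance : jobExplorer.getJobInstances(jobName, 0, 1)) {
            for (JobExecution execution : jobExplorer.getJobExecutions(instance)) {
                // A failed partition (worker step) propagates to the JobExecution status.
                if (execution.getStatus() == BatchStatus.FAILED) {
                    return true;
                }
            }
        }
        return false;
    }
}
```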

cppwfs commented 7 months ago

Thank you for opening this issue. If you think this issue was closed in error please add a comment.

csumutaskin commented 7 months ago

We renamed the slave tasks so they differ from the master task (see the sketch below). Still, I feel that Spring Batch remote partitioning triggered through SCDF is not fully compatible with SCDF's task model. No Spring documentation warns that remote steps (which are also tasks in this scenario) should be given different task names, and, as you say, the dashboard always uses the last task execution that ran when deciding the overall task status, which leads to the unexpected dashboard behavior I described. Anyway, thank you, you may close the issue. (You already did, but these were my final comments, thanks.)
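
For anyone hitting the same thing, this is roughly how a distinct worker task name can be passed through the partition handler's command-line args. It is a hedged sketch only, not an official recommendation: the names are illustrative, and it assumes the spring.cloud.task.name property and the PassThroughCommandLineArgsProvider from Spring Cloud Task.

```java
import java.util.Arrays;

import org.springframework.cloud.task.batch.partition.DeployerPartitionHandler;
import org.springframework.cloud.task.batch.partition.PassThroughCommandLineArgsProvider;

public final class WorkerNaming {

    private WorkerNaming() { }

    public static void applyWorkerTaskName(DeployerPartitionHandler handler) {
        handler.setCommandLineArgsProvider(new PassThroughCommandLineArgsProvider(Arrays.asList(
                "--spring.profiles.active=worker",
                // spring.cloud.task.name controls the TASK_NAME recorded for each worker execution,
                // so the workers no longer share the master's task name on the dashboard.
                "--spring.cloud.task.name=partitionedJob-worker")));
    }
}
```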

cppwfs commented 7 months ago

Thank you for your feedback!