We currently use Luigi extensively for managing data processing pipelines. Pipelines are represented as DAGs in Python code and are executed on a single host with a configurable number of workers. Luigi is fairly stagnant and feature-light (e.g. its retry support exists but has proven lacking for several projects). We often deploy dedicated VMs to run Luigi workloads, and those VMs sit idle much of the time. It would be better to have a production cluster of machines dedicated to running data workflows, which would make it easier to manage processing resources.
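To make the execution model concrete, the "DAG in Python code, run on one host with a configurable worker pool" pattern can be sketched in plain Python. This is an illustrative toy, not Luigi's actual API; the task names and graph here are hypothetical:

```python
# Conceptual sketch of single-host DAG execution with a worker pool.
# Not Luigi's API; task names and structure are made up for illustration.
from concurrent.futures import ThreadPoolExecutor

# Each task maps to the list of tasks it depends on.
DAG = {
    "extract": [],
    "clean": ["extract"],
    "aggregate": ["clean"],
    "report": ["aggregate", "clean"],
}

def run_pipeline(dag, workers=2):
    """Run tasks in dependency order, dispatching ready tasks to a pool."""
    done, order = set(), []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while len(done) < len(dag):
            # A task is ready when all of its dependencies have completed.
            ready = [t for t, deps in dag.items()
                     if t not in done and all(d in done for d in deps)]
            # Here each "task" is a no-op; a real task would do actual work
            # (read inputs, write outputs) inside the function passed to map.
            for t in pool.map(lambda name: name, ready):
                done.add(t)
                order.append(t)
    return order
```

Running `run_pipeline(DAG)` executes `extract` before `clean`, and `report` only after both of its dependencies, which is the ordering guarantee any of the workflow systems under discussion must provide.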
I think we should prefer a workflow system that does not require a cluster and that fully supports local execution, for ease of development and testing. When we deploy to production, using a single workflow management system gives us:
- Broad observability of multiple workflows from a "single pane of glass"
- Ability to more effectively utilize our compute resources
- Ability to distribute workloads across multiple machines
Some open-source tools: