nsidc / .github


Explore workflow management solutions for data processing #9

Open MattF-NSIDC opened 1 year ago

MattF-NSIDC commented 1 year ago

We frequently use Luigi to manage data processing pipelines. Pipelines are represented as DAGs in Python code and are executed on a single host with a configurable number of workers. Luigi is fairly stagnant and feature-light (e.g. it supports retries, but that support has proven inadequate for various projects). We often deploy dedicated VMs to run Luigi workloads, and those VMs sit idle much of the time. It would be better to have a production cluster of machines dedicated to running data workflows, which would make processing resources easier to manage.
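For anyone unfamiliar with the model, here is a rough standard-library sketch of what "a DAG in Python code, run on one host with a configurable number of workers" means. This is not Luigi's API; `run_dag` and the task names (`download`, `regrid`, `stats`, `publish`) are hypothetical and only illustrate the execution pattern.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter


def run_dag(tasks, deps, workers=2):
    """Run callables in dependency order on a single host.

    tasks: {name: callable}; deps: {name: set of names it depends on}.
    Returns task names in the order they finished.
    """
    sorter = TopologicalSorter({name: deps.get(name, set()) for name in tasks})
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    finished = []
    running = {}  # future -> task name
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while sorter.is_active():
            # Submit every task whose dependencies have all completed.
            for name in sorter.get_ready():
                running[pool.submit(tasks[name])] = name
            done, _ = wait(running, return_when=FIRST_COMPLETED)
            for future in done:
                future.result()  # re-raise a task failure (no retry here)
                name = running.pop(future)
                finished.append(name)
                sorter.done(name)  # unblocks downstream tasks
    return finished


# Hypothetical four-step pipeline: two middle steps can run in parallel.
tasks = {name: (lambda: None) for name in ("download", "regrid", "stats", "publish")}
deps = {
    "regrid": {"download"},
    "stats": {"download"},
    "publish": {"regrid", "stats"},
}
order = run_dag(tasks, deps, workers=2)
```

Note that this sketch fails fast on the first task error; retry policy, scheduling across hosts, and idle-resource management are exactly the parts we would want a workflow system to provide.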

I think we should avoid a workflow system that requires a cluster, and instead pick one that fully supports local execution for ease of development and testing. When we deploy to production, using a single workflow management system gives us:

Some open-source tools: