ooni / data

OONI Data CLI and Pipeline v5
https://docs.ooni.org/data

Evaluate orchestration systems for ooni/data #46

Closed hellais closed 4 months ago

hellais commented 12 months ago

At the moment OONI/data just runs through cronjob-based scheduling.

This is a bit suboptimal because we don't have support for nice logging, retries, or monitoring of task execution. It's also not simple to clearly define task dependencies.
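To make the gap concrete, here is a rough sketch of the bare minimum an orchestrator gives us over cron: tasks run in dependency order, failures are retried, and every attempt is logged. It only uses the Python standard library (`graphlib` needs Python 3.9+). The names `run_with_retries` and `run_pipeline` and the task names in the usage note are made up for illustration; they are not part of ooni/data or of any of the tools under evaluation.

```python
import logging
from graphlib import TopologicalSorter

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")


def run_with_retries(name, func, max_attempts=3):
    """Call func, retrying on any exception up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            result = func()
            log.info("task %s succeeded on attempt %d", name, attempt)
            return result
        except Exception:
            # Log the traceback so that a failed attempt is visible afterwards.
            log.exception("task %s failed on attempt %d", name, attempt)
    raise RuntimeError(f"task {name} failed after {max_attempts} attempts")


def run_pipeline(tasks, dependencies, max_attempts=3):
    """Run tasks in dependency order and return the order they ran in.

    tasks: dict mapping a task name to a zero-argument callable.
    dependencies: dict mapping a task name to the set of task names
        that must finish before it starts.
    """
    order = []
    # static_order() yields each task only after all of its dependencies.
    for name in TopologicalSorter(dependencies).static_order():
        run_with_retries(name, tasks[name], max_attempts)
        order.append(name)
    return order
```

For example, with three hypothetical tasks where `process` depends on `fetch` and `analyze` depends on `process`, calling `run_pipeline(tasks, {"fetch": set(), "process": {"fetch"}, "analyze": {"process"}})` runs them in that order and retries any task that raises. A real orchestrator adds persistence, scheduling, a UI and parallel execution on top of this, which is the part that is painful to build ourselves.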

In the past we used Airflow for this use case (and Luigi before that). In the current pipeline we don't use an orchestrator at all and just rely on systemd, because Airflow was such a pain to administer and manage.

It looks like the space of orchestration has moved forward quite a bit and there are several nice looking tools in this space at the moment:

It might be worth spending some time evaluating these options and seeing if it makes sense to use them.

If we pick one of these orchestration tools, given that most of them also support parallelization, we could even get rid of Dask and replace it with whatever orchestration tool we choose.

That would simplify the codebase and make it easier to monitor and troubleshoot the pipeline in production.

ainghazal commented 7 months ago

Not in the same general category as the ones you mention above, but one feature I like in Toil is its support for a workflow DSL (CWL/WDL) and the ability to run the same jobs either locally with minimal overhead (file store) or on a fully-fledged HPC system. I'm thinking about reproducibility and about supporting research on smaller subsets of the global dataset.

hellais commented 4 months ago

I would say we are pretty happy with Temporal, so we can consider this done.