unity-sds / unity-sps

The Unity SDS Processing Service facilitates large-scale data processing for scientific workflows.
Apache License 2.0

Automatically generate a DAG that can execute an arbitrary CWL workflow #92

Open LucaCinquini opened 1 month ago

LucaCinquini commented 1 month ago

Once a user has created a CWL workflow and made it available at some URL (for example, an Application Package available from Dockstore), we can imagine triggering the OGC register() method to automatically generate a DAG that is very similar to the current generic cwl_dag.py, but includes some customizations: the DAG name, the DAG id, and the specific parameters needed by the CWL workflow.

Start simple by generating the "Echo" DAG, which is able to execute this CWL workflow:

https://raw.githubusercontent.com/unity-sds/unity-sps-workflows/main/demos/echo_message.cwl

It should be very similar to this DAG: https://github.com/unity-sds/unity-sps/blob/develop/airflow/dags/cwl_dag.py but customized for the Echo use case.

We can explore either inheriting from a base CWL DAG, or generating the Echo DAG from scratch from a Template.

GodwinShen commented 1 month ago

@jpl-btlunsfo ping for status.

LucaCinquini commented 4 weeks ago

@jpl-btlunsfo: I looked at the article you mentioned, which uses Python dataclasses to automatically generate DAGs: https://medium.com/cts-technologies/designing-repeatable-dags-in-airflow-part-1-db3a72a2307d

Although it will work, it seems unnecessarily complicated to me, and one disadvantage is that the DAGs are saved in the globals() scope, and not written to the DAGs folder, which reduces visibility.

I suggest using a simpler approach, like the one outlined in this article: https://www.astronomer.io/docs/learn/dynamically-generating-dags?tab=taskflow#example-use-a-create_dag-function

In particular, the second option: "Multiple Files Method". In summary, this would be implemented as follows:

o Create a file "include/dag_template.py" which dynamically creates a DAG based on some input parameters.
o Implement the OGC register() method which, based on the input request, executes that function to create a file in the DAGs folder, replacing the dag_template.py variables with specific values (for example, the DAG name and the CWL file).

The above approach seems much simpler and easier to debug to me.
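
A minimal sketch of the two steps above, to make the idea concrete. Everything in it is an assumption for illustration: the template text, the register() signature, and the DAG_ID / CWL_WORKFLOW variable names are hypothetical, and a real template would contain the body of the generic cwl_dag.py rather than two assignments.

```python
from pathlib import Path
from string import Template

# Hypothetical template; in the proposal this text would live in include/dag_template.py.
DAG_TEMPLATE = Template('''\
# Generated file: do not edit by hand.
DAG_ID = $dag_id
CWL_WORKFLOW = $cwl_workflow
# The body of the generic CWL DAG would follow here and use the two values above.
''')


def register(dag_id: str, cwl_workflow: str, dags_folder: Path) -> Path:
    """Write one DAG file for a CWL workflow into dags_folder and return its path.

    Hypothetical stand-in for the OGC register() handler.
    """
    if not dag_id.isidentifier():
        raise ValueError(f"dag_id must be usable as a file name: {dag_id!r}")
    dags_folder.mkdir(parents=True, exist_ok=True)
    target = dags_folder / f"{dag_id}.py"
    # repr() renders each value as a Python string literal, so quotes in the
    # input cannot break the syntax of the generated file.
    content = DAG_TEMPLATE.substitute(
        dag_id=repr(dag_id),
        cwl_workflow=repr(cwl_workflow),
    )
    target.write_text(content)
    return target
```

Because the result is an ordinary file in the DAGs folder, it can be opened, diffed, and debugged like any hand-written DAG.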

mike-gangl commented 4 weeks ago

As long as I can execute the CWL with an "arbitrary" JSON object or a link to a JSON/YAML file, like the cwl_dag we have, I'd be very happy in the near term.
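
One way to read that requirement, as a sketch only: a generated DAG would accept a single arguments parameter and decide whether it holds an inline JSON object or a link to a JSON/YAML file. The function name normalize_cwl_args and the ("inline" / "link") return convention are hypothetical, not taken from the existing cwl_dag.

```python
import json
from urllib.parse import urlparse


def normalize_cwl_args(cwl_args):
    """Classify CWL job arguments.

    Returns ("inline", dict) for a JSON object, or ("link", url) for an
    http(s) URL pointing at a .json, .yaml, or .yml file.
    """
    if isinstance(cwl_args, dict):
        return ("inline", cwl_args)
    text = str(cwl_args).strip()
    parsed = urlparse(text)
    if parsed.scheme in ("http", "https") and parsed.path.endswith((".json", ".yaml", ".yml")):
        return ("link", text)
    # Anything else must parse as a JSON object; json.loads raises a
    # ValueError subclass on malformed input.
    value = json.loads(text)
    if not isinstance(value, dict):
        raise ValueError("CWL arguments must be a JSON object")
    return ("inline", value)
```

The link case is passed through untouched here, on the assumption that the CWL runner fetches and parses the file itself.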