unity-sds / unity-sps

The Unity SDS Processing Service facilitates large-scale data processing for scientific workflows.
Apache License 2.0

[New Feature]: Stage-In Task #220

Open LucaCinquini opened 2 months ago

LucaCinquini commented 2 months ago

Write a DAG/Task that invokes the DS Docker container to stage-in data from the DS catalog. Eventually this Task needs to be executed as part of the new Application Package CWL DAG, and followed by the Process task.

LucaCinquini commented 1 month ago

Nga provided examples of 2 CWL workflows for stage-in, one for when data is downloaded from a DAAC and one for when it is downloaded from Unity:

https://github.com/unity-sds/unity-data-services/tree/cwl-examples/cwl

We can start by creating a DAG Task that, depending on what the user selects, will invoke one or the other CWL, staging data to EFS so it can be used by the subsequent Process task.

LucaCinquini commented 1 month ago

Suggested steps:
o Create a new DAG called cwl_dag_new.py with (initially) the following tasks:
o A "setup task" that will expose 2 parameters:

o Depending on input_location, the "cwl_workflow" parameter is set to https://github.com/unity-sds/unity-data-services/blob/cwl-examples/cwl/stage-in-unity/stage-in.cwl or https://github.com/unity-sds/unity-data-services/blob/cwl-examples/cwl/stage-in-daac/stage-in.cwl (use the raw URLs)
o I think the "download_dir" parameter can be hardwired to "granules" or "input" or whatever
o Then invoke the "cwl_task" which will write to /scratch/granules or /scratch/input
o In the "cleanup" task first list the content of the "local_dir" directory, then erase it (for now)
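
A minimal sketch of the selection logic in the setup task, in case it helps. The accepted values "unity" and "daac" for input_location and the function names are assumptions for illustration only; the blob URLs are the two quoted above, converted to the raw.githubusercontent.com form.

```python
from urllib.parse import urlparse

# Blob URLs quoted in the suggested steps; the keys "unity" and "daac"
# are assumed values for input_location, not confirmed names.
STAGE_IN_BLOB_URLS = {
    "unity": "https://github.com/unity-sds/unity-data-services/blob/cwl-examples/cwl/stage-in-unity/stage-in.cwl",
    "daac": "https://github.com/unity-sds/unity-data-services/blob/cwl-examples/cwl/stage-in-daac/stage-in.cwl",
}


def to_raw_url(blob_url: str) -> str:
    """Convert a github.com 'blob' URL into its raw.githubusercontent.com form."""
    parts = urlparse(blob_url).path.strip("/").split("/")
    # Expected path shape: <owner>/<repo>/blob/<ref>/<file path...>
    if len(parts) < 5 or parts[2] != "blob":
        raise ValueError(f"Not a GitHub blob URL: {blob_url}")
    owner, repo, _blob, *ref_and_path = parts
    return "https://raw.githubusercontent.com/" + "/".join([owner, repo, *ref_and_path])


def select_stage_in_workflow(input_location: str, download_dir: str = "granules") -> dict:
    """Return the parameters the setup task would hand to the cwl_task (hypothetical helper)."""
    key = input_location.strip().lower()
    if key not in STAGE_IN_BLOB_URLS:
        raise ValueError(f"Unsupported input_location: {input_location!r}")
    return {
        "cwl_workflow": to_raw_url(STAGE_IN_BLOB_URLS[key]),
        "download_dir": download_dir,
    }
```

Rejecting unknown values up front keeps a typo in input_location from silently falling through to one of the two workflows.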

Note that the other parameters such as: "unity_client_id" should be retrieved from SSM - see example:

https://github.com/unity-sds/unity-sps/blob/95dad09ea661f4b37b1aa000f29b33c605ded554/airflow/dags/sbg_L1_to_L2_e2e_cwl_step_by_step_dag.py#L290
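
For reference, a minimal sketch of reading one such value from SSM, assuming boto3 is available in the Airflow environment. The helper name and the parameter path in the usage comment are hypothetical; the real SSM keys are the ones in the linked DAG.

```python
def get_ssm_parameter(name, ssm_client=None, with_decryption=True):
    """Fetch a single parameter value from AWS SSM Parameter Store.

    `ssm_client` can be injected (e.g. a stub in tests); otherwise a
    boto3 client is created using the default credentials and region.
    """
    if ssm_client is None:
        import boto3  # imported lazily so the helper can be exercised without AWS

        ssm_client = boto3.client("ssm")
    response = ssm_client.get_parameter(Name=name, WithDecryption=with_decryption)
    return response["Parameter"]["Value"]


# Hypothetical usage; the parameter path below is a placeholder, not a real key:
# unity_client_id = get_ssm_parameter("/path/to/unity_client_id")
```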

nikki-t commented 3 weeks ago

Here is a first draft of the stage in task: https://github.com/unity-sds/unity-sps/blob/220-stage-in-task/airflow/dags/cwl_dag_modular.py

LucaCinquini commented 1 week ago

Nikki implemented and demonstrated all functionality, so this part of the CWL refactoring is done.

LucaCinquini commented 1 week ago

Re-opening this task as we discussed a new design which involves executing the 3 steps (stage-in, process, stage-out) sequentially within the same shell script, running in the same Docker container. This will guarantee that all 3 tasks have access to the data on a shared EBS volume.
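
A rough sketch of the sequential idea, written in Python with subprocess rather than as the shell script itself, just to illustrate the control flow. The function name and the commands passed in are placeholders; the actual stage-in, process, and stage-out invocations are not defined here.

```python
import subprocess
from pathlib import Path


def run_steps_sequentially(steps, work_dir):
    """Run (name, command) pairs one after another in the same working directory.

    Every step sees the files written by the previous ones because they all
    run with `work_dir` as their current directory. A non-zero exit code
    raises CalledProcessError, so later steps are skipped on failure.
    """
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    completed = []
    for name, command in steps:
        subprocess.run(command, cwd=work_dir, check=True)
        completed.append(name)
    return completed


# Hypothetical usage; the three commands are placeholders:
# run_steps_sequentially(
#     [("stage-in", [...]), ("process", [...]), ("stage-out", [...])],
#     "/scratch",
# )
```

Stopping at the first failing step mirrors what `set -e` would do in a shell script, so stage-out never runs on partial or missing output.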

LucaCinquini commented 3 days ago

Examples of 3 sequential CWL stage-in / process / stage-out workflows provided by Mike:

https://github.com/mike-gangl/unity-OGC-example-application/blob/main/README.md#ogc-run