kinow / kinoshita.eti.br

kinow website
https://kinoshita.eti.br

Create a list of models and tools that can be used to test workflow managers #197

Open kinow opened 1 year ago

kinow commented 1 year ago

The idea here is to find at least a couple, maybe three or four, models and/or tools that can be used to create the same workflow in Cylc, ecFlow, Autosubmit, Steep WMS (cyclic), StreamFlow (cyclic w/ CWL dev loops), etc., and in the process take notes of what can be improved in each workflow manager.

At the same time, one of these will be used to produce RO-Crates and validate the Autosubmit RO-Crate implementation, and it will be uploaded to WorkflowHub.eu (https://github.com/ResearchObject/ro-crate-py/issues/148).

The notes about implementing the workflow in different WMSs may help identify features that are missing or could be improved, and may also serve as a resource for the maintainers of these WMSs if they choose to support different use cases (e.g. some WMSs may not be suitable for climate models with ensembles that require restarting/re-running, or for cyclic NWP models with critical operational needs), or if they decide to support RO-Crate.

Requirements

Bonus points for the use case that:

Models and tools

Wave models

Earth System models

Hydrology

Software related to models

Links

RO-Crates

While integrating these models and tools into workflows for different workflow managers, it's possible to take notes on how easy it would be for these workflows to be archived as an RO-Crate.

It's clear now that:

  1. Some workflow managers won't have all the necessary (or useful, like authors) data in their configuration and might require extra work to get that information into crates.
     1.1. That can be solved now with a custom JSON file containing entries compatible with the JSON-LD used to add/update entries in the RO-Crate file - https://github.com/ResearchObject/ro-crate-py/pull/149
  2. Some workflow managers won't have a list of inputs and outputs used in the process encapsulated by the workflow (e.g. Cylc, ecFlow, Autosubmit).
     2.1. In cases like this, the approach above might be useful when combined with entries that provide a list of inputs/outputs, maybe using glob patterns like **/*.nc.
     2.2. It might be hard or nearly impossible to use BioSchemas FormalParameters as CWL/Galaxy/StreamFlow (these mainly rely on CWL, I think): https://github.com/ResearchObject/ro-crate-py/issues/148#issuecomment-1477563171. So in these cases we can just have a list of inputs & outputs as File and Dataset.
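
A rough sketch of how the two points above could combine: glob for outputs such as **/*.nc, turn each match into a JSON-LD File entry, and put them next to Person entries for the authors. It uses only the Python standard library and does not call ro-crate-py. The function names and the "@graph" wrapper are assumptions made for illustration; the exact shape of the custom JSON file accepted by the pull request linked above is not shown here.

```python
import json
from pathlib import Path


def file_entities(root, pattern="**/*.nc"):
    """Return JSON-LD File entries for files under root matching pattern."""
    root = Path(root)
    entities = []
    for path in sorted(root.glob(pattern)):
        if path.is_file():
            entities.append({
                # Paths are stored relative to the crate root, POSIX style.
                "@id": path.relative_to(root).as_posix(),
                "@type": "File",
                "name": path.name,
                "contentSize": str(path.stat().st_size),
            })
    return entities


def build_patch(root, authors, pattern="**/*.nc"):
    """Combine author entries and globbed files into one JSON-LD fragment.

    The "@graph" wrapper is an assumption, not a documented format.
    """
    people = [
        {"@id": author["id"], "@type": "Person", "name": author["name"]}
        for author in authors
    ]
    return {"@graph": people + file_entities(root, pattern)}


def patch_to_json(patch):
    """Serialise the fragment so it can be written to a file."""
    return json.dumps(patch, indent=2)
```

Calling `build_patch("output_dir", [{"id": "#author-1", "name": "Some Author"}])` would give one Person entry followed by one File entry per NetCDF file found; both the directory and the author values here are placeholders.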
kinow commented 1 year ago

mHM

Will start with mHM since it has great docs, the source is simple & clear, and since it is not a fully coupled ESM it should be easier to run (:crossed_fingers:). They provide two “test domains” that can be executed after mhm is installed and that produce some NetCDF files that can be plotted with ncview.

So for the RO-Crate file, maybe an Autosubmit + mHM workflow could work. It'd be better if the workflow also prepared data for mHM based on the selected days for the workflow, thus using at least start dates in Autosubmit (no chunking, but not a blocker, I think).

The easiest test scenario would be somewhere in Germany or Europe (as the data mentioned comes from EU agencies). But maybe it'd be possible to use somewhere else like Tamana-shi, Kumamoto, Japan, or Noumea, New Caledonia (or both).


2023-04-08

The GIS data preparation step is a bit hard to follow, especially if ArcGIS Map is really needed (would be easier with QGIS). So creating the data for another basin looks like a task that demands more time than a few hours every other weekend. Let's see if there's some data ready to be used, and that can be used with different days.


2023-04-09

So, using their test domains: the mhm.nml file has "periods". One appears to be for training, and the other for running the model (inference?). The training period must fall within the time range covered by the input data (1980 to 2000, but I think only 1990-2000 can be used).
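
The constraint on the periods can be sketched as a small check that a workflow could run before launching the model. The 1990-2000 bounds come from the guess above and are not confirmed; the exact first and last days, as well as the names `DATA_START`, `DATA_END` and `period_is_valid`, are assumptions for illustration and are not taken from mhm.nml.

```python
from datetime import date

# Assumed usable range of the test-domain input data (unconfirmed guess).
DATA_START = date(1990, 1, 1)
DATA_END = date(2000, 12, 31)


def period_is_valid(start, end, data_start=DATA_START, data_end=DATA_END):
    """True when start is not after end and both lie inside the data range."""
    return data_start <= start <= end <= data_end
```

For example, `period_is_valid(date(1990, 1, 1), date(1993, 12, 31))` is true, while a period starting in 1985 is rejected.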

That can be used, then, to create a workflow that takes as input the dates for these periods (or maybe just for running the model). The output of the workflow would be the outputs of the mHM model (netcdf files and another txt file). Perhaps we could also have an extra task to run ncview and export a plot, also used as output.
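
As a WMS-neutral way to think about this, the workflow described above can be written down as a small dependency graph: a setup step, the mHM run, and the optional plotting step. This is only a sketch using Python's standard `graphlib` (Python 3.9+); the task names `setup`, `run_mhm` and `plot` are hypothetical and do not correspond to any particular workflow manager's configuration.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on.
# Task names are hypothetical placeholders.
WORKFLOW = {
    "setup": set(),          # fetch test domains, write the periods into the config
    "run_mhm": {"setup"},    # run the model, producing NetCDF and text outputs
    "plot": {"run_mhm"},     # export a plot from the NetCDF output
}


def execution_order(graph):
    """Return the tasks in an order that respects every dependency."""
    return list(TopologicalSorter(graph).static_order())
```

Each WMS would express the same three dependencies in its own syntax, which is where the notes on differences between them would come from.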

All of this can be packed as an RO-Crate (without using FormalParameters), and it should run on any of these WMSs.


2023-04-11

Created a repository for an Autosubmit workflow to run mHM: https://github.com/kinow/auto-mhm-test-domains

It includes the test domain data from 5.12.0, but that will be replaced by a task that clones the repository for v5.12.0 instead, to avoid including data with a different license in the git repo. This will be a good test for an RO-Crate with an Autosubmit Project of type Git (that needs to be an input in the workflow).

The LOCAL_SETUP part of the workflow is complete; I will continue tomorrow between meetings. But it's looking good, and is probably a good example for RO-Crate (and for the automated documentation, a future feature).