eWaterCycle dataset storage systems

nlesc-sigs / data-sig

Linked data, data & modeling SIG

eWaterCycle dataset storage systems #16

Closed sverhoeven closed 6 years ago

sverhoeven commented 6 years ago

For the eWaterCycle 2 project we want hydrologists to be able to:

explore simulation input and output data
run models
couple models

The coupling and running of models will be described in a Python script.

Running a model requires many different datasets, such as terrain elevation and temperature/precipitation over time. Some of them are in NetCDF format.

In the project we would like to store datasets in a system that:

is searchable by humans and machines
can be accessed from the Python script
can be used in an HPC environment via staging or direct access
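
As a rough illustration of these requirements (not a design the project has chosen), the sketch below models a dataset catalog that a script can query by variable or file format, with each entry recording whether it is staged or accessed directly. The `DatasetRecord` class, the `search` function, the field names, and the file paths are all hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DatasetRecord:
    """Hypothetical catalog entry; every field name here is illustrative."""
    name: str
    variables: List[str]
    file_format: str
    location: str            # placeholder URI, not a real endpoint
    access: str = "staging"  # "staging" (copy to the HPC system first) or "direct"


def search(catalog: List[DatasetRecord],
           variable: Optional[str] = None,
           file_format: Optional[str] = None) -> List[DatasetRecord]:
    """Return the records that match every criterion that was supplied."""
    hits = []
    for record in catalog:
        if variable is not None and variable not in record.variables:
            continue
        if file_format is not None and record.file_format.lower() != file_format.lower():
            continue
        hits.append(record)
    return hits


# Made-up entries, loosely based on the dataset types mentioned above.
catalog = [
    DatasetRecord("terrain", ["elevation"], "other",
                  "file:///data/terrain", access="direct"),
    DatasetRecord("forcing", ["temperature", "precipitation"], "NetCDF",
                  "file:///data/forcing.nc"),
]

netcdf_hits = search(catalog, file_format="netcdf")
precip_hits = search(catalog, variable="precipitation")
```

A real system would back this with a metadata service rather than an in-memory list; the point is only that the same metadata serves both machine queries and the staging decision.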

In the project we are looking at which generic storage systems could be used and which hydrology-specific solutions are out there.

romulogoncalves commented 6 years ago

Some questions:

Is the input data already FAIR?

The project output is not only data, but also a model. How to make a model FAIR?

FAIRness for models is challenging because it involves both FAIR software and FAIR data. Maybe in the end we need to see them as FAIR digital objects, as specified in the FAIR metrics work. Hence, we should avoid categorizing them as either FAIR software or FAIR data.

arnikz commented 6 years ago

Some additional questions to consider: what other file formats (besides NetCDF) are used for sharing (meta)data, models, etc. in this domain? What are the usual file sizes? Are the data hierarchical or graph-like? For file-based (meta)data management I would recommend iRODS, and the Semantic Web/Linked Data approach for (federated) queries using rich, standardized, machine-readable metadata ;)

sverhoeven commented 6 years ago

For the models, we decided that each model should have a Basic Model Interface (https://github.com/csdms/bmi), and we will ship the models in Docker images.
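
To make the idea concrete, here is a minimal sketch of how two models exposing BMI-style methods could be run and coupled from a Python script. It is an illustration under assumptions, not the project's implementation: `ToyModel`, `run_coupled`, the variable names `"input"` and `"output"`, and the model names are hypothetical, and the signatures are simplified (for example, the real BMI `get_value` fills a destination array rather than returning a scalar).

```python
class ToyModel:
    """Toy stand-in for a model; method names follow BMI, signatures are simplified."""

    def __init__(self, name, gain):
        self.name = name
        self.gain = gain  # placeholder "physics": output = gain * input
        self.time = 0.0
        self.values = {}

    def initialize(self, config_file=None):
        # A real model would read its configuration file here.
        self.time = 0.0
        self.values = {"input": 0.0, "output": 0.0}

    def update(self):
        # Advance the model by one time step.
        self.values["output"] = self.gain * self.values["input"]
        self.time += 1.0

    def get_current_time(self):
        return self.time

    def get_value(self, var_name):
        return self.values[var_name]

    def set_value(self, var_name, value):
        self.values[var_name] = value

    def finalize(self):
        # Release model resources.
        self.values = {}


def run_coupled(upstream, downstream, forcing):
    """Step both models together, passing upstream output to downstream input."""
    upstream.initialize()
    downstream.initialize()
    results = []
    for value in forcing:
        upstream.set_value("input", value)
        upstream.update()
        downstream.set_value("input", upstream.get_value("output"))
        downstream.update()
        results.append(downstream.get_value("output"))
    upstream.finalize()
    downstream.finalize()
    return results


results = run_coupled(
    ToyModel("rainfall_runoff", gain=0.5),
    ToyModel("routing", gain=2.0),
    forcing=[1.0, 2.0, 4.0],
)
```

Because the driver only calls the shared interface methods, either model could be swapped for another one (for instance, one running inside a Docker container) without changing the coupling loop.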

The sizes differ depending on the scale of the simulation. For example, a global model generates 200 GB for a 7-day forecast. Other models will be much smaller.

When we have other storage requirements we will open a new issue.