Evaluate different platforms and tools for core data library infrastructure

rabernat commented 2 years ago

The LEAP Data Library will be developed in collaboration with @jhamman and @andersy005 of Carbonplan. The overall description of this component of LEAP-Pangeo is:

The data library will provide analysis-ready, cloud-optimized data for all aspects of LEAP. The data library is directly inspired by the IRI Data Library mentioned above; however, LEAP-Pangeo data will be hosted in the cloud, for maximum impact, accessibility, and interoperability.

The contents of the data library will evolve dynamically based on the needs of the project.

This Data Library has to be more than just a bucket of data in the cloud. Data in the library need to be organized into a catalog so that they are easily discoverable and interoperable across the project and broader community.

Requirements

The data library platform should:

Understand how to handle Zarr / netCDF style data
"Cloud-native": can point to data on object storage (rather than on disk)
Provide a machine-readable catalog for browsing and searching (e.g. intake, STAC, etc.)
Provide a web interface for browsing and searching
Web interface should have an "admin" feature which allows manually editing the catalog / metadata
Be able to ingest data from Pangeo Forge
Allow users to manually upload data somehow
Provide persistent identifiers for all datasets (DOIs ideal, but we don't have to do that right away)
Integrate with LEAP Hub (e.g. links to automatically open a dataset in the hub)
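
As a rough illustration of what a machine-readable catalog entry covering several of these requirements might look like, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `DatasetRecord` class, its field names, and the hub link format are hypothetical and are not taken from intake, STAC, or any existing LEAP schema.

```python
from dataclasses import dataclass, asdict, field
from typing import Optional
from urllib.parse import quote
import json


@dataclass
class DatasetRecord:
    """Hypothetical catalog entry; all field names are illustrative only."""
    dataset_id: str                 # stable identifier within the catalog
    title: str
    store_url: str                  # object-storage location, e.g. gs:// or s3://
    data_format: str                # e.g. "zarr" or "netcdf"
    source: str = "manual-upload"   # could also be e.g. "pangeo-forge"
    doi: Optional[str] = None       # persistent identifier, may be assigned later
    keywords: list = field(default_factory=list)

    def hub_link(self, hub_base: str) -> str:
        """Build a (hypothetical) URL that would open this dataset in a hub."""
        return f"{hub_base}/open?dataset={quote(self.dataset_id, safe='')}"

    def to_json(self) -> str:
        """Serialize the record so other tools can read the catalog entry."""
        return json.dumps(asdict(self), sort_keys=True)
```

A record like this only points at data on object storage; it never assumes the files live on a local disk.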

Question: do we care about providing a private tier of data access? If so, that will increase the complexity a lot.

Option 1: Use a data library framework

There are many existing open-source frameworks for managing a data library. We should look at them and evaluate whether they can meet our requirements. Here are a few that I think are worth evaluating. Please suggest others if you are aware of them!

Invenio

https://invenio.readthedocs.io/en/latest/

Invenio is the framework that powers Zenodo

Invenio Framework is like a Swiss Army knife of battle-tested, safe and secure modules providing you with all the features you need to build a trusted digital repository.

Globus Modern Research Data Portal

https://docs.globus.org/modern-research-data-portal/

A Design Pattern for Networked, Data-Intensive Science

The Modern Research Data Portal is a new design pattern for providing secure, scalable, and high performance access to research data.

Datahub

https://datahubproject.io/

The Metadata Platform for the Modern Data Stack

Data ecosystems are diverse, too diverse. DataHub's extensible metadata platform enables data discovery, data observability and federated governance that helps you tame this complexity.

This one is a bit of an outlier, in that it is aimed more at enterprise use than at academic research. But I really like its core metadata model. One advantage of DataHub is that it is cloud native from the start, so it does not assume that we have a bunch of files on a hard disk somewhere. The metadata model also looks easy to extend.

Option 2: Build from Scratch

In this option, we do not use any framework. Instead, we roll our own. Components we would need to develop include:

Backend database and schema for datasets, users, etc.
Backend API for search / query (maybe based on elasticsearch)
Front end web application
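
To make the backend pieces a bit more concrete, here is a minimal sketch of a dataset table plus a search function. It is only an illustration under stated assumptions: it uses an in-memory SQLite database and a simple `LIKE` match as a stand-in for a real search service such as elasticsearch, and the table layout and function names (`create_catalog`, `add_dataset`, `search`) are hypothetical.

```python
import sqlite3


def create_catalog(conn: sqlite3.Connection) -> None:
    """Create a hypothetical 'datasets' table; the columns are illustrative."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS datasets (
            id          TEXT PRIMARY KEY,
            title       TEXT NOT NULL,
            description TEXT NOT NULL DEFAULT '',
            store_url   TEXT NOT NULL
        )
        """
    )


def add_dataset(conn, dataset_id, title, description, store_url) -> None:
    """Insert one dataset row using bound parameters."""
    conn.execute(
        "INSERT INTO datasets (id, title, description, store_url) VALUES (?, ?, ?, ?)",
        (dataset_id, title, description, store_url),
    )


def search(conn, term: str) -> list:
    """Return ids of datasets whose title or description contains the term.

    SQLite's LIKE is case-insensitive for ASCII by default. A '%' or '_' in
    the term acts as a wildcard here; a real service would need to escape it.
    """
    pattern = f"%{term}%"
    rows = conn.execute(
        "SELECT id FROM datasets WHERE title LIKE ? OR description LIKE ? ORDER BY id",
        (pattern, pattern),
    ).fetchall()
    return [row[0] for row in rows]
```

Even this toy version hints at the work involved: users, permissions, a front end, and real full-text search would all have to be built on top.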

Criteria for evaluation

For each possible framework, we should try to answer:

How many of our requirements does the framework meet "out of the box"?
How difficult is it to deploy the vanilla configuration? (weeks of effort)
For requirements that are not met out of the box, how much effort will it take to extend? (weeks of effort)
How widely used is this framework in the community? Will we be able to get support if we need it?
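
One possible way to keep these answers comparable across frameworks is to record them in a small structure and compute coverage and total effort from it. The sketch below is only an assumption about how that bookkeeping could be done; the short requirement keys and the `FrameworkEvaluation` class are hypothetical names, and no real framework scores are included.

```python
from dataclasses import dataclass, field

# Short, hypothetical keys standing in for the nine requirements listed above.
REQUIREMENTS = [
    "zarr_netcdf", "object_storage", "machine_catalog", "web_ui", "admin_ui",
    "pangeo_forge_ingest", "manual_upload", "persistent_ids", "hub_integration",
]


@dataclass
class FrameworkEvaluation:
    name: str
    met_out_of_box: set                 # requirement keys met without changes
    deploy_weeks: float                 # effort to deploy the vanilla configuration
    extend_weeks: dict = field(default_factory=dict)  # requirement key -> weeks

    def coverage(self) -> float:
        """Fraction of requirements met out of the box."""
        return len(self.met_out_of_box & set(REQUIREMENTS)) / len(REQUIREMENTS)

    def missing(self) -> list:
        """Requirements not met out of the box, in the listed order."""
        return [r for r in REQUIREMENTS if r not in self.met_out_of_box]

    def unestimated(self) -> list:
        """Missing requirements that have no extension estimate yet."""
        return [r for r in self.missing() if r not in self.extend_weeks]

    def total_weeks(self) -> float:
        """Deployment effort plus all extension estimates recorded so far."""
        known = sum(self.extend_weeks[r] for r in self.missing() if r in self.extend_weeks)
        return self.deploy_weeks + known
```

The community adoption question is qualitative and is deliberately left out of the numbers; it would need a written note per framework.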

I would like to see this evaluation completed by Friday May 6 so we can move forward quickly.

andersy005 commented 2 years ago

@rabernat, I just sent you a calendar invite for us to discuss the big picture/roadmap. In the meantime, I'm going to set some time aside to tinker with the different frameworks in Option 1 and will report back before our meeting on Friday.

andersy005 commented 2 years ago

Another existing data library framework that is worth looking into is Amundsen from Lyft: https://www.amundsen.io/, https://github.com/amundsen-io/amundsen/

Amundsen is a data discovery and metadata engine for improving the productivity of data analysts, data scientists and engineers when interacting with data. It does that today by indexing data resources (tables, dashboards, streams, etc.) and powering a page-rank style search based on usage patterns (e.g. highly queried tables show up earlier than less queried tables). Think of it as Google search for data.

rabernat commented 2 years ago

Amundsen looks super nice and definitely a little more approachable than DataHub in terms of deployment. But it was not immediately obvious to me whether it is extensible. Also, it looks like it only supports search, not browse (see roadmap). That might be a limitation.

rabernat commented 2 years ago

Notes from today's meeting

Questions about scope: are we cataloging only data produced by LEAP, or also data used by LEAP? Is CMIP6 data in scope?
How do you define a schema of an xarray / netcdf dataset?
Do we need to make the data target more than a passive S3 bucket ("data lake" concept)?
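
On the schema question, one possible starting point (purely a sketch, not a decision from the meeting) is to describe each variable by its dimension names and dtype, then check a dataset summary against that description. The `VariableSchema` class and `check_schema` function below are hypothetical; the summary is assumed to be built from what xarray exposes as `ds[name].dims` and `ds[name].dtype`, but the sketch itself uses only plain Python.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariableSchema:
    """Hypothetical per-variable schema: dimension names and dtype string."""
    dims: tuple
    dtype: str


def check_schema(expected: dict, actual: dict) -> list:
    """Compare two {variable name: VariableSchema} mappings.

    Returns a list of human-readable problems; an empty list means every
    expected variable is present with matching dims and dtype. Extra
    variables in `actual` are allowed.
    """
    problems = []
    for name, want in expected.items():
        got = actual.get(name)
        if got is None:
            problems.append(f"missing variable: {name}")
        elif got.dims != want.dims:
            problems.append(f"{name}: dims {got.dims} != {want.dims}")
        elif got.dtype != want.dtype:
            problems.append(f"{name}: dtype {got.dtype} != {want.dtype}")
    return problems
```

This ignores attributes, coordinates, and chunking, all of which a real schema for Zarr / netCDF data would probably need to cover.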

gentine commented 2 years ago

Questions about scope: are we cataloging only data produced by LEAP, or also data used by LEAP? Is CMIP6 data in scope?

We had indeed discussed cataloging CMIP6, tailored especially to outreach and knowledge transfer, but not for all models, just the most trustworthy ones (as Dave Lawrence framed it). That way we can be regarded as a trustworthy source for climate data.

gentine commented 2 years ago

I also wanted to point to the Climate Data Guide effort: https://dev-cdg-unity.pantheonsite.io/. It might be good to coordinate.

rabernat commented 2 years ago

This is a great blog post summarizing some of the solutions out there: https://sarahsnewsletter.substack.com/p/choosing-a-data-catalog

andersy005 commented 2 years ago

I recently learnt of another relevant platform: https://github.com/open-metadata/OpenMetadata. OpenMetadata looks quite neat. It is similar to DataHub in some aspects. Here's a nice demo video: https://www.youtube.com/watch?v=V_HkZsMkvho
