leap-stc / leap-data-library

Data library for LEAP

Evaluate different platforms and tools for core data library infrastructure #1

Open rabernat opened 2 years ago

rabernat commented 2 years ago

The LEAP Data Library will be developed in collaboration with @jhamman and @andersy005 of CarbonPlan. The overall description of this component of LEAP-Pangeo is:

The data library will provide analysis-ready, cloud-optimized data for all aspects of LEAP. The data library is directly inspired by the IRI Data Library mentioned above; however, LEAP-Pangeo data will be hosted in the cloud, for maximum impact, accessibility, and interoperability.

The contents of the data library will evolve dynamically based on the needs of the project.

This Data Library has to be more than just a bucket of data in the cloud. Data in the library need to be organized into a catalog so that they are easily discoverable and interoperable across the project and broader community.
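To make the "more than a bucket" point concrete, here is a minimal sketch of what a catalog layer adds on top of raw storage: a record per dataset plus a way to search those records. Everything in it (the `CatalogEntry` fields, the `Catalog` class, the example bucket URLs) is hypothetical and only illustrates the idea; it is not a proposed design for the LEAP Data Library.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CatalogEntry:
    """One dataset record in a hypothetical catalog (field names are illustrative)."""
    dataset_id: str
    title: str
    url: str  # where the cloud-hosted data lives, e.g. an object-store path
    keywords: List[str] = field(default_factory=list)


class Catalog:
    """A toy in-memory catalog: register entries, then search them by text."""

    def __init__(self) -> None:
        self._entries: Dict[str, CatalogEntry] = {}

    def add(self, entry: CatalogEntry) -> None:
        # Refuse duplicate IDs so each dataset has exactly one record.
        if entry.dataset_id in self._entries:
            raise ValueError(f"duplicate dataset_id: {entry.dataset_id}")
        self._entries[entry.dataset_id] = entry

    def search(self, term: str) -> List[CatalogEntry]:
        # Case-insensitive match on the title or on any keyword.
        term = term.lower()
        return [
            e for e in self._entries.values()
            if term in e.title.lower() or term in (k.lower() for k in e.keywords)
        ]
```

With something like this, `catalog.search("ocean")` returns the matching records (and their URLs) without the user needing to know how the bucket is laid out, which is the discoverability a bare bucket lacks.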

Requirements

The data library platform should:

Question: do we care about providing a private tier of data access? If so, that will increase the complexity a lot.

Option 1: Use a data library framework

There are many existing open-source frameworks for managing a data library. We should look at them and evaluate whether they can meet our requirements. Here are a few that I think are worth evaluating. Please suggest others if you are aware of them!

Invenio

https://invenio.readthedocs.io/en/latest/

Invenio is the framework that powers Zenodo.

Invenio Framework is like a Swiss Army knife of battle-tested, safe and secure modules providing you with all the features you need to build a trusted digital repository.

Globus Modern Research Data Portal

https://docs.globus.org/modern-research-data-portal/

A Design Pattern for Networked, Data-Intensive Science

The Modern Research Data Portal is a new design pattern for providing secure, scalable, and high performance access to research data.

DataHub

https://datahubproject.io/

The Metadata Platform for the Modern Data Stack

Data ecosystems are diverse, too diverse. DataHub's extensible metadata platform enables data discovery, data observability and federated governance that helps you tame this complexity.

This one is a bit of an outlier, in that it is aimed more at enterprise than academic research. But I really like its core metadata model. One advantage of DataHub is that it is cloud native from the start, so it does not assume that we have a bunch of files on a hard disk somewhere. The metadata model also looks easy to extend.

Option 2: Build from Scratch

In this option, we do not use any framework. Instead, we roll our own. Components we would need to develop would include

Criteria for evaluation

For each possible framework, we should try to answer


I would like to see this evaluation completed by Friday, May 6, so we can move forward quickly.

andersy005 commented 2 years ago

@rabernat, I just sent you a calendar invite for us to discuss the big picture/roadmap. In the meantime, I'm going to set some time aside to tinker with the different frameworks in Option 1 and will report back before our meeting on Friday.

andersy005 commented 2 years ago

Another existing data library framework that is worth looking into is Amundsen from Lyft: https://www.amundsen.io/, https://github.com/amundsen-io/amundsen/

Amundsen is a data discovery and metadata engine for improving the productivity of data analysts, data scientists and engineers when interacting with data. It does that today by indexing data resources (tables, dashboards, streams, etc.) and powering a page-rank style search based on usage patterns (e.g. highly queried tables show up earlier than less queried tables). Think of it as Google search for data.
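To illustrate the usage-based ranking idea in that description, here is a toy sketch. It is not Amundsen's implementation or API: the `rank_by_usage` function, the resource names, and the query counts are all made up, and a plain sort by query count stands in for whatever page-rank style scoring Amundsen actually uses.

```python
from typing import Dict, List


def rank_by_usage(resources: List[str], query_counts: Dict[str, int], term: str) -> List[str]:
    """Return the resources whose name contains `term`, most-queried first.

    Resources missing from `query_counts` are treated as never queried.
    Ties are broken alphabetically so the order is deterministic.
    """
    term = term.lower()
    matches = [name for name in resources if term in name.lower()]
    return sorted(matches, key=lambda name: (-query_counts.get(name, 0), name))
```

For example, given two tables that both match "sales", the one with the higher query count is listed first, which is the "highly queried tables show up earlier" behavior described above.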

rabernat commented 2 years ago

Amundsen looks super nice and definitely a little more approachable in terms of deployment than DataHub. But it was not immediately obvious to me whether it is extensible. Also, it looks like it offers only search, not browse (see roadmap). That might be a limitation.

rabernat commented 2 years ago

Notes from today's meeting

gentine commented 2 years ago

We had indeed discussed cataloging CMIP6, tailored especially to outreach and knowledge transfer, and covering not all models but only the most trustworthy ones (as Dave Lawrence framed it). That way we can be regarded as a trustworthy source for climate data.

gentine commented 2 years ago

I also wanted to point to the Climate Data Guide effort: https://dev-cdg-unity.pantheonsite.io/. It might be good to coordinate.

rabernat commented 2 years ago

This is a great blog post summarizing some of the solutions out there: https://sarahsnewsletter.substack.com/p/choosing-a-data-catalog

andersy005 commented 2 years ago

I recently learnt of another relevant platform: https://github.com/open-metadata/OpenMetadata. OpenMetadata looks quite neat. It is similar to DataHub in some aspects. Here's a nice demo video: https://www.youtube.com/watch?v=V_HkZsMkvho