Open rabernat opened 2 years ago
@rabernat, I just sent you a calendar invite so we can discuss the big picture and roadmap. In the meantime, I'm going to set aside some time to tinker with the different frameworks in Option 1 and will report back before our meeting on Friday.
Another existing data library framework that is worth looking into is Amundsen from Lyft: https://www.amundsen.io/, https://github.com/amundsen-io/amundsen/
Amundsen is a data discovery and metadata engine for improving the productivity of data analysts, data scientists and engineers when interacting with data. It does that today by indexing data resources (tables, dashboards, streams, etc.) and powering a page-rank style search based on usage patterns (e.g. highly queried tables show up earlier than less queried tables). Think of it as Google search for data.
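To make the "highly queried tables show up earlier" idea concrete, here is a minimal sketch of usage-ranked search. It is not Amundsen's implementation or API: the `Resource` class, the `query_count` field, the `search` function, and the sample entries are all hypothetical, and a plain sort on query counts stands in for whatever ranking Amundsen actually uses.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    """A hypothetical catalog entry; the fields are illustrative only."""
    name: str
    description: str
    query_count: int  # assumed usage signal: how often the resource was queried


def search(resources, term):
    """Return resources matching `term`, most heavily queried first."""
    term = term.lower()
    matches = [
        r for r in resources
        if term in r.name.lower() or term in r.description.lower()
    ]
    # Rank by usage: highly queried resources come before rarely queried ones.
    return sorted(matches, key=lambda r: r.query_count, reverse=True)


# Made-up sample entries, for illustration only.
catalog = [
    Resource("cmip6_tas_monthly", "CMIP6 surface air temperature", 12),
    Resource("cmip6_pr_daily", "CMIP6 precipitation", 340),
    Resource("era5_winds", "ERA5 reanalysis winds", 95),
]

results = search(catalog, "cmip6")
print([r.name for r in results])
```

With these sample counts, the more heavily queried `cmip6_pr_daily` is listed before `cmip6_tas_monthly`, and `era5_winds` is excluded because it does not match the search term.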
Amundsen looks super nice and definitely a little more approachable in terms of deployment than datahub. But it was not immediately obvious to me whether it is extensible. Also, it looks like it offers only search, not browse (see roadmap). That might be a limitation.
Notes from today's meeting
We did indeed discuss cataloging CMIP6, tailored especially to outreach and knowledge transfer, covering not all models but just the most trustworthy ones (as Dave Lawrence framed it). That way we can be regarded as a trustworthy source for climate data.
I wanted to also point to this effort of the Climate Data Guide https://dev-cdg-unity.pantheonsite.io/. Might be good to coordinate.
This is a great blog post summarizing some of the solutions out there: https://sarahsnewsletter.substack.com/p/choosing-a-data-catalog
I recently learnt of another relevant platform: https://github.com/open-metadata/OpenMetadata. OpenMetadata looks quite neat. It is similar to DataHub in some aspects. Here's a nice demo video: https://www.youtube.com/watch?v=V_HkZsMkvho
The LEAP Data Library will be developed in collaboration with @jhamman and @andersy005 of Carbonplan. The overall description of this component of LEAP-Pangeo is:
This Data Library has to be more than just a bucket of data in the cloud. Data in the library need to be organized into a catalog so that they are easily discoverable and interoperable across the project and broader community.
Requirements
The data library platform should:
Question: do we care about providing a private tier of data access? If so, that will increase the complexity a lot.
Option 1: Use a data library framework
There are many existing open-source frameworks for managing a data library. We should look at them and evaluate whether they can meet our requirements. Here are a few that I think are worth evaluating. Please suggest others if you are aware of them!
Invenio
https://invenio.readthedocs.io/en/latest/
Invenio is the framework that powers Zenodo
Globus Modern Research Data Portal
https://docs.globus.org/modern-research-data-portal/
Datahub
https://datahubproject.io/
This one is a bit of an outlier, in that it is aimed more at enterprise than academic research. But I really like its core metadata model. One advantage of Datahub is that it is cloud native from the start, so it does not assume that we have a bunch of files on a hard disk somewhere. It also looks easy to extend the metadata model.
Option 2: Build from Scratch
In this option, we do not use any framework. Instead, we roll our own. Components we would need to develop would include
Criteria for evaluation
For each possible framework, we should try to answer
I would like to see this evaluation completed by Friday May 6 so we can move forward quickly.