pangeo-data / pangeo-datastore

Pangeo Cloud Datastore
https://catalog.pangeo.io

design for pangeo cloud datastore service #1

Closed rabernat closed 3 years ago

rabernat commented 5 years ago

People seem generally very pleased with the experience of using zarr + xarray in the cloud environment. However, different groups are uploading more and more zarr data, and it is becoming hard to keep track of what is where.

Currently our approach within the Pangeo NSF-funded group is the following:

I feel that the time has come for a more organized approach. What I have in mind is a standalone "Pangeo Data Store" service that provides the following features:

I don't think this has to be very heavy-handed or complex. (We can likely recycle functionality from existing libraries--intake, pydap, etc.--for lots of this.) But I think it is a problem we have to solve as we move beyond the experimentation phase.

Relevant to many ongoing discussions, e.g. #420, #365, #502.

rabernat commented 5 years ago

@NicWayand suggested we might also want to obtain DOIs for the datasets.

kmpaul commented 5 years ago

I guess this is a related topic, so I'll post it here: this topic (i.e., something like "Data Management") could be one of the potential Pangeo Technical/Topical Subgroups, right?

I suppose we should wait for the Steering Council to decide what the technical groups should be, but @darothen and I would be very interested in being part of this subgroup, were it created. (See #507)

scottyhq commented 5 years ago

Thanks for opening up the discussion here, Ryan. Just wanted to draw attention to the STAC (SpatioTemporal Asset Catalog) effort. It is aimed at minimal metadata descriptions, with the long-term vision of having data discoverable by standard web search rather than by maintaining specialized servers.

Some implementations are also exploring ways to automatically update catalogs when new data appears in cloud buckets: https://github.com/fredliporace/cbers-2-stac

We are currently exploring ways to integrate STAC and Intake w/ NASA data: https://github.com/ContinuumIO/intake-datasets/issues/2#issuecomment-444273159

Could be interesting to try this out with the existing pangeo catalog (or just one of the zarr datasets for starters).
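For one of the existing zarr datasets, the STAC item could be tiny. Here's a rough sketch; the bucket path, media type, and field names are just illustrative, not a real catalog entry:

```python
import gcsfs
import xarray as xr

# Hypothetical STAC-style item describing a zarr store; everything here is
# illustrative, not copied from a real catalog.
item = {
    "id": "sea-surface-height",
    "properties": {"datetime": None},
    "assets": {
        "zarr-store": {
            "href": "gs://pangeo-data/dataset-duacs-rep-global-merged-allsat-phy-l4-v3-alt",
            "type": "application/vnd+zarr",
        }
    },
}

# Open the asset directly with xarray + gcsfs, as we already do for zarr data.
fs = gcsfs.GCSFileSystem(token="anon")
store = fs.get_mapper(item["assets"]["zarr-store"]["href"])
ds = xr.open_zarr(store)
print(ds)
```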

rabernat commented 5 years ago

@scottyhq -- great idea to use STAC. This is the exact sort of project that I am hoping to leverage to build our catalogs.

Do you think the STAC spec is generic enough to accommodate all the pangeo use cases? What would be a concrete step we could take to evaluate STAC?

rabernat commented 5 years ago

I have created a new repo to hold my ideas and experiments related to a pangeo datastore: https://github.com/pangeo-data/pangeo-stac

jacobtomlinson commented 5 years ago

This is definitely an interesting area for us. I've been exploring the remote catalog functionality in intake, which would be another approach to this.

Whenever we chat about this in our team this xkcd always comes up. We should put some thought into an approach where we add pointers to other catalogs as well as hosting a pangeo one. Perhaps some kind of aggregator.
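As a rough sketch of the aggregator idea (all URLs made up), the top-level intake catalog could contain nothing but pointers to sub-catalogs maintained elsewhere:

```python
import intake

# Hypothetical "aggregator" catalog: each entry is itself a catalog maintained
# by another group, so the top level is just a set of pointers.
master_yaml = """
sources:
  pangeo:
    description: Datasets curated by the Pangeo project
    driver: intake.catalog.local.YAMLFileCatalog
    args:
      path: "https://example.org/pangeo/master.yaml"
  metoffice:
    description: Met Office data on AWS Earth
    driver: intake.catalog.local.YAMLFileCatalog
    args:
      path: "https://example.org/metoffice/catalog.yaml"
"""

with open("master.yaml", "w") as f:
    f.write(master_yaml)

cat = intake.open_catalog("master.yaml")
print(list(cat))            # ['pangeo', 'metoffice']
# list(cat.pangeo) would drill into the remote sub-catalog
```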

rabernat commented 5 years ago

@jacobtomlinson - glad to hear you are interested! Your engineering expertise would be very valuable here.

The bottom line for me is that our catalog must

STAC addresses both of these issues. I am confident we will be able to translate STAC to intake (see https://github.com/ContinuumIO/intake/issues/224#issuecomment-451964568), but the reverse is not true.

Beyond that, I am still in the whiteboarding stage. This is my latest monstrosity:

[schematic diagram]

I see a bot like opsdroid playing a big role in this. It would be great if the catalog could just be static JSON files in a git repository. But, given the complexity, I'm not sure what the right software architecture is to manage this. I would love to get your thoughts.
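To make the static-files idea a bit more concrete, the bot (or plain CI) could run a small check like the sketch below over the JSON files in the repository before a PR is merged; the directory name and required keys are purely illustrative:

```python
import json
import sys
from pathlib import Path

# Keys every catalog file must carry (illustrative only).
REQUIRED_KEYS = {"id", "description", "links"}

def validate_catalog_file(path: Path) -> list:
    """Return a list of problems found in one JSON catalog file."""
    try:
        doc = json.loads(path.read_text())
    except json.JSONDecodeError as err:
        return [f"{path}: invalid JSON ({err})"]
    if not isinstance(doc, dict):
        return [f"{path}: top level is not a JSON object"]
    missing = REQUIRED_KEYS - doc.keys()
    return [f"{path}: missing keys {sorted(missing)}"] if missing else []

if __name__ == "__main__":
    issues = []
    for f in Path("catalog").rglob("*.json"):
        issues.extend(validate_catalog_file(f))
    for issue in issues:
        print(issue)
    sys.exit(1 if issues else 0)
```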

rabernat commented 5 years ago

Also note that STAC catalogs can be nested (like intake ones). This would support our federated philosophy. We could have one "master" pangeo catalog that pointed to any number of sub-catalogs maintained independently by different groups.
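In STAC terms the nesting is just "child" links in a root catalog. A hypothetical layout (invented ids and URLs, version number only indicative) might look like this:

```python
import json

# Hypothetical root catalog that only points to sub-catalogs maintained by
# different groups; "child" links are how STAC expresses the nesting.
root = {
    "stac_version": "0.6.0",
    "id": "pangeo-master",
    "description": "Top-level Pangeo catalog; each child is maintained independently",
    "links": [
        {"rel": "self", "href": "https://example.org/pangeo/catalog.json"},
        {"rel": "child", "href": "https://example.org/ncar/catalog.json"},
        {"rel": "child", "href": "https://example.org/metoffice/catalog.json"},
    ],
}

with open("catalog.json", "w") as f:
    json.dump(root, f, indent=2)
```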

jacobtomlinson commented 5 years ago

That sounds like a pretty neat workflow, and something we could definitely implement with opsdroid. STAC also looks good.

From a technical viewpoint that all looks perfectly do-able. My preference is also version controlled static files.

There are other questions I would have from a governance point of view, such as commitments from the Pangeo group (how long should we agree to host data?) and the criteria for accepting data. For example, we are publishing MO data on AWS Earth and have agreed that AWS will review this arrangement every two years to see whether both parties wish to continue. The bot could always open a new issue after a certain amount of time to discuss dataset retirement.

The other thing would be signposting existing data, like AWS Earth. We could maintain a STAC catalog for our data, which could be pointed to by Pangeo. However, what about people who are publishing data but not using STAC?

rabernat commented 5 years ago

Your governance questions are good ones.

In the long term, I don't imagine that "Pangeo" as currently constituted will become a primary data provider. As an organization, we are too amorphous to make that sort of commitment.

Instead, my intended audience for this effort is organizations like NCAR, MO, Lamont, NASA, or companies like ClimaCell, Jupyter Intel, who already host data and are convinced by our arguments that they should be hosting data in the cloud in zarr format. I think we have persuaded many people that this is a good idea in principle, but we don't have a good answer to the question, "now what?" By creating a datastore product, we will have something concrete to point such organizations to.

As for how to pay for it, I imagine that some catalogs will be part of public dataset programs on AWS and GC, while others will be paid by the organizations.

What I want to do here is create a product that will solve the problem of managing these datasets.

jacobtomlinson commented 5 years ago

Ok that's interesting. I'm seeing two different strands in this conversation. One is about creating a way of automatically approving and ingesting new datasets into some cloud provider using a bot and GitHub. The other is about creating some standard tooling for managing datasets (including these new ones) for the Pangeo community. Perhaps it would be good to separate those concerns.

I can also imagine a couple of different use cases. One would be large institute-level initiatives to publish data. For example, NCAR may want to publish something: they are happy to pay for it, we want to access it from Pangeo, and we need a way to join up the dots, so your cataloguing implementation makes sense. The other use case would be an individual researcher (potentially also at NCAR, for example) who has some data they want to publish but doesn't have access to the skills or funding to do so. In that case I see your ingestion workflow also being useful.

rabernat commented 5 years ago

I would like to transfer this issue to the new https://github.com/pangeo-data/pangeo-datastore repository. Any objections?

martindurant commented 5 years ago

POC code in https://github.com/NCAR/intake-siphon/pull/2/files shows that, generally speaking, it is easy to coerce well-defined catalogue protocols to intake; in that case, THREDDS links via siphon (siphon is basically a thin layer that decodes XML and could have been done by hand).
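The siphon side of that coercion really is small. Here is a rough sketch (catalog URL illustrative) of walking a THREDDS catalog and collecting the OPeNDAP URLs that an intake driver could expose as entries:

```python
from siphon.catalog import TDSCatalog

def opendap_urls(url, depth=1):
    """Walk a THREDDS catalog and collect OPeNDAP access URLs.

    Any THREDDS catalog.xml should work; the example URL below is illustrative.
    """
    cat = TDSCatalog(url)
    urls = [ds.access_urls["OPENDAP"]
            for ds in cat.datasets.values()
            if "OPENDAP" in ds.access_urls]
    if depth > 0:
        # Recurse one level into referenced sub-catalogs.
        for ref in cat.catalog_refs.values():
            urls.extend(opendap_urls(ref.href, depth - 1))
    return urls

# e.g. urls = opendap_urls("https://thredds.example.org/thredds/catalog/catalog.xml")
```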

scottyhq commented 5 years ago

Just posted a Binder demo with some initial experiments with STAC, intake, and Landsat data: https://github.com/scottyhq/stac-intake-landsat

An important point is that a STAC catalog is very generic - could describe a global gridded n-D zarr dataset from modeling, or could describe millions of images from various sensors in different projections and resolutions.

I like the idea of pre-configured searches that give back a set of compatible STAC items that can be loaded directly into an xarray DataArray or DataSet. Maybe with sat-api: https://github.com/sat-utils/sat-api
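A "pre-configured search" could be as small as a saved request body posted to a STAC search endpoint such as a sat-api deployment; the endpoint URL and query values below are placeholders, but the fields follow the STAC search spec:

```python
import requests

# Saved ("pre-configured") STAC search; endpoint and values are placeholders.
SEARCH_URL = "https://example.org/stac/search"   # e.g. a sat-api deployment
landsat_pnw_2018 = {
    "bbox": [-124.8, 45.5, -116.9, 49.0],
    "time": "2018-01-01/2018-12-31",
    "query": {"eo:cloud_cover": {"lt": 10}},
    "limit": 500,
}

resp = requests.post(SEARCH_URL, json=landsat_pnw_2018)
items = resp.json().get("features", [])
print(f"{len(items)} matching STAC items")
```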

For Landsat it's pretty straightforward to get the temporal stack of colocated images for a specific Path and Row. That is enough for many applications, but if someone wants to study a regional process, it would be great to have a way to construct virtual mosaics or composites from many scenes. Do we want intake to do that behind the scenes? Another example: what if someone tries to load Landsat Band 1 (30 m) and Band 8 (15 m)? The image resolutions are optional metadata in STAC, so intake could just raise a 'not allowed' error, or the plugin could do the necessary resampling.
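For the colocated case, a rough sketch of stacking per-date scenes into one xarray object, assuming xarray's rasterio-backed reader and made-up asset URLs; mixing 30 m and 15 m bands would need explicit resampling before a concat like this:

```python
import pandas as pd
import xarray as xr

# Hypothetical band-4 COG URLs for colocated scenes (same Path/Row), one per date.
scenes = {
    "2018-06-01": "https://example.org/landsat/LC08_047027_20180601_B4.TIF",
    "2018-07-03": "https://example.org/landsat/LC08_047027_20180703_B4.TIF",
}

# Because the scenes share a grid, they can simply be concatenated along a
# new time dimension.
stack = xr.concat(
    [xr.open_rasterio(url, chunks={"x": 1024, "y": 1024}) for url in scenes.values()],
    dim=pd.DatetimeIndex(list(scenes), name="time"),
)
print(stack)
```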

Would be great to get feedback from @apawloski on this!

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] commented 3 years ago

This issue has been automatically closed because it had not seen recent activity. The issue can always be reopened at a later date.