pangeo-data / pangeo-datastore

Pangeo Cloud Datastore
https://catalog.pangeo.io

STAC and other Prior Art #3

Closed rabernat closed 3 years ago

rabernat commented 5 years ago

@rsignell-usgs mentioned that there are already a lot of standards / catalogs / services etc. in this space. It would be useful to enumerate these so we don't go about re-inventing the wheel too much.

I'll kick things off by linking to @cholmes's blog posts about why they decided to invent STAC:

A STAC Static catalog:

A static catalog is an implementation of the STAC specification that does not respond dynamically to requests - it is simply a set of files on a web server that link to one another in a way that can be crawled. A static catalog can only really be crawled by search engines and active catalogs; it cannot respond to queries. But it is incredibly reliable, as there are no moving parts, no clusters or databases to maintain. The goal of STAC is to expose as much asset metadata online as possible, so the static catalog offers a very low barrier to entry for anyone with geospatial assets to make their data searchable.
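
For concreteness, here is a rough, illustrative sketch of the "set of linked files" idea. The field names follow the general STAC pattern, but this is not a validated STAC document and the paths are made up.

```python
import json

# Root document of a hypothetical static catalog: a plain JSON file on a web
# server whose "links" point at child catalogs and item (dataset) documents.
catalog = {
    "id": "example-catalog",
    "description": "Root of a static catalog served as plain files",
    "links": [
        {"rel": "child", "href": "ocean/catalog.json"},           # nested sub-catalog
        {"rel": "item", "href": "ocean/sea-surface-temp.json"},   # one dataset's metadata
    ],
}
print(json.dumps(catalog, indent=2))
```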

rabernat commented 5 years ago

@rsignell-usgs - when the government reopens (or on your "personal time"), it would be great if you could link us to some other standards you think we should be paying attention to.

rabernat commented 5 years ago

THREDDS catalog XML SPEC: https://www.unidata.ucar.edu/software/thredds/v4.6/tds/catalog/InvCatalogSpec.html

Process for adding new types of data:

You can also use your own scientific file format; send them to us and we will add them to this list.

https://www.unidata.ucar.edu/software/thredds/v4.6/tds/catalog/InvCatalogSpec.html#Enumerations

rsignell-usgs commented 5 years ago

@rabernat, okay, here goes!

We've been using the Open Geospatial Consortium's Catalog Service for the Web (CSW) for several years for cataloging here at USGS CMG and also for the IOOS (Integrated Ocean Observing System).

Both the service we use for distributing model output (THREDDS) and the service we use for distributing sensor data (ERDDAP) can generate ISO 19115-2 metadata records on the fly for each dataset, and the collection of ISO metadata records is harvested into pycsw, which we use to provide the CSW service. We can then perform complex query operations using CSW and find metadata records with embedded data service links.
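
For readers unfamiliar with CSW, a minimal sketch of this kind of query using owslib might look like the following; the endpoint URL and search term are placeholders, not a specific USGS or IOOS service.

```python
from owslib.csw import CatalogueServiceWeb
from owslib.fes import PropertyIsLike

# Connect to a (placeholder) CSW endpoint and run a free-text query.
csw = CatalogueServiceWeb("https://example.org/csw")
query = PropertyIsLike("csw:AnyText", "%sea_water_temperature%")
csw.getrecords2(constraints=[query], maxrecords=10)

# Each returned record carries a title, abstract, and embedded data service links.
for identifier, record in csw.records.items():
    print(identifier, record.title)
```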

This allows us to generate workflows that automatically pick up new measurements and models as they become available. I gave a short (5 min) lightning talk at SciPy 2016 on "catalog driven workflows" that shows the whole system! 😸

The pycsw page is a good place to visit if you want to dig deeper: https://pycsw.org

Here's a simple example in a Jupyter notebook: http://ioos.github.io/notebooks_demos/notebooks/2017-12-15-finding_HFRadar_currents/

We also have a couple of papers demonstrating the power of this approach:

What @apawloski and I were exploring at the Pangeo Developers' meeting last year was creating ISO metadata records from xarray objects and allowing the GCS or S3 datasets behind them to be discoverable using the same approach.

rabernat commented 5 years ago

Rich, unfortunately that YouTube video seems corrupted. I can't get it to play.

rabernat commented 5 years ago

So I am trying to take seriously the suggestion that @rsignell-usgs continues to pose: that we catalog our cloud-based datasets via ISO 19115-2 metadata records and OGC CSW services.

The first step towards this is for me to educate myself about the specs themselves. The first hurdle I have hit is that ISO 19115-2 is not free! It costs $200 to even read it. Does anyone have a copy of this document that they can share? It makes me quite uncomfortable to standardize around a spec that you have to pay to read. Is this really how things work in the ISO world?

OGC CSW is a free and open spec. However, I find it intimidatingly complex. Furthermore, based on my understanding, it describes a query service: to have an OGC CSW catalog, you need a server which provides an API to respond to queries. (Please correct me if I am misinterpreting this.) In contrast, with both THREDDS XML and STAC, the catalog can be a static file.

pycsw seems like a good product. Based on my reading of the docs, it seems like it could provide a CSW-compliant cataloging service. It would be an additional service to stand up (and maintain, and pay for), but it doesn't look too hard. The crux seems to be loading records. I don't understand how we will generate these records from our existing and future cloud datasets.

I suppose that is the exact project that @rsignell-usgs and @apawloski were working on. Do you have any example code that you can point us to for how this might work?

rsignell-usgs commented 5 years ago

@rabernat , I discussed generating ISO records from xarray objects with @kwilcox (Kyle Wilcox) yesterday, and he thought it would be straightforward to implement. He does a lot of python work for IOOS, and he's worked on owslib, which facilitates interaction with CSW.
Kyle, can you give us an assessment of what it would take to implement?

(also I just tried the youtube link and it worked okay for me)

kwilcox commented 5 years ago

I'd recommend against generating ISO records from xarray objects; it would open up too many opinions at the xarray level on how to map things into ISO. I wouldn't take up the task of generating ISO records at all. I would map xarray to the cf-json spec (minus the data) so you get a standardized JSON metadata record that can be created from any xarray object, but also by other things (nco supports cf-json as output). Take that cf-json and map it to whatever format your catalog choice uses, adding in the service-level metadata information then. I'd be interested in helping out with something like this; we already use cf-json for all of our metadata descriptions and have code to round-trip netCDF4 and cf-json.
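
To make the idea concrete, here is a rough sketch of pulling an xarray Dataset's structure and attributes into a JSON-serializable record while dropping the data values. This is not the cf-json/nco-json spec itself; the field names are made up for illustration.

```python
import json
import xarray as xr

def dataset_metadata(ds: xr.Dataset) -> dict:
    """Describe dims, variables, and attributes of a Dataset, minus the data."""
    return {
        "attributes": dict(ds.attrs),
        "dimensions": {name: int(size) for name, size in ds.sizes.items()},
        "variables": {
            name: {
                "dimensions": list(var.dims),
                "dtype": str(var.dtype),
                "attributes": dict(var.attrs),
            }
            for name, var in ds.variables.items()
        },
    }

ds = xr.tutorial.open_dataset("air_temperature")  # any Dataset would do here
print(json.dumps(dataset_metadata(ds), indent=2, default=str)[:500])
```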

Regarding the catalog implementation... it depends on what your requirements are. Do you need service side filtering and indexing? Full-text search? How will users interact with the catalog - API? Website? Code? Notebooks? Might be a bigger discussion?

rabernat commented 5 years ago

@kwilcox this is super helpful! I wish I had known about cf-json when I implemented this PR in xarray. It is exactly what I was looking for at the time.

Can you explain the relationship, if any, between cf-json and netcdf-ld?

As for the catalog, for now we are just looking for a static catalog, i.e. a text file that can be parsed by humans and machines. We will eventually want to load it into intake for ingestion in python.
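
As a sketch of that end state (the catalog URL and entry name below are hypothetical, and this assumes intake plus intake-xarray are installed):

```python
import intake

# Open a static YAML catalog sitting on a web server and list its entries.
cat = intake.open_catalog("https://example.org/pangeo-catalog/master.yaml")
print(list(cat))

# Lazily open one (hypothetical) zarr-backed entry as an xarray Dataset.
ds = cat["some_zarr_dataset"].to_dask()
```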

rabernat commented 5 years ago

Also, you might be interested in the parallel discussion happening in https://github.com/radiantearth/stac-spec/pull/361, about how to represent "data cubes" using STAC.

kwilcox commented 5 years ago

I see cf-json as a lossless JSON serialized dataset and json-ld as a lossy JSON serialized dataset for a specific purpose. json-ld could be computed from cf-json, since cf-json always represents the entire dataset. Disclaimer: I have spent very little time in the linked-data world.

dopplershift commented 5 years ago

Just to be clear, netcdf-ld and json-ld are distinct things.

m-mohr commented 5 years ago

Has anybody experience with CovJSON? https://covjson.org/

rsignell-usgs commented 5 years ago

There is a discussion of these various JSON representations of NetCDF/CF metadata here: https://github.com/covjson/specification/issues/86 and note in particular the comment of @BobSimons: https://github.com/covjson/specification/issues/86#issuecomment-318405599 where he argues that CovJSON and nco-json are both useful because they play different roles, but we don't also need cf-json.

kwilcox commented 5 years ago

When I say cf-json I really mean nco-json :grimacing: (which need to be merged; the typing is super important), thanks for clarifying! I agree with @BobSimons. I don't think the argument is whether nco-json or covjson or STAC (json) or netcdf-ld are useful formats... they all are... but are any of them a solution to the problem at hand: a static catalog file describing an N-dimensional dataset?

m-mohr commented 5 years ago

As long as the file must hold the actual data, as seems to be the case for cf-json or CovJSON, it doesn't seem to be an appropriate catalog (metadata) format. So either STAC (note: I may be biased as a STAC contributor) with the upcoming datacube extension or netcdf-ld sounds more appropriate for a catalog file.

kwilcox commented 5 years ago

:+1:, I've started following the STAC datacube extension conversation. Just a note: while cf-json requires the data field, nco-json does not. I'll check in with cf-json about being nco-json compatible and see if we can get those to be the same thing...

lewismc commented 5 years ago

Excellent conversation folks! @jonblower FYI

@m-mohr

As long as the file must hold the actual data, as seems to be the case for cf-json or CovJSON, it doesn't seem to be an appropriate catalog (metadata) format.

I am not sure about cf-json, but for CovJSON, encoding of the actual range array values can be achieved by representing them in a separate document in a more efficient format. Various formats for the range are possible, but one attractive possibility is CovJSON itself, which provides a JSON encoding for a standalone multidimensional array (this can be compressed during data transfer for much greater efficiency). It may also, of course, be possible to use binary formats like NetCDF for this purpose. But many of these formats encode the full coverage (not just the range), and care must be taken to ensure that the RDF representation of the domain is consistent with that in the linked file.

It is my understanding that use cases demonstrating this behavior are not overly common; however, the CovJSON specification does accommodate this in principle.

lewismc commented 5 years ago

Also, folks, for more discussion on the CovJSON side you will want to consult http://ceur-ws.org/Vol-1777/paper2.pdf.

jonblower commented 5 years ago

Thanks @lewismc for bringing me in here. I'm one of the authors of the CovJSON spec so happy to answer questions on this. Lewis is right that CovJSON was designed to accommodate the possibility of having metadata and data in separate files (which can be linked of course). The CoverageCollection object might answer @kwilcox's need for "a static catalog file describing an N dimensional dataset".

One thing to bear in mind is the kind of metadata you want in your catalogue. CovJSON contains the same kind of metadata as a NetCDF file, e.g. a detailed description of the domain of the data, including the exact form of all the spatiotemporal axes, CRS definitions, variable definitions etc. Currently it does not contain "summary" metadata, such as the rough spatiotemporal bounding box, which can be useful for discovery. (However such information could be deduced from the CovJSON file.)
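
For readers new to CovJSON, here is a heavily simplified sketch of the kind of metadata such a document carries. This is not a complete or spec-validated example; see https://covjson.org/ for the real format.

```python
# Illustrative shape of a CovJSON-style coverage description (simplified).
coverage = {
    "type": "Coverage",
    "domain": {
        "type": "Domain",
        "domainType": "Grid",
        "axes": {
            "x": {"start": -179.5, "stop": 179.5, "num": 360},  # longitude axis
            "y": {"start": -89.5, "stop": 89.5, "num": 180},    # latitude axis
            "t": {"values": ["2019-01-01T00:00:00Z"]},          # time axis
        },
        "referencing": [],  # CRS definitions would go here
    },
    "parameters": {
        "SST": {"type": "Parameter", "unit": {"symbol": "K"}},  # variable definition
    },
    # "ranges" would hold (or link to) the actual data values.
}
```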

talldave commented 4 years ago

Reviving this discussion to see if there are any decisions or further considerations that have arisen since last Feb. I'm building a data catalog now for climate/xarray type data. I'm most familiar with STAC, but have also started dabbling with Intake. And now I need to read up on all the *json systems mentioned above.

Is there any new consensus on how pangeo-data will be implementing its data catalog?

rsignell-usgs commented 4 years ago

@kwilcox, are you still advocating for nco-json as the "common metadata format" and then tools to convert nco-json to other metadata conventions like STAC or ISO?

kwilcox commented 4 years ago

I advocate for nco-json as the common metadata representation of an xarray/netcdf4 dataset and for said packages to implement an export function to the format. An additional set of mappings from nco-json into (whatever) catalog format would be required. I originally brought up nco-json because it provides a nice way of abstracting where the dataset object came from. If the problem at hand is creating a catalog of xarray-compatible datasets, the catalog format could be just as easily (or more easily) computed from an xarray dataset. You would have the benefit of being able to compute spatial/temporal bounds on export at that point... something nco-json can't do without relying on some convention for reading those attributes.
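
A quick sketch of what computing those bounds from an xarray Dataset might look like; the coordinate names ("lon", "lat", "time") are assumptions, since real datasets may name coordinates differently or require CF conventions to identify them.

```python
import xarray as xr

def spatiotemporal_bounds(ds: xr.Dataset) -> dict:
    """Min/max of assumed lon/lat/time coordinates, suitable for a catalog record."""
    return {
        "lon": [float(ds["lon"].min()), float(ds["lon"].max())],
        "lat": [float(ds["lat"].min()), float(ds["lat"].max())],
        "time": [str(ds["time"].min().values), str(ds["time"].max().values)],
    }

ds = xr.tutorial.open_dataset("air_temperature")  # example dataset with lon/lat/time
print(spatiotemporal_bounds(ds))
```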

My cataloging requirements are often at a higher level, and I need to capture many different access points for the same data package. For example, a zarr data access url, analyses on the data output as static PNG images, a Jupyter notebook showing usage of the data, and a presentation given about the data. nco-json is only going to cover metadata about the first (the zarr data access url); the rest are implemented at the catalog spec level (STAC).

rabernat commented 4 years ago

Reviving this discussion to see if there are any decisions or further considerations that have arisen since last Feb.

There have been a lot of developments. To handle the CMIP6 data in the cloud for the CMIP6 hackathon, we hacked together the ESM collection spec: https://github.com/NCAR/esm-collection-spec/

This is not STAC, but it is inspired by STAC. The hope is that we can eventually find a way to merge with STAC.

Right now, the ESM collection spec couples very tightly with intake-esm: https://intake-esm.readthedocs.io/. Some basic usage examples can also be found at https://discourse.pangeo.io/t/using-ocean-pangeo-io-for-the-cmip6-hackathon/291
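
A rough sketch of how intake-esm consumes such a collection file; the collection URL and search facets below are illustrative placeholders, and this assumes intake-esm is installed.

```python
import intake

# Open an ESM collection spec file and search it by facets.
col = intake.open_esm_datastore("https://example.org/cmip6-collection.json")
subset = col.search(experiment_id="historical", variable_id="tas")

# Load the matching stores as a dict of xarray Datasets, keyed by facet combination.
dsets = subset.to_dataset_dict()
```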

cc @andersy005 and @matt-long, who were instrumental in the development of ESM collection spec.

rabernat commented 4 years ago

Reviving this discussion to see if there are any decisions or further considerations that have arisen since last Feb. I'm building a data catalog now for climate/xarray type data. I'm most familiar with STAC, but have also started dabbling with Intake. And now I need to read up on all the *json systems mentioned above.

@talldave - we have hacked together something called the ESM collection spec: https://github.com/NCAR/esm-collection-spec/. It is STAC-like, but it is its own thing. We came up with it to catalog the CMIP6 cloud data (https://pangeo-data.github.io/pangeo-datastore/cmip6_pangeo.html). Some flavor of it is implemented by intake-esm.

We would love to have more people working on this, and we welcome your involvement.

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] commented 3 years ago

This issue has been automatically closed because it had not seen recent activity. The issue can always be reopened at a later date.