Closed by @rabernat 3 years ago
@rsignell-usgs - when the government reopens (or on your "personal time"), it would be great if you could link us to some other standards you think we should be paying attention to.
THREDDS catalog XML SPEC: https://www.unidata.ucar.edu/software/thredds/v4.6/tds/catalog/InvCatalogSpec.html
Process for adding new types of data (from the spec): you can also use your own scientific file format; send it to us and we will add it to this list:
https://www.unidata.ucar.edu/software/thredds/v4.6/tds/catalog/InvCatalogSpec.html#Enumerations
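For reference, a minimal THREDDS catalog following that spec looks roughly like this (the service, dataset name, and path are made up for illustration):

```xml
<catalog name="Example Catalog"
         xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0">
  <!-- A service endpoint that datasets can reference by name -->
  <service name="odap" serviceType="OPeNDAP" base="/thredds/dodsC/"/>
  <dataset name="Sea Surface Temperature" ID="sst-example" urlPath="example/sst.nc">
    <serviceName>odap</serviceName>
    <dataType>Grid</dataType>
  </dataset>
</catalog>
```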
@rabernat, okay, here goes!
We've been using the Open Geospatial Consortium's Catalog Service for the Web (CSW) for several years for cataloging here at USGS CMG and also for the IOOS (Integrated Ocean Observing System).
Both the service we use for distributing model output (THREDDS) and the service we use for distributing sensor data (ERDDAP) can generate ISO 19115-2 metadata records on the fly for each dataset, and the collection of ISO metadata records is harvested into pycsw, which we use to provide the CSW service. We can then perform complex query operations using CSW and find metadata records with embedded data service links.
This allows us to build workflows that automatically pick up new measurements and model results as they become available. I gave a short (5 min) lightning talk at SciPy 2016 on "catalog driven workflows" that describes the whole system! 😸
The pycsw page is a good place to visit if you want to dig deeper: https://pycsw.org
Here's a simple example in a Jupyter notebook: http://ioos.github.io/notebooks_demos/notebooks/2017-12-15-finding_HFRadar_currents/
We also have a couple of papers demonstrating the power of this approach:
What @apawloski and I were exploring at the Pangeo Developers' meeting last year was creating ISO metadata records from xarray objects, allowing the GCS- or S3-backed datasets behind them to be discoverable using the same approach.
Rich, unfortunately that YouTube video seems corrupted; I can't get it to play.
So I am trying to take seriously the suggestion that @rsignell-usgs continues to pose: that we catalog our cloud-based datasets via ISO 19115-2 metadata records and OGC CSW services.
The first step towards this is for me to educate myself about the specs themselves. The first hurdle I have hit is that ISO 19115-2 is not free! It costs $200 to even read it. Does anyone have a copy of this document that they can share? It makes me quite uncomfortable to standardize around a spec that you have to pay to read. Is this really how things work in the ISO world?
OGC CSW is a free and open spec. However, I find it intimidatingly complex. Furthermore, based on my understanding, it describes a query service: to have an OGC CSW catalog, you need a server which provides an API to respond to queries. (Please correct me if I am misinterpreting this.) In contrast, with both THREDDS XML and STAC, the catalog can be a static file.
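To make the "query service" point concrete: a CSW search is an HTTP POST of an XML GetRecords request to the server. A minimal free-text search looks roughly like this (sketch based on CSW 2.0.2; the search term is just an example):

```xml
<csw:GetRecords xmlns:csw="http://www.opengis.net/cat/csw/2.0.2"
                xmlns:ogc="http://www.opengis.net/ogc"
                service="CSW" version="2.0.2" resultType="results">
  <csw:Query typeNames="csw:Record">
    <csw:ElementSetName>full</csw:ElementSetName>
    <csw:Constraint version="1.1.0">
      <ogc:Filter>
        <!-- Free-text match against all record fields -->
        <ogc:PropertyIsLike wildCard="%" singleChar="_" escapeChar="\">
          <ogc:PropertyName>csw:AnyText</ogc:PropertyName>
          <ogc:Literal>%sea surface temperature%</ogc:Literal>
        </ogc:PropertyIsLike>
      </ogc:Filter>
    </ogc:Constraint>
  </csw:Query>
</csw:GetRecords>
```

So unlike a static THREDDS or STAC file, there must be a running server to parse this request and return matching records.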
pycsw seems like a good product. Based on my reading of the docs, it seems like it could provide a CSW-compliant cataloging service. It would be an additional service to stand up (and maintain, and pay for), but it doesn't look too hard. The crux seems to be loading records. I don't understand how we will generate these records from our existing and future cloud datasets.
I suppose that is the exact project that @rsignell-usgs and @apawloski were working on. Do you have any example code that you can point us to for how this might work?
@rabernat, I discussed generating ISO records from xarray objects with @kwilcox (Kyle Wilcox) yesterday, and he thought it would be straightforward to implement. He does a lot of Python work for IOOS, and he's worked on owslib, which facilitates interaction with CSW.
Kyle, can you give us an assessment of what it would take to implement?
(also I just tried the youtube link and it worked okay for me)
I'd recommend against generating ISO records from xarray objects; it will open up too many opinions at the xarray level on how to map things into ISO. I wouldn't take up the task of generating ISO records at all. I would map xarray to the cf-json spec (minus the data) so you get a standardized JSON metadata record that can be created from any xarray object but also by other things (nco supports cf-json as output). Take that cf-json and map it to whatever format your catalog choice uses, adding in the service-level metadata information then. I'd be interested in helping out with something like this; we already use cf-json for all of our metadata descriptions and have code to round-trip netCDF4 and cf-json.
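For readers unfamiliar with the format: a cf-json/nco-json-style document captures the dimensions, variables, and attributes of a netCDF/xarray dataset as plain JSON. The snippet below hand-builds a small metadata-only record; the field names follow the published cf-json examples, but treat this as an illustrative sketch rather than a validated instance of either spec:

```python
import json

# Hand-built metadata record in the cf-json/nco-json style (a sketch;
# field names follow the cf-json examples, not a validated instance).
record = {
    "dimensions": {"time": 4, "lat": 180, "lon": 360},
    "attributes": {  # global attributes
        "Conventions": "CF-1.6",
        "title": "Example sea surface temperature analysis",
    },
    "variables": {
        "sst": {
            "shape": ["time", "lat", "lon"],
            "type": "float",
            "attributes": {
                "standard_name": "sea_surface_temperature",
                "units": "degC",
            },
            # cf-json requires a "data" field; nco-json lets you omit it,
            # which is what you want for a metadata-only catalog record.
        },
    },
}

doc = json.dumps(record, indent=2)
print(doc)
```

A catalog builder would then map a record like this into whatever catalog format is chosen, adding service-level metadata (access URLs, etc.) at that stage.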
Regarding the catalog implementation... it depends on what your requirements are. Do you need service side filtering and indexing? Full-text search? How will users interact with the catalog - API? Website? Code? Notebooks? Might be a bigger discussion?
@kwilcox this is super helpful! I wish I had known about cf-json when I implemented this PR in xarray. It is exactly what I was looking for at the time.
Can you explain the relationship, if any, between cf-json and netcdf-ld?
As for the catalog, for now we are just looking for a static catalog, i.e. a text file that can be parsed by humans and machines. We will eventually want to load it into intake for ingestion in python.
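For the static-file use case, an intake catalog is just YAML. A minimal sketch with a hypothetical zarr store (the bucket URL and entry name are invented; the `zarr` driver comes from the intake-xarray plugin):

```yaml
sources:
  sea_surface_height:
    description: Gridded sea surface height (example entry)
    driver: zarr
    args:
      urlpath: gs://example-bucket/sea-surface-height.zarr
```

A file like this is readable by humans, diffable in version control, and loadable by intake for ingestion in Python.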
Also, you might be interested in the parallel discussion happening in https://github.com/radiantearth/stac-spec/pull/361, about how to represent "data cubes" using STAC.
I see cf-json as a lossless JSON-serialized dataset and json-ld as a lossy JSON-serialized dataset for a specific purpose. json-ld could be computed from cf-json, since cf-json always represents the entire dataset. Disclaimer: I have spent very little time in the linked-data world.

Just to be clear, netcdf-ld and json-ld are distinct things.
Has anybody experience with CovJSON? https://covjson.org/
There is a discussion of these various JSON representations of NetCDF/CF metadata here:
https://github.com/covjson/specification/issues/86
and note in particular the comment of @BobSimons: https://github.com/covjson/specification/issues/86#issuecomment-318405599
where he argues that CovJSON and nco-json are both useful because they play different roles, but we don't also need cf-json.
When I say cf-json I really mean nco-json :grimacing: (the two need to be merged; the typing is super important). Thanks for clarifying! I agree with @BobSimons. I don't think the argument is whether nco-json, covjson, STAC (json), or netcdf-ld are useful formats... they all are... but whether any of them is a solution to the problem at hand: a static catalog file describing an N-dimensional dataset.
As long as the format requires the actual data to be held in the JSON file, as seems to be the case for cf-json and CovJSON, it doesn't seem appropriate as a catalog (metadata) file. So either STAC (note: I may be biased as a STAC contributor) with the upcoming datacube extension or netcdf-ld sounds more appropriate for a catalog file.
:+1:, I've started following the STAC datacube extension conversation. Just a note: while cf-json requires the data field, nco-json does not. I'll check in with cf-json about being nco-json compatible and see if we can get those to be the same thing...
Excellent conversation, folks! @jonblower FYI
@m-mohr
> As long as the file (must) hold the actual data in the JSON file as it seems to be the case for cf-json or CovJSON, it doesn't seem to be an appropriate catalog (metadata) file.
I am not sure about cf-json, but in CovJSON, encoding of the actual range array values can be achieved by representing them in a separate document in a more efficient format. Various formats for the range are possible, but one attractive possibility is CovJSON itself, which provides a JSON encoding for a standalone multidimensional array (this can be compressed during data transfer for much greater efficiency). It may also, of course, be possible to use binary formats like NetCDF for this purpose. But many of these formats encode the full coverage (not just the range), and care must be taken to ensure that the representation of the domain is consistent with that in the linked file.
It is my understanding that use cases demonstrating this behavior are not overly common; however, the CovJSON specification does accommodate this in principle.
Also, folks, for more discussion on the CovJSON side you will want to consult http://ceur-ws.org/Vol-1777/paper2.pdf.
Thanks @lewismc for bringing me in here. I'm one of the authors of the CovJSON spec so happy to answer questions on this. Lewis is right that CovJSON was designed to accommodate the possibility of having metadata and data in separate files (which can be linked of course). The CoverageCollection object might answer @kwilcox's need for "a static catalog file describing an N dimensional dataset".
One thing to bear in mind is the kind of metadata you want in your catalogue. CovJSON contains the same kind of metadata as a NetCDF file, e.g. a detailed description of the domain of the data, including the exact form of all the spatiotemporal axes, CRS definitions, variable definitions etc. Currently it does not contain "summary" metadata, such as the rough spatiotemporal bounding box, which can be useful for discovery. (However such information could be deduced from the CovJSON file.)
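To illustrate the shape of that metadata, here is a minimal grid-domain Coverage in the CovJSON style, built as a plain Python dict. The axis values, parameter name, and data values are invented; consult covjson.org for the normative structure, including how ranges can live in separate linked documents:

```python
import json

# Minimal CovJSON-style Coverage (grid domain). All concrete values are
# invented for illustration; see covjson.org for the spec.
coverage = {
    "type": "Coverage",
    "domain": {
        "type": "Domain",
        "domainType": "Grid",
        "axes": {
            "x": {"values": [-10.0, 0.0, 10.0]},
            "y": {"values": [40.0, 50.0]},
            "t": {"values": ["2019-01-01T00:00:00Z"]},
        },
        "referencing": [{
            "coordinates": ["x", "y"],
            "system": {
                "type": "GeographicCRS",
                "id": "http://www.opengis.net/def/crs/OGC/1.3/CRS84",
            },
        }],
    },
    "parameters": {
        "SST": {
            "type": "Parameter",
            "observedProperty": {"label": {"en": "Sea surface temperature"}},
            "unit": {"symbol": "K"},
        },
    },
    # Per the spec, ranges like this one can also live in separate linked
    # documents, which is what makes the metadata/data split possible.
    "ranges": {
        "SST": {
            "type": "NdArray",
            "dataType": "float",
            "axisNames": ["t", "y", "x"],
            "shape": [1, 2, 3],
            "values": [290.1, 290.4, 290.8, 291.0, 291.3, 291.7],
        },
    },
}

print(json.dumps(coverage, indent=2))
```

Note that the domain section carries exactly the NetCDF-style detail Jon describes (axes, CRS, variable definitions), while a summary bounding box for discovery would have to be derived from it.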
Reviving this discussion to see if there are any decisions or further considerations that have arisen since last Feb. I'm building a data catalog now for climate/xarray type data. I'm most familiar with STAC, but have also started dabbling with Intake. And now I need to read up on all the *json systems mentioned above.
Is there any new consensus on how pangeo-data will be implementing data catalogs?
@kwilcox, are you still advocating for nco-json as the "common metadata format", with tools to convert nco-json to other metadata conventions like STAC or ISO?
I advocate for nco-json as the common metadata representation of an xarray/netcdf4 dataset and for said packages to implement an export function to the format. An additional set of mappings from nco-json into (whatever) catalog format would be required. I originally brought up nco-json because it provides a nice way of abstracting where the dataset object came from. If the problem at hand is creating a catalog of xarray-compatible datasets, the catalog format could just as easily (or more easily) be computed from an xarray dataset. You would have the benefit of being able to compute spatial/temporal bounds on export at that point... something nco-json can't do without relying on some convention for reading those attributes.
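Kyle's point about computing bounds on export can be sketched without any convention-sniffing if you export straight from the in-memory dataset. Here plain lists stand in for xarray coordinate arrays, and the function name is invented:

```python
def spatial_temporal_bounds(lons, lats, times):
    """Compute a bounding box and time interval from coordinate arrays,
    as a catalog exporter might do on the way out of an xarray dataset."""
    bbox = [min(lons), min(lats), max(lons), max(lats)]  # [W, S, E, N]
    # ISO 8601 strings of equal length sort chronologically, so min/max works.
    interval = [min(times), max(times)]
    return bbox, interval

# Stand-ins for dataset coordinates.
lons = [-75.0, -74.5, -74.0]
lats = [35.0, 35.5, 36.0]
times = ["2019-01-01T00:00:00Z", "2019-01-02T00:00:00Z"]

bbox, interval = spatial_temporal_bounds(lons, lats, times)
print(bbox, interval)
```

Deriving the same bounds from an nco-json record would instead require agreeing on which attributes (e.g. CF conventions) identify the spatial and temporal coordinates.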
My cataloging requirements are often at a higher level, and I need to capture many different access points for the same data package: for example, a zarr data access URL, analyses on the data output as static PNG images, a Jupyter notebook showing usage of the data, and a presentation given about the data. nco-json is only going to cover metadata about the first (the zarr data access URL); the rest are implemented at the catalog spec level (STAC).
> Reviving this discussion to see if there are any decisions or further considerations that have arisen since last Feb.
There have been a lot of developments. To handle the CMIP6 data in the cloud for the CMIP6 hackathon, we hacked together the ESM collection spec: https://github.com/NCAR/esm-collection-spec/
This is not STAC, but it is inspired by STAC. The hope is that we can eventually find a way to merge with STAC.
Right now, the ESM collection spec couples very tightly with intake-esm: https://intake-esm.readthedocs.io/. Some basic usage examples can also be found at https://discourse.pangeo.io/t/using-ocean-pangeo-io-for-the-cmip6-hackathon/291
cc @andersy005 and @matt-long, who were instrumental in the development of ESM collection spec.
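For a flavor of what an ESM collection description looks like, here is a minimal sketch modeled on the examples in the esm-collection-spec repo. The id, column names, and CSV path are invented for illustration:

```python
import json

# Minimal ESM-collection-style description (illustrative sketch, modeled
# on the esm-collection-spec examples; all concrete values are made up).
collection = {
    "esmcat_version": "0.1.0",
    "id": "example-cmip6",
    "description": "Example CMIP6-style collection (illustrative only)",
    "catalog_file": "example-cmip6.csv",  # CSV with one row per asset
    "attributes": [
        {"column_name": "source_id", "vocabulary": ""},
        {"column_name": "variable_id", "vocabulary": ""},
    ],
    # Which CSV column holds the data location, and what format it points to.
    "assets": {"column_name": "zstore", "format": "zarr"},
}

print(json.dumps(collection, indent=2))
```

The JSON describes the collection and its searchable attributes, while the referenced CSV holds one row per asset (e.g. per zarr store); intake-esm consumes the pair together.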
> Reviving this discussion to see if there are any decisions or further considerations that have arisen since last Feb. I'm building a data catalog now for climate/xarray type data. I'm most familiar with STAC, but have also started dabbling with Intake. And now I need to read up on all the *json systems mentioned above.
@talldave - we have hacked together something called ESM collection spec: https://github.com/NCAR/esm-collection-spec/. It is STAC-like but it is its own thing. We came up with this to catalog the CMIP6 cloud data (https://pangeo-data.github.io/pangeo-datastore/cmip6_pangeo.html). Some flavor of it is implemented by intake-esm
We would love to have more people working on this, and we welcome your involvement.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed because it had not seen recent activity. The issue can always be reopened at a later date.
@rsignell-usgs mentioned that there are already a lot of standards / catalogs / services etc. in this space. It would be useful to enumerate these so we don't go about re-inventing the wheel too much.
I'll kick things off by linking to @cholmes's blog posts about why they decided to invent STAC:
A STAC Static catalog:
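A static STAC catalog is just JSON files linked by hrefs, with no server required. A minimal root catalog might look like this (the id, hrefs, and stac_version are illustrative; check the spec for current values):

```python
import json

# Root of a static STAC tree; children and items are separate JSON files
# referenced by relative links. All names here are invented.
catalog = {
    "stac_version": "0.8.1",
    "id": "example-root",
    "description": "Example static STAC catalog (illustrative only)",
    "links": [
        {"rel": "self", "href": "catalog.json"},
        {"rel": "child", "href": "sst/collection.json"},
        {"rel": "item", "href": "sst/2019-01-01/item.json"},
    ],
}

print(json.dumps(catalog, indent=2))
```

A crawler (or a human) can discover everything in the catalog by following the `child` and `item` links, which is what makes the format workable as plain files on object storage.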