pangeo-data / pangeo-datastore-stac

STAC implementation of Pangeo Catalog

catalog zarr groups via collection-level assets #4

Closed rabernat closed 3 years ago

rabernat commented 3 years ago

STAC now supports collection-level assets: https://github.com/radiantearth/stac-spec/tree/master/extensions/collection-assets

This would be the appropriate way to catalog our Zarr stores, with each store corresponding to a single STAC collection with one asset.

Some questions we should answer are

charlesbluca commented 3 years ago

Thanks for the update on what's been going on in terms of STAC Zarr representation!

Personally, I think that any attributes that can be quickly obtained from zgroup.attrs.asdict() using Zarr are relevant to duplicate in the collection itself. This could make it easier to populate an online catalog built entirely around STAC, giving users relevant information on the data without requiring the dataset to be opened.

Looking at the attributes of a random CMIP6 store (gs://cmip6/DCPP/CCCma/CanESM5/dcppA-assim/r10i1p2f1/Amon/clt/gn/):

{'CCCma_model_hash': 'Unknown',
 'CCCma_parent_runid': 'none',
 'CCCma_pycmor_hash': '6278d0dcc93b56f98914b6e02b9a6f29194f6b49',
 'CCCma_runid': 'd2a-asm-e10',
 'Conventions': 'CF-1.7 CMIP-6.2',
 'YMDH_branch_time_in_child': '1958:01:01:00',
 'YMDH_branch_time_in_parent': '1958:01:01:00',
 'activity_id': 'DCPP',
 'branch_method': 'no parent',
 'branch_time_in_child': 39420.0,
 'branch_time_in_parent': 0.0,
 'cmor_version': '3.4.0',
 'contact': 'ec.cccma.info-info.ccmac.ec@canada.ca',
 'coordinates': 'lat_bnds time_bnds lon_bnds',
 'creation_date': '2019-06-12T23:45:56Z',
 'data_specs_version': '01.00.29',
 'experiment': 'Assimilation run paralleling the historical simulation, which may be used to generate hindcast initial conditions',
 'experiment_id': 'dcppA-assim',
 'external_variables': 'areacella',
 'forcing_index': 1,
 'frequency': 'mon',
 'further_info_url': 'https://furtherinfo.es-doc.org/CMIP6.CCCma.CanESM5.dcppA-assim.none.r10i1p2f1',
 'grid': 'T63L49 native atmosphere, T63 Linear Gaussian Grid; 128 x 64 longitude/latitude; 49 levels; top level 1 hPa',
 'grid_label': 'gn',
 'history': '2019-06-12T23:45:56Z ;rewrote data to be consistent with DCPP for variable clt found in table Amon.',
 'initialization_index': 1,
 'institution': 'Canadian Centre for Climate Modelling and Analysis, Environment and Climate Change Canada, Victoria, BC V8P 5C2, Canada',
 'institution_id': 'CCCma',
 'license': 'CMIP6 model data produced by The Government of Canada (Canadian Centre for Climate Modelling and Analysis, Environment and Climate Change Canada) is licensed under a Creative Commons Attribution ShareAlike 4.0 International License (https://creativecommons.org/licenses). Consult https://pcmdi.llnl.gov/CMIP6/TermsOfUse for terms of use governing CMIP6 output, including citation requirements and proper acknowledgment. Further information about this data, including some limitations, can be found via the further_info_url (recorded as a global attribute in this file) and at https:///pcmdi.llnl.gov/. The data producers and data providers make no warranty, either express or implied, including, but not limited to, warranties of merchantability and fitness for a particular purpose. All liabilities arising from the supply of the information (including any liability arising in negligence) are excluded to the fullest extent permitted by law.',
 'mip_era': 'CMIP6',
 'nominal_resolution': '500 km',
 'parent_activity_id': 'no parent',
 'parent_experiment_id': 'no parent',
 'parent_mip_era': 'no parent',
 'parent_source_id': 'no parent',
 'parent_time_units': 'no parent',
 'parent_variant_label': 'no parent',
 'physics_index': 2,
 'product': 'model-output',
 'realization_index': 10,
 'realm': 'atmos',
 'references': 'Geophysical Model Development Special issue on CanESM5 (https://www.geosci-model-dev.net/special_issues.html)',
 'source': 'CanESM5 (2019): \naerosol: interactive\natmos: CanAM5 (T63L49 native atmosphere, T63 Linear Gaussian Grid; 128 x 64 longitude/latitude; 49 levels; top level 1 hPa)\natmosChem: specified oxidants for aerosols\nland: CLASS3.6/CTEM1.2\nlandIce: specified ice sheets\nocean: NEMO3.4.1 (ORCA1 tripolar grid, 1 deg with refinement to 1/3 deg within 20 degrees of the equator; 361 x 290 longitude/latitude; 45 vertical levels; top grid cell 0-6.19 m)\nocnBgchem: Canadian Model of Ocean Carbon (CMOC); NPZD ecosystem with OMIP prescribed carbonate chemistry\nseaIce: LIM2',
 'source_id': 'CanESM5',
 'source_type': 'AOGCM',
 'sub_experiment': 'none',
 'sub_experiment_id': 'none',
 'table_id': 'Amon',
 'table_info': 'Creation Date:(20 February 2019) MD5:374fbe5a2bcca535c40f7f23da271e49',
 'title': 'CanESM5 output prepared for CMIP6',
 'tracking_id': 'hdl:21.14100/e8f5e5b0-8722-4c0f-8520-a1ec7fc2d061',
 'variable_id': 'clt',
 'variant_label': 'r10i1p2f1',
 'version': 'v20190429',
 'status': '2019-11-13;created;by nhn2@columbia.edu'}

It definitely looks like STAC attributes title, description, providers, and license could be populated using some of the Zarr attributes; this shouldn't be difficult to automate for title or description but will probably require some manual work for providers and license.
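
A minimal sketch of the automatable part, assuming the attributes have already been read into a plain dict (for example with zgroup.attrs.asdict()). The function name `collection_fields_from_attrs`, the `needs_review` key, and the choice of `experiment` as the source for `description` are hypothetical and were not settled in this thread. `providers` and `license` are not filled in; the raw attribute text is only passed along for manual curation.

```python
def collection_fields_from_attrs(attrs):
    """Derive draft STAC collection fields from a Zarr group's attributes."""
    # title: use the store's own title, else fall back to its source_id.
    title = attrs.get("title") or attrs.get("source_id", "untitled")
    # description: assumed here to come from the 'experiment' attribute.
    description = attrs.get("experiment", "")
    # providers and license need manual work, so only carry the raw text along.
    needs_review = {
        "providers": attrs.get("institution"),
        "license": attrs.get("license"),
    }
    return {"title": title, "description": description, "needs_review": needs_review}


# A small subset of the CMIP6 attributes listed above.
attrs = {
    "title": "CanESM5 output prepared for CMIP6",
    "experiment": "Assimilation run paralleling the historical simulation, "
                  "which may be used to generate hindcast initial conditions",
    "institution": "Canadian Centre for Climate Modelling and Analysis, "
                   "Environment and Climate Change Canada, Victoria, BC V8P 5C2, Canada",
}
fields = collection_fields_from_attrs(attrs)
print(fields["title"])
```

Keeping the unreviewed values under a separate key makes it hard to publish a free-text license string by accident in a field that STAC expects to hold a curated value.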

rabernat commented 3 years ago

> Personally, I think that any attributes that can be quickly obtained from zgroup.attrs.asdict() using Zarr are relevant to duplicate in the collection itself

This is great, but it misses out on something extremely important: the variables!!! When I'm looking at a dataset, the main thing I want is to know what the variables are, and the variable dimensions.

If we happen to have consolidated metadata for the dataset, then all of this is encoded in the .zmetadata key. So maybe we actually don't want to duplicate this metadata in the STAC collection and instead just reference the .zmetadata JSON file directly. At that point, it's only one more step for a client to fetch that file and render it somehow.
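
As a rough illustration of what a client could do with that file, the sketch below pulls variable names, dimensions, and shapes out of a consolidated-metadata document. It assumes the Zarr v2 consolidated layout (a top-level "metadata" object keyed by paths such as "clt/.zarray") and that dimension names are stored under the `_ARRAY_DIMENSIONS` attribute, which is the convention xarray uses when writing Zarr. The function name `summarize_variables` and the sample document, including its shape, are made up for the example.

```python
import json


def summarize_variables(zmetadata_text):
    """Map each array in consolidated Zarr metadata to its dims, shape, and dtype."""
    metadata = json.loads(zmetadata_text)["metadata"]
    summary = {}
    for key, value in metadata.items():
        # Array definitions live under "<name>/.zarray"; skip everything else.
        if not key.endswith("/.zarray"):
            continue
        name = key[: -len("/.zarray")]
        attrs = metadata.get(f"{name}/.zattrs", {})
        summary[name] = {
            "dims": attrs.get("_ARRAY_DIMENSIONS", []),
            "shape": value["shape"],
            "dtype": value["dtype"],
        }
    return summary


# Hypothetical, heavily trimmed .zmetadata content for a single-variable store.
sample = json.dumps({
    "zarr_consolidated_format": 1,
    "metadata": {
        ".zgroup": {"zarr_format": 2},
        "clt/.zarray": {"shape": [12, 64, 128], "dtype": "<f4"},
        "clt/.zattrs": {"_ARRAY_DIMENSIONS": ["time", "lat", "lon"]},
    },
})
variables = summarize_variables(sample)
print(variables)
```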

A few months back, I started playing around with a vue-based zarr renderer: https://github.com/rabernat/vue_zarr_experiment

rabernat commented 3 years ago

So each zarr store will have a STAC collection associated with it. That collection has an assets field, which points to a single asset. The structure of the asset is described here: https://github.com/radiantearth/stac-spec/blob/master/item-spec/item-spec.md#asset-object

An asset is an object that contains a link to data associated with the Item that can be downloaded or streamed. It is allowed to add additional fields.

| Field Name  | Type     | Description |
| ----------- | -------- | ----------- |
| href        | string   | REQUIRED. Link to the asset object. Relative and absolute links are both allowed. |
| title       | string   | The displayed title for clients and users. |
| description | string   | A description of the Asset providing additional details, such as how it was processed or created. CommonMark 0.29 syntax MAY be used for rich text representation. |
| type        | string   | Media type of the asset. |
| roles       | [string] | The semantic roles of the asset, similar to the use of rel in links. |

rabernat commented 3 years ago

roles = ['metadata', 'zarr-consolidated-metadata']
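
Putting the asset fields and these roles together, a collection-level asset for a store might look like the sketch below. The helper name `zmetadata_asset`, the asset title, and the `application/json` media type are assumptions rather than anything agreed on here; the store URL is the CMIP6 example from earlier in the thread, and the asset key `zmetadata` is just one possible name.

```python
def zmetadata_asset(store_url):
    """Build a STAC asset object pointing at a store's consolidated metadata."""
    # The consolidated metadata sits at the root of the store under ".zmetadata".
    href = store_url.rstrip("/") + "/.zmetadata"
    return {
        "href": href,
        "title": "Consolidated Zarr metadata",  # assumed display title
        "type": "application/json",  # assumed media type
        "roles": ["metadata", "zarr-consolidated-metadata"],
    }


store = "gs://cmip6/DCPP/CCCma/CanESM5/dcppA-assim/r10i1p2f1/Amon/clt/gn/"
collection_assets = {"zmetadata": zmetadata_asset(store)}
print(collection_assets["zmetadata"]["href"])
```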

charlesbluca commented 3 years ago

One issue I didn't foresee here is the fact that the majority of the buckets containing Zarr stores are requester pays, and will require some level of authentication to be able to access and crawl the metadata files.

For now, I can populate the collections with the relevant asset files, but we'll need to consider ways to get anonymous access to the files if we want this to work without a service account (the main reason the online catalog worked was that a GCP-authenticated account was serving the web pages).

Some potential solutions to this:

charlesbluca commented 3 years ago

Added consolidated metadata under zmetadata asset with #6.