radiantearth / stac-spec

SpatioTemporal Asset Catalog specification - making geospatial assets openly searchable and crawlable
https://stacspec.org
Apache License 2.0

STAC for generic netCDF-type data? #366

Closed rabernat closed 5 years ago

rabernat commented 5 years ago

Hi folks, I'm new here. I'm part of the pangeo project. We are working on a problem adjacent to the STAC community: developing open-source, cloud-native technology for working with generic netCDF-type data*. This might mean regional geospatial data from remote sensing (the closest overlap with COG / STAC), it may be global (think global climate model outputs or level 3/4 satellite products), or it may not be geospatial at all (think idealized turbulence simulations in cartesian geometry).

Our community generally does not use traditional GIS tools. However, as we move more data into cloud storage, we are thinking about many of the same cataloging / discovery issues that you in the STAC community have confronted (see e.g. https://github.com/pangeo-data/pangeo/issues/503). We are very convinced by the idea of static catalogs. The choice we have to make is whether to develop our own static catalog spec inspired by STAC, or whether STAC itself could meet our needs. My main question for you is this: is there scope within STAC itself for more generic datasets, such as netCDF-type data from global climate simulations? The main stumbling block appears to be that some of these datasets may not be able to be described as a GeoJSON feature, which is a fundamental requirement for the item spec.

*I say "netCDF-type data," but we are not necessarily dealing with netCDF/HDF5 files themselves, which are not very cloud friendly at this point. We are excited about a new format called zarr. We have written a bit about our approach to cloud-based data here: http://pangeo.io/data.html#data-in-the-cloud.

Thanks to everyone who volunteers time to this excellent project.

60south commented 5 years ago

Hi @rabernat,

My group is working along parallel lines with yours, although our objective is the rapid distribution of a variety of cloud-based satellite data sets. We had a similar question: would STAC work with HDF or HDF-like files? However, much of our data may be unprojected or sparse, so we've been moving away from netCDF and HDF because they don't handle that content very well. Geotiffs appear to be a non-starter because they depend on the data being projected. We're now evaluating other solutions for storing sparse geographic data.

That said, we may still 'chunk' our data to pre-defined bounding boxes. This technique seems to be compatible with the STAC spec and GeoJSON features. We really want to use STAC; we were headed in the GeoJSON direction anyway before we even found the STAC specification! We're going to have a lot of electro-optical and provenance metadata that may not yet be in the STAC spec, but our impression is that the format is flexible enough for our custom extensions. Ideally, we may start contributing extension suggestions to STAC.
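For what it's worth, the 'chunk to pre-defined bounding boxes' idea can be sketched in a few lines of Python. This is purely a hypothetical helper (the function name and the 10-degree grid are illustrative, not anything from the STAC spec): each intersecting tile would become one STAC item with a GeoJSON polygon geometry.

```python
from math import floor

def tile_bboxes(lon_min, lat_min, lon_max, lat_max, tile_deg=10.0):
    """Return the pre-defined tile bounding boxes (as [west, south, east, north])
    that a dataset's extent intersects, for a fixed global grid of tile_deg tiles.
    Hypothetical sketch: each returned tile could become one STAC item."""
    tiles = []
    x0 = floor(lon_min / tile_deg)
    x1 = floor((lon_max - 1e-9) / tile_deg)  # epsilon so an exact edge doesn't spill over
    y0 = floor(lat_min / tile_deg)
    y1 = floor((lat_max - 1e-9) / tile_deg)
    for ix in range(x0, x1 + 1):
        for iy in range(y0, y1 + 1):
            tiles.append([ix * tile_deg, iy * tile_deg,
                          (ix + 1) * tile_deg, (iy + 1) * tile_deg])
    return tiles
```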

Back to your question, about whether STAC could be used for generic netCDF-like data... I'd also like to read comments from people who are more familiar with implementing STAC for non-geotiff data formats. It would be interesting to see what others are doing and, perhaps, get some reassurance that we're on the right path.

glenn

cholmes commented 5 years ago

Good to hear from you @rabernat - I've followed pangeo a bit and it's cool to see similar approaches employed for different problem sets. And am psyched you are convinced on static catalogs - fully agree they are quite powerful. And it'd be great to collaborate on that concept. As for your main question:

The choice we have to make is whether to develop our own static catalog spec inspired by STAC, or whether STAC itself could meet our needs. My main question for you is this: is there scope within STAC itself for more generic datasets, such as netCDF-type data from global climate simulations? The main stumbling block appears to be that some of these datasets may not be able to be described as a GeoJSON feature, which is a fundamental requirement for the item spec.

So I think that within STAC there isn't scope for data that can not be referenced both spatially and temporally. Tools should be able to expect to browse and search on space and time - if they can't even count on that then it makes it harder to write STAC tools. But I'd love if we could collaborate more than just 'inspiring' your catalogs, and perhaps work on a spec that would be the 'parent' of both specs?

The rough idea (ignore all the bad naming) would be we'd have an 'Asset Catalog' spec that would define id, links and assets from Items. And I suppose it could be useful to use 'properties' in the same way as GeoJSON / STAC. Then STAC and Pangeo Asset Catalog specs would both 'extend' that same base. And then hopefully you could use our Catalog definition, and maybe even Collections? So we'd use the same structural pieces, so that tooling could parse out assets or links in the same manner. Ideally we'd also use the same 'extension' mechanism (though we are currently having a lively debate on it at #357, but hopefully we get to something solid that we could share).

Then the bigger win I think would be if we could share a set of common fields, which is also discussed in #357. But if there's a set of JSON fields used with catalogs that have the same meaning that would be a big win. We hope to eventually look into JSON-LD, and perhaps link into schema.org - ideally there's a place online that defines what those meanings are.

I don't think we need to like start with a new meta-spec between the two. But I'd say stay in closer touch than if you were just using STAC for 'inspiration' - let us know where you feel you need to diverge and why, and we can try to stay in sync for the shared core fields. And then once we're both a bit more mature we can extract out that 'meta-spec'. Though I think we'd try to just use it as a reference, and repeat the information so we don't have to make people read the meta-spec just to understand core STAC concepts.

Also I think you could have 'mixed' catalogs, where all implement Pangeo Asset Catalog, and those that are spatial implement STAC. There'd be interesting bits on how tools could navigate both, but I'd say we dive into those once we have a few implementations to look at.

cholmes commented 5 years ago

@60south we'd definitely love your extension contributions, and would like to try to make the core flexible enough to meet your needs. And while I do think it is important we define 'STAC' as cataloging the spatiotemporal stuff I could also see a mode in our STAC validators that doesn't 'fail' if you have non-spatial types in your catalog.

Like the ideal would be to define this 'meta-spec', but in the short term we could just say that the Catalog spec can link to non-STAC asset descriptions. So that you could make your catalog today with mixed types (the non-spatial ones), but only the spatial portion would be 'stac compliant'. I'll raise an issue for 0.7.0 to discuss this a bit more.

cholmes commented 5 years ago

Just added #367 to discuss this. Hopefully it should make it easier to evolve a Pangeo Asset Catalog that can include STAC for the spatial stuff. And if we do it right, it should help anyone experimenting with STAC who has non-spatial data they want to include. Chime in there to help us with the right mechanism.

60south commented 5 years ago

Hi Chris, thanks for the reply.

And while I do think it is important we define 'STAC' as cataloging the spatiotemporal stuff I could also see a mode in our STAC validators that doesn't 'fail' if you have non-spatial types in your catalog.

I'm a bit confused when you say non-spatial types. To be clear, all of our data is georeferenced and timestamped, and can be described as fitting within a geographic bounding box. I believe it can be described as GeoJSON features, but perhaps I'm miscalibrated.

Is there any reason why a netCDF or HDF5 file (or zarr, tiledb file, etc.), containing spatially georeferenced and timestamped data that falls within a prescribed bounding box, would not be appropriate as a data asset referenced by a STAC item?

cholmes commented 5 years ago

@60south - Ah, I wasn't sure if all your data was georeferenced and timestamped. I don't know HDF that well, so wasn't sure if 'unprojected' also was implying non-georeferenced, and indeed the thread started with @rabernat asking about data that can not be described by geojson.

But yes, any spatially referenced and timestamped data that falls within a prescribed bounding box is appropriate as a data asset referenced by a STAC Item. I'd say that's our exact scope :) And indeed I'm definitely interested in seeing how it works in practice with more than GeoTIFF - netCDF, HDF5, zarr, etc. are for sure in line with the types of assets we were imagining.

rabernat commented 5 years ago

Thanks for all the suggestions everyone.

I think the path forward for us is this: many of the zarr datasets in our existing catalogs are indeed geospatial (mostly global). So we will try to create STAC item entries for them and see how things play out. This will let us get our hands dirty with STAC. The issue of non-geospatial datasets is a real one, but I think it can be put off until after we first tackle the simpler problem of describing geospatial zarr / netCDF / HDF using the existing STAC spec.

cholmes commented 5 years ago

Sounds great @rabernat - keep us posted. If you want more interactive discussion you can find us chatting occasionally on https://gitter.im/SpatioTemporal-Asset-Catalog/Lobby

And @m-mohr had some great suggestions on #367, so hopefully we'll soon get an easy mechanism to add links to non-STAC items in STAC catalogs. We're definitely excited to see a nice cloud-native format like zarr described by STAC, along with netCDF / HDF in the mix!

rabernat commented 5 years ago

I am working on creating a STAC item to represent an existing zarr store (located here on GCS), whose xarray representation is the following

<xarray.Dataset>
Dimensions:    (latitude: 720, longitude: 1440, nv: 2, time: 8901)
Coordinates:
    crs        int32 ...
    lat_bnds   (time, latitude, nv) float32 dask.array<shape=(8901, 720, 2), chunksize=(5, 720, 2)>
  * latitude   (latitude) float32 -89.875 -89.625 -89.375 -89.125 -88.875 ...
    lon_bnds   (longitude, nv) float32 dask.array<shape=(1440, 2), chunksize=(1440, 2)>
  * longitude  (longitude) float32 0.125 0.375 0.625 0.875 1.125 1.375 1.625 ...
  * nv         (nv) int32 0 1
  * time       (time) datetime64[ns] 1993-01-01 1993-01-02 1993-01-03 ...
Data variables:
    adt        (time, latitude, longitude) float64 dask.array<shape=(8901, 720, 1440), chunksize=(5, 720, 1440)>
    err        (time, latitude, longitude) float64 dask.array<shape=(8901, 720, 1440), chunksize=(5, 720, 1440)>
    sla        (time, latitude, longitude) float64 dask.array<shape=(8901, 720, 1440), chunksize=(5, 720, 1440)>
    ugos       (time, latitude, longitude) float64 dask.array<shape=(8901, 720, 1440), chunksize=(5, 720, 1440)>
    ugosa      (time, latitude, longitude) float64 dask.array<shape=(8901, 720, 1440), chunksize=(5, 720, 1440)>
    vgos       (time, latitude, longitude) float64 dask.array<shape=(8901, 720, 1440), chunksize=(5, 720, 1440)>
    vgosa      (time, latitude, longitude) float64 dask.array<shape=(8901, 720, 1440), chunksize=(5, 720, 1440)>
Attributes:
    Conventions:                     CF-1.6
    Metadata_Conventions:            Unidata Dataset Discovery v1.0
    cdm_data_type:                   Grid
    comment:                         Sea Surface Height measured by Altimetry...
    contact:                         servicedesk.cmems@mercator-ocean.eu
    creator_email:                   servicedesk.cmems@mercator-ocean.eu
    creator_name:                    CMEMS - Sea Level Thematic Assembly Center
    creator_url:                     http://marine.copernicus.eu
    date_created:                    2014-02-26T16:09:13Z
    date_issued:                     2014-01-06T00:00:00Z
    date_modified:                   2015-11-10T19:42:51Z
    geospatial_lat_max:              89.875
    geospatial_lat_min:              -89.875
    geospatial_lat_resolution:       0.25
    geospatial_lat_units:            degrees_north
    geospatial_lon_max:              359.875
    geospatial_lon_min:              0.125
    geospatial_lon_resolution:       0.25
    geospatial_lon_units:            degrees_east
    geospatial_vertical_max:         0.0
    geospatial_vertical_min:         0.0
    geospatial_vertical_positive:    down
    geospatial_vertical_resolution:  point
    geospatial_vertical_units:       m
    history:                         2014-02-26T16:09:13Z: created by DUACS D...
    institution:                     CLS, CNES
    keywords:                        Oceans > Ocean Topography > Sea Surface ...
    keywords_vocabulary:             NetCDF COARDS Climate and Forecast Stand...
    license:                         http://marine.copernicus.eu/web/27-servi...
    platform:                        ERS-1, Topex/Poseidon
    processing_level:                L4
    product_version:                 5.0
    project:                         COPERNICUS MARINE ENVIRONMENT MONITORING...
    references:                      http://marine.copernicus.eu
    source:                          Altimetry measurements
    ssalto_duacs_comment:            The reference mission used for the altim...
    standard_name_vocabulary:        NetCDF Climate and Forecast (CF) Metadat...
    summary:                         SSALTO/DUACS Delayed-Time Level-4 sea su...
    time_coverage_duration:          P1D
    time_coverage_end:               2017-05-15T00:00:00Z
    time_coverage_resolution:        P1D
    time_coverage_start:             1993-01-01T00:00:00Z
    title:                           DT merged all satellites Global Ocean Gr...

The attributes in this dataset follow some version of the Attribute Convention for Data Discovery (ACDD). The ACDD attributes provide many of the metadata fields that are required for the STAC item spec. For example, geospatial_lon_min, geospatial_lon_max, etc. can easily be translated to a geojson polygon.
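That translation can be sketched in a few lines of Python. The function name here is made up, and the sketch assumes the longitudes are already normalized to -180..180 (this particular CMEMS dataset actually uses 0..360 longitudes, which is why the item below simply declares a global extent):

```python
def acdd_to_geojson(attrs):
    """Build a STAC-style bbox and GeoJSON polygon geometry from ACDD
    global attributes. Illustrative sketch only: assumes the geospatial_*
    attributes exist and longitudes are in [-180, 180]."""
    w = attrs["geospatial_lon_min"]
    s = attrs["geospatial_lat_min"]
    e = attrs["geospatial_lon_max"]
    n = attrs["geospatial_lat_max"]
    bbox = [w, s, e, n]
    # Counter-clockwise ring, closed by repeating the first vertex.
    geometry = {
        "type": "Polygon",
        "coordinates": [[[w, s], [e, s], [e, n], [w, n], [w, s]]],
    }
    return bbox, geometry
```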

However, I'm not sure what to do about the required STAC datetime property. This dataset is a single "item" which contains the entire temporal extent of the dataset. The item spec requires the datetime field to be a single datetime: https://github.com/radiantearth/stac-spec/blob/656cf81283aa7d767b8bcf0004506004e1ff6867/item-spec/json-schema/item.json#L61-L66

This reflects the fact that the primary STAC use case is satellite imagery, for which a single scene is an item. Looking over the various specs, it's starting to feel like this zarr dataset is closer to a STAC collection than it is to an individual item, since it represents a temporal range. However, I can't encode it as a STAC collection either, since it has no items; the individual temporal snapshots are only accessible at the zarr level, and it would make no sense to try to provide direct links to them.

Suggestions on how to move forward?

rabernat commented 5 years ago

Ok, I just found the datetime-range extension. So it looks like I can specify a range here in addition to a single datetime. In this case, I'm still not sure what single datetime to pick, since it's a 24-year long timeseries. Just the midpoint?
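For the record, the midpoint of this particular range works out as follows (plain Python, using the time_coverage_start / time_coverage_end attributes from the dataset above):

```python
from datetime import datetime

# Temporal coverage from the ACDD attributes above.
start = datetime(1993, 1, 1)
end = datetime(2017, 5, 15)

# Midpoint of the 8900-day span (matches the 8901 daily time steps).
midpoint = start + (end - start) / 2
print(midpoint.isoformat() + "Z")  # 2005-03-09T00:00:00Z
```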

rabernat commented 5 years ago

Ok, I have created two STAC files.

an item: https://storage.googleapis.com/pangeo-stac/test/example-item.json

{
  "type": "Feature",
  "id": "dataset-duacs-rep-global-merged-allsat-phy-l4-v3-alt",
  "bbox": [-180, -90, 180, 90],
  "geometry": {
    "type": "Polygon",
    "coordinates": [
      [[-180, -90], [-180, 90], [180, 90], [180, -90],[-180, -90]]
    ]
  },
  "properties":{
    "datetime":"1993-01-01T00:00:00Z",
    "dtr:start_datetime":"1993-01-01T00:00:00Z",
    "dtr:end_datetime":"2017-05-15T00:00:00Z"
  },
  "links": [
    {
      "rel": "self",
      "href": "https://storage.googleapis.com/pangeo-stac/test/example-item.json"
    },
    {
      "rel": "parent",
      "href": "https://storage.googleapis.com/pangeo-stac/test/example-catalog.json"
    }
  ],
  "assets": {
    "zarr": {
      "href": "https://storage.googleapis.com/pangeo-data/dataset-duacs-rep-global-merged-allsat-phy-l4-v3-alt/.zmetadata",
      "title": "Zarr consolidated metadata"
    }
  }
}

and a catalog: https://storage.googleapis.com/pangeo-stac/test/example-catalog.json

{
  "stac_version": "0.6.0",
  "id": "pangeo-gcs-master",
  "description": "Master Pangeo Google Cloud Storage Catalog",
  "links": [
    { "rel": "self", "href": "https://storage.googleapis.com/pangeo-stac/test/example-catalog.json" },
    { "rel": "item", "href": "https://storage.googleapis.com/pangeo-stac/test/example-item.json" }
  ]
}

These both validate with the stac-validator utility.

The main outstanding issues are:

m-mohr commented 5 years ago

Great work so far, it looks quite good.

Regarding the datetime, I'd personally propose using the center datetime; we did something similar for other data with a (in that case shorter) date range.

You can (and I recommend you do) use collections together with a single item (or multiple items, if you can link to individual parts of the zarr). Archive-like files are maybe something we need to discuss more (see Gitter).

rabernat commented 5 years ago

You can (and I recommend you do) use collections together with a single item

Can you elaborate the reason for this? My understanding of a collection is that it is a way to share metadata amongst items, thus reducing duplication in the individual item entries. What would be the advantage of using a collection with a single item? Why not just put the metadata in the item itself?

m-mohr commented 5 years ago

Sure! For me (as the main author of the collections specification), a collection is not just a way to share metadata. That's what the commons extension is made for. I see collections a bit more generally as a bigger group of similar smaller data chunks that people are potentially interested in using together, and the item is one of those chunks. For me, it doesn't really matter how they are structured technically; it is about the content. And there must have been a reason you collected and grouped the data for all these years together in a zarr archive. A zarr archive (as far as I understood) is exactly that: a group of similar smaller data chunks. So I'd model it as a collection with all its shared metadata. If there's only one referenceable file, then there's just one item in the collection. Fine for me, but they are still somehow "shared". In addition, collections also hold some additional information that such groups usually declare, e.g. provider, license, description, keywords, etc. These are not available for items, as providing them for each item usually doesn't make sense; items are mass-generated entities produced by machines. For example, nobody writes individual descriptions and keywords for each Sentinel-2 tile captured.

rabernat commented 5 years ago

It sounds like a collection is, in many ways, isomorphic to a Zarr array or group. Both are ways of describing an aggregation of individual data granules. In Zarr, the granules themselves are chunks. The big difference is that the Zarr granules (raw compressed binary data) are not meant to be addressed on their own and can't be opened without the parent Zarr array object and associated metadata.

Rather than trying to shoehorn Zarr arrays into the STAC item specification, it would be great if we could have a collection point directly to a zarr endpoint. Is it possible to have a collection with no items?

m-mohr commented 5 years ago

Yes, it seems like a STAC collection (or STAC catalog) is very similar to Zarr groups, and STAC items are similar to Zarr chunks.

Collections can have additional links and don't necessarily need to link to items. Nevertheless, links are a bit different from assets, which we have for items. Assets are meant to be the data that can be downloaded; links are just references you may follow if interested. So having no assets would imply that there's nothing to download, which is wrong, as the whole zarr archive could be downloaded. That's why I suggested additionally having a very small item with only the required information. But I see the point that an additional item file sounds bothersome...

mstrahl commented 5 years ago

Hi, I'm new here. I work at the Finnish Meteorological Institute, and we are just about to test STAC for some Sentinel-1/2/3 products that we or the Finnish Environment Institute have processed to be easy starting points for anyone to do something over Finland's territory. We'll have the data exposed from our own S3 storage. It is in cloud-optimized GeoTIFFs with lots of overviews in them.

But I want to comment on this issue, as our institute mainly does its business with numerical weather prediction and related data. @rabernat was excellent in bringing up netCDF-like data in general, but with the zarr and single-item topics this discussion has not yet addressed what I see as the main challenge: we have to catalog 5-dimensional data sets, and STAC is mostly being tested with items that have fewer dimensions. The fifth dimension is the variable: many different variables exist for each point and time in space. So what kind of extension or spec changes do we need for 3-dimensional spatial grids with a timeline of many steps and many different variables per grid point? This is the case for all numerical weather prediction model outputs, and these outputs are produced several times a day. We would want to reference each run, but most of the metadata would be common to every run. Input data references and run time would change, but the parameters of the output would all stay the same until a model version update or configuration change occurred.

We currently have two types of data and visualization servers at FMI for our weather service, because the raster/EO world and the numerical model grid world haven't fitted together easily. For grids we have our own software, SmartMet Server, and for rasters we are using GeoServers. Ideally these assets should be addressable in the same way for easy service production.

I have applied in Finland for funding to try to see how SmartMet Server assets could be announced in STACs, so we will work on this in 2019 if the funding gods look favorably upon us ;)

m-mohr commented 5 years ago

Welcome, @mstrahl! I (as part of openEO) also face this issue - not for netCDF specifically, but for data cubes in general. I just started a data cube extension (see #361), which we'll probably use for it. For now, it just describes the dimensions of the cube with some properties, which is probably what you are also looking for? I just started working on that extension and there's more to be done, so I'd like to hear your feedback and improve it further.
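To make that concrete, here is a purely illustrative sketch (as a Python dict, using the altimetry dataset from earlier in this thread) of what "describing the dimensions of the cube with some properties" might look like. The field names are hypothetical, since the extension in #361 was still a draft at this point:

```python
# Hypothetical dimension description for the CMEMS altimetry cube above.
# Field names ("type", "extent", "step", "values") are illustrative only.
cube_dimensions = {
    "longitude": {"type": "spatial", "extent": [0.125, 359.875], "step": 0.25},
    "latitude":  {"type": "spatial", "extent": [-89.875, 89.875], "step": 0.25},
    "time": {"type": "temporal",
             "extent": ["1993-01-01T00:00:00Z", "2017-05-15T00:00:00Z"],
             "step": "P1D"},
    # The "fifth dimension" @mstrahl mentions: which variable is sampled.
    "variable": {"type": "other",
                 "values": ["adt", "err", "sla", "ugos", "ugosa", "vgos", "vgosa"]},
}
```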

For STAC collections, which are based on WFS3 collections, we also have the limited alternative to extend the extents property with more extents, but that's semantically not really what we are looking for and doesn't apply for items. Also, it is not clear yet how to model multiple temporal extents, see https://github.com/opengeospatial/WFS_FES/issues/168.

cholmes commented 5 years ago

Yes, welcome @mstrahl! I'll have to dig into more of what you're trying to do to offer any ideas. But it's definitely great to have you involved early, along with @rabernat to push STAC on this. The idea behind STAC was to keep the core flexible enough to handle most any type of asset, enabling a lowest common denominator of space and time search for data. And then to enable communities of interest to figure out extensions that work to expose the information that makes the search more useful.

But if you need changes to the core we are still early enough that we can do that. And we are not fully tied to WFS 3, but we'd ideally push back changes to them.

I agree with @m-mohr that it sounds like a data cubes extension should help. I'd say the main thing to think about in STAC is what level of division is useful for search of data. The goal is not to catalog every possible information product in detail, but to enable users to search and actually find data they can use.

Sorry for speaking vaguely - I've been slammed lately, but hope to find some time to dig into your domain a bit more to help brainstorm how we could evolve STAC to be a good answer. But wanted to encourage your work, as this wrestling with real world implementation is just the type of thing that will make STAC a great spec.

rabernat commented 5 years ago

I thought I would post a link to the THREDDS XML spec: https://www.unidata.ucar.edu/software/thredds/v4.6/tds/catalog/InvCatalogSpec.html

A THREDDS catalog is a way to describe an inventory of available datasets. These catalogs provide a simple hierarchical structure for organizing a collection of datasets, an access method for each dataset, a human understandable name for each dataset, and a structure on which further descriptive information can be placed.

When I brought up STAC to the netCDF community on the CF mailing list, I was directed to this. It appears to cover much of the same ground as STAC. Despite being XML-based, it may be the case that adapting THREDDS is actually what we need, and adopting a new standard would be counterproductive.

These are things I'm currently mulling over.

mstrahl commented 5 years ago

Hi, THREDDS is an implementation of an OPeNDAP server (https://en.wikipedia.org/wiki/OPeNDAP), so the closer equivalent to STAC is OPeNDAP and the DAP protocol specs.

I think there is a discussion to be had about whether a variable extent, in addition to the spatio-temporal extents, would be useful to define for core STAC. Wanting to limit a STAC search by variable in addition to the spatio-temporal extents is a very common use case when it comes to numerical model outputs. It is even the primary subset looked for, since many users of weather maps want the full extent of most models, except for global ones. But let us at FMI get into STAC building a little more before making more detailed explanations or suggestions for improvements to STAC.

Cheers, Mikko


sjskhalsa commented 5 years ago

OPeNDAP's Hyrax server supports THREDDS catalog functionality, a covJSON encoding option for output, and JSON-LD markup for every browser navigable catalog page and for every dataset/granule OPeNDAP Data Access Form. It is conceivable they might also be interested in developing a STAC-compliant API. Also, OPeNDAP allows creation of virtual data cubes via aggregation, whereby a collection of 2D arrays can be accessed as though it were a single 3D array.
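The aggregation idea can be illustrated with a quick numpy sketch (numpy is assumed to be available; OPeNDAP/NcML do this lazily on the server side rather than in memory, so this only shows the resulting shape, not the mechanism):

```python
import numpy as np

# A collection of per-day 2D (latitude, longitude) grids, exposed as a
# single virtual 3D cube. Here we simply stack in memory to illustrate
# the (time, latitude, longitude) shape an aggregation presents.
daily_grids = [np.full((720, 1440), day, dtype=np.float64) for day in range(3)]
cube = np.stack(daily_grids, axis=0)
print(cube.shape)  # (3, 720, 1440)
```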

cholmes commented 5 years ago

Thanks for the link @sjskhalsa - Hyrax looks pretty awesome and the creators of it definitely sound like a group we should be talking to. Any chance of introducing us? That's awesome they have JSON-LD markup, as it's one of our main next goals, see #378 I think we'd be interested in learning from them, to help make STAC better.

Would people on this thread be interested in a call between STAC people focused on these 5D issues? @mojodna and I were riffing on ideas yesterday, and I'm sure @m-mohr has some good ones too, and at this point in spec development we still have a good bit of flexibility to question some of our root assumptions about data organization. The goal I'd keep in mind is that we want to help expose the core 'assets' behind services, for cloud tools to access directly, instead of putting everything behind service interfaces. But 5d definitely will require some different relationship options.

Perhaps aim for a call early february? Will give STAC group time to study the 5d data stuff and Hyrax/THREDDS/OPeNDAP/etc, and hopefully give @mstrahl and @rabernat (and any others) time to try out STAC.

Thanks all for contributing thinking on this, it's an exciting topic for sure! And we definitely want STAC to 'fit in', but hopefully play a role in uniting a lowest common denominator for search & web exposure of any spatially + temporally located asset.

m-mohr commented 5 years ago

Yes, I'd happily join the discussions and/or a call about how to model multi-dimensional datasets. I think I'd like to have another openEO partner in the call. We could also ask one of the openDataCube people (e.g. @omad) to join the call. I assume that's exactly what they also need.

I already started pushing things slightly within WFS (see https://github.com/opengeospatial/WFS_FES/issues/168) and STAC (see https://github.com/radiantearth/stac-spec/pull/361), but that just tries to get the very basics settled. The sooner we get things going, the better for STAC and also openEO. ;-)

rabernat commented 5 years ago

I have advertised this discussion on the CF conventions mailing list. Hopefully some of those people will chime in and consider attending the call.

rabernat commented 5 years ago

Also, OPeNDAP allows creation of virtual data cubes via aggregation, whereby a collection of 2D arrays can be accessed as though it were a single 3D array.

I believe this involves NcML (NetCDF Markup Language), which is a generic spec for describing netCDF metadata and aggregations of multiple files. It also overlaps heavily with the STAC goals. I have been using netCDF files for a decade, but I only very recently learned about NcML myself: https://www.unidata.ucar.edu/software/thredds/current/netcdf-java/ncml/

sjskhalsa commented 5 years ago

I reached out to James Gallagher (OPeNDAP VP) last week, haven't heard back yet. But will be happy to follow up, and include David Fulker, OPeNDAP Pres. (and former director of UNIDATA), suggesting a meetup.

lesserwhirls commented 5 years ago

I'd be happy to join in on a call. Perhaps @ethanrd as well?

Just as a bit of background on THREDDS (which isn't just the server), just in case it's of any use - Thematic Real-time Environmental Distributed Data Services kicked off as a project funded by NSF in 2001 (Division Of Undergraduate Education), and initially consisted of a client (netCDF-Java) and server (THREDDS Data Server). The project umbrella also includes a few specifications (like THREDDS Client Catalogs, NcML), as well as the python library Siphon, which understands THREDDS Client Catalogs, among other things.

Currently, the specs, like catalog and ncml, all live under the THREDDS repository, which unfortunately means they are mingled with the codebase for netCDF-Java and the TDS. I'd like to break the specs out in the near future.

The THREDDS Data Server (TDS) implements many types of RESTful services, one of which is a basic level of OPeNDAP capability. It also includes OGC services like WMS and WCS, a catalog service (for generating THREDDS client catalogs and HTML views of them), and metadata services (like serving out ISO 19115-2 records), thanks to the efforts of others in the community. We've also added support for generating Dataset objects from schema.org encoded in JSON-LD and embedding those into the HTML catalog pages. In this case, spatial and temporal metadata are exposed in both the schema.org Dataset objects and the THREDDS client catalog documents.
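For readers unfamiliar with the schema.org embedding mentioned above, here is a hedged, minimal sketch of the shape of such a JSON-LD Dataset record. All field values are invented examples; consult schema.org/Dataset for the actual vocabulary (and the TDS docs for what it really emits).

```python
import json

# Minimal schema.org Dataset record, as might be embedded in a catalog
# page as JSON-LD. Name, description, coverage values are invented.
dataset = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Example gridded dataset",          # invented
    "description": "Illustrative record only",  # invented
    "spatialCoverage": {
        "@type": "Place",
        "geo": {
            "@type": "GeoShape",
            # schema.org box: lat lon lat lon (SW corner, NE corner)
            "box": "-90 -180 90 180",
        },
    },
    "temporalCoverage": "2019-01-01/2019-12-31",
}

print(json.dumps(dataset, indent=2))
```

A crawler like Google Dataset Search reads exactly this kind of record out of a `<script type="application/ld+json">` tag in the page.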

mwengren commented 5 years ago

I'd like to add my name to the list for a future call. Although I haven't been involved in STAC directly, I've followed @cholmes' Medium posts on it and I know many of you already from the OSGeo community.

I was happy to see @rabernat's post to the CF mailing list on this topic since I think there is some technical overlap between what STAC aims to do and what's been ongoing in the CF community for years (caveat: I'm by no means a CF expert, but I do subscribe to the list mostly as a lurker until the STAC topic came up).

Reading up on STAC a bit, I can see it's definitely sat-focused, but since it aims to include 'any file that represents information about the earth captured in a certain space and time' there's certainly justification for aligning it, if possible, with existing earth science community standards like CF and ACDD (both linked previously), and perhaps also the CF Standard Names vocabulary (currently version 67), although a standard-name vocabulary may be beyond STAC's scope/purpose.

I see a disconnect between the way some DAP-based software works (THREDDS, as described above, and also ERDDAP, which aggregate netCDF/HDF and other data sources into sometimes very large and infinitely granular data cubes) and what STAC aims to do: allow interoperability between catalogs of more finite data granules. But I suppose that's what a data cube extension would be for :).

If software like ERDDAP and THREDDS could offer STAC-compliant catalog endpoint(s) to allow search and access to their datasets, that seems like a good win. I don't see how the Item concept would translate to existing DAP-based servers like these, though. It might be more suitable for data that is chunked in a format like zarr and served via regular HTTP from S3/GCS/etc., rather than via DAP or other higher-level web services that serve multi-dimensional array-oriented data.
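To illustrate why the Item concept maps more naturally onto a zarr store in object storage, here is a hedged sketch of what such an Item *might* look like. The core fields (type, id, geometry, bbox, properties, assets, links) come from the STAC Item spec, which requires a GeoJSON Feature; the id, URL, and zarr media type are invented assumptions.

```python
import json

# Hypothetical STAC Item pointing at a zarr store over plain HTTP.
# Bucket URL, id, and asset media type are invented for illustration.
item = {
    "type": "Feature",                      # STAC Items are GeoJSON Features
    "id": "example-zarr-item",              # invented id
    "geometry": {
        "type": "Polygon",
        "coordinates": [[[-180, -90], [180, -90], [180, 90],
                         [-180, 90], [-180, -90]]],
    },
    "bbox": [-180, -90, 180, 90],
    "properties": {"datetime": "2019-01-01T00:00:00Z"},
    "assets": {
        "data": {
            "href": "https://example-bucket.storage/example.zarr",  # invented
            "type": "application/vnd+zarr",  # assumed media type
        }
    },
    "links": [],
}

# A GeoJSON Feature must at minimum carry type/geometry/properties:
assert {"type", "geometry", "properties"} <= item.keys()
print(json.dumps(item, indent=2)[:60])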

Perhaps a data cube in THREDDS/ERDDAP could be algorithmically chunked on the fly into equal-sized STAC Items with corresponding access URLs, but their number would grow exponentially, basically the dimensionality problem @mstrahl mentioned.
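The growth problem described above is easy to quantify: the number of equal-sized chunks (hence candidate Items) is the product over dimensions of ceil(size / chunk), so each added dimension multiplies the count. A small sketch, with all sizes invented:

```python
from math import ceil, prod


def n_items(dims, chunks):
    """Number of equal-sized chunks (candidate STAC Items) for a cube.

    dims, chunks: per-dimension sizes; example values below are invented.
    """
    return prod(ceil(d / c) for d, c in zip(dims, chunks))


# A modest 4-D cube (time, level, y, x) chunked 10 ways per dimension
# already yields 10**4 Items; a 5th dimension would multiply that again.
print(n_items((1000, 50, 1800, 3600), (100, 5, 180, 360)))  # 10000
```

This is why cataloging the whole cube as one entity (e.g. via a data cube extension) scales better than materializing an Item per chunk.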

ethanrd commented 5 years ago

> I'd be happy to join in on a call. Perhaps @ethanrd as well?

Yes, I'd be interested in joining a call as well.

rabernat commented 5 years ago

Thanks to everyone who chimed in here! I know less about this stuff than everyone else here and am basically just trying to play matchmaker.

I am recovering from jet lag right now. Within a day or two, I will send out a poll to try to schedule a call for anyone interested in digging deeper.

rabernat commented 5 years ago

I have created a doodle poll for anyone interested in continuing this discussion via a video call.

https://doodle.com/poll/eamhnih9y2kq5498

The options are for the week of Feb 4-8 and are chosen to be bi-coastal friendly. I will host the call via Zoom.

cholmes commented 5 years ago

Shall we set it for Wednesday at 11 Pacific? That's the only time I see all greens.

I wasn't expecting so much great participation - was expecting something much smaller. Let's try to cap it at the current people, as I find conversations with bigger groups can be tougher.

@rabernat - do you want to catch up sometime on Monday or Tuesday and flesh out an agenda with me? It can be pretty simple; I just want to be sure we make maximal use of everyone's time.

rabernat commented 5 years ago

Ryan Abernathey is inviting you to a scheduled Zoom meeting.

Topic: STAC / Pangeo NetCDF Discussion
Time: Feb 6, 2019 2:00 PM Eastern Time (US and Canada)

Join Zoom Meeting https://columbiauniversity.zoom.us/j/838989507

One tap mobile: +16468769923,,838989507# US (New York); +16699006833,,838989507# US (San Jose)

Dial by your location: +1 646 876 9923 US (New York); +1 669 900 6833 US (San Jose)
Meeting ID: 838 989 507
Find your local number: https://zoom.us/u/abIn8fRWIF

> @rabernat - do you want to catch up sometime on Monday or Tuesday and flesh out an agenda with me?

Yes definitely. I'll get a draft agenda started and ping you on Monday.

cholmes commented 5 years ago

Sounds great. Ping me on gitter, or email my github username but at planet /dot/ com

rabernat commented 5 years ago

Tentative agenda for today

Guiding Questions:

- I have 1000 netCDF files in cloud storage. How do I catalog them today / tomorrow?
- I have several large zarr arrays in cloud storage. How do I catalog them today / tomorrow?

Additional Prompts:

Wrap up (~5 minutes)

Next Steps and TODOs
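As a concrete starting point for the first guiding question, here is a hedged sketch of cataloging a directory of netCDF files as static JSON: one record per file plus a root listing. The function name, file layout, and record fields are invented for illustration; a real solution would emit STAC-conformant Catalog and Item documents instead.

```python
import json
import pathlib


def build_catalog(root):
    """Write one minimal JSON record per .nc file plus a root catalog.

    Record layout is a placeholder, not the STAC Catalog/Item spec.
    """
    root = pathlib.Path(root)
    ids = []
    for nc in sorted(root.glob("*.nc")):
        record = {"id": nc.stem, "asset": nc.name}
        (root / f"{nc.stem}.json").write_text(json.dumps(record))
        ids.append(record["id"])
    # Root catalog simply lists the per-file record ids.
    (root / "catalog.json").write_text(json.dumps({"items": ids}))
    return ids
```

Because everything is plain JSON next to the data, the result is crawlable with nothing but HTTP, which is the core appeal of static catalogs.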

sjskhalsa commented 5 years ago

Possibly relevant: "ability to take nc files and obtain an RDF graph ... [as] a pathway to a schema.org profile." https://github.com/binary-array-ld/bald/issues/80

rsignell-usgs commented 5 years ago

Just cross referencing here some information regarding the use of THREDDS, ISO metadata, CSW and pycsw for cataloging NetCDF-type data and creating automated workflows in the IOOS program: https://github.com/pangeo-data/pangeo-datastore/issues/3

rabernat commented 5 years ago

Rich, I really hope you will be joining us today!

cofinoa commented 5 years ago

> Possibly relevant: "ability to take nc files and obtain an RDF graph ... [as] a pathway to a schema.org profile." binary-array-ld/bald#80

https://binary-array-ld.github.io/netcdf-ld/

m-mohr commented 5 years ago

Now that we have a data cube extension, I'd say we can close this general issue and instead work on specific issues that come up once somebody works with netCDF in STAC. Feel free to re-open or add more comments if necessary.