radiantearth / stac-spec

SpatioTemporal Asset Catalog specification - making geospatial assets openly searchable and crawlable
https://stacspec.org
Apache License 2.0

Refine the collection concept #81

Closed cholmes closed 5 years ago

cholmes commented 6 years ago

See discussion in #63

Should contain the information about a set of Items that share metadata and assets.

Should include the collection level metadata so the Items don't have to repeat so much.

Should include the json-schema to validate (or probably a link to it?)

May be the spot we put DCAT type metadata.

cholmes commented 6 years ago

Not sure yet how this fits in vis-à-vis the catalog.json. A catalog is the entry point to crawl a set of data. And I think a catalog could be heterogeneous - could have multiple collections in it.
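As a rough illustration of the crawling idea, here is a minimal Python sketch that walks a catalog by following its links. It assumes that links are objects with `rel` and `href` fields and that sub-catalogs or collections are referenced with `rel` set to `child`. The `fetch` callable, the `name` field and the example.com URLs are hypothetical, and the documents live in memory so nothing is requested over the network.

```python
from typing import Callable, Dict, Iterator, Optional, Set


def crawl(url: str, fetch: Callable[[str], dict],
          seen: Optional[Set[str]] = None) -> Iterator[dict]:
    """Yield every document reachable from `url` through rel="child" links."""
    if seen is None:
        seen = set()
    if url in seen:  # guard against link cycles
        return
    seen.add(url)
    doc = fetch(url)
    yield doc
    for link in doc.get("links", []):
        if link.get("rel") == "child":
            yield from crawl(link["href"], fetch, seen)


# Hypothetical in-memory catalog whose children are two different collections.
DOCS: Dict[str, dict] = {
    "https://example.com/catalog": {
        "name": "root",
        "links": [
            {"rel": "self", "href": "https://example.com/catalog"},
            {"rel": "child", "href": "https://example.com/catalog/landsat"},
            {"rel": "child", "href": "https://example.com/catalog/sentinel"},
        ],
    },
    "https://example.com/catalog/landsat": {"name": "landsat", "links": []},
    "https://example.com/catalog/sentinel": {"name": "sentinel", "links": []},
}

names = [doc["name"] for doc in crawl("https://example.com/catalog", DOCS.__getitem__)]
print(names)
```

Passing `fetch` as a parameter keeps the traversal independent of how documents are retrieved, so the same function could be pointed at static files or an HTTP client.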

m-mohr commented 6 years ago

I saw us mentioned in #63, thanks for the reminder and congratulations on the first release of STAC.

Update 2018-04-26: After the discussion in #105 I have rewritten this text.

A short overview of my intention: At openEO we are looking for a standard to adopt to make metadata about image collections or other assets available to our users for processing. The main issue we currently have with STAC is this definition from the catalog:

All static catalogs must contain at least 1 Asset, as the point of the SpatioTemporal Asset Catalog is to be link to actual actual data, not to just reference metadata (though it is not required that all users have permissions to access the asset).

We don't want to require back-ends to list their assets, as some might not want to make their raw data available directly. With the growing amount of EO data, processing is moving more and more into the cloud, where you don't need to download the data at all but still need metadata descriptions for the data you are processing. That's why I have no link with rel="item" in my examples. That's probably not valid with STAC? Is this fixed? Anyway, STAC looks promising to us regarding the metadata, with the introduction of collections (and catalogs). Therefore, I support the idea of having collections and also of merging them with the catalogs.

To get a first feeling for the current specification, I'll just try to convert a reduced example from our current specification for image collections to STAC.

openEO data specification v0.0.2 example

A specific image collection (data cube) is currently defined as follows:

{
  "data_id": "Sentinel-2A-L1C",
  "description": "Sentinel 2 Level-1C: Top-of-atmosphere reflectances in cartographic geometry",
  "source": "European Space Agency (ESA)",
  "extent": {
    "srs": "EPSG:4326",
    "left": -180,
    "right": 180,
    "bottom": -85,
    "top": 85
  },
  "time": {
    "from": "2016-01-01",
    "to": "2017-10-01"
  },
  "bands": [
    {
      "band_id": "1",
      "wavelength_nm": 443.9,
      "res_m": 60,
      "scale": 0.0001,
      "offset": 0,
      "type": "int16",
      "unit": "1"
    },
    {
      "band_id": "2",
      "name": "blue",
      "wavelength_nm": 496.6,
      "res_m": 10,
      "scale": 0.0001,
      "offset": 0,
      "type": "int16",
      "unit": "1"
    },
    {
      "band_id": "3",
      "name": "green",
      "wavelength_nm": 560,
      "res_m": 10,
      "scale": 0.0001,
      "offset": 0,
      "type": "int16",
      "unit": "1"
    }
  ]
}

I limited the number of bands for a more compact example.

There is information we want to add (inspired by STAC), but which is still missing in our example:

The data converted to STAC

http://www.openeo.org/api/v0/data:

{
  "name":"openEO R back-end",
  "description":"A catalog listing all data sets available for an openEO back-end (R in this case).",
  "links":[
    {
      "rel":"self",
      "href":"http://www.openeo.org/api/v0/data"
    },
    {
      "rel":"root",
      "href":"http://www.openeo.org/api/v0/data"
    },
    {
      "rel":"child",
      "href":"http://www.openeo.org/api/v0/data/Sentinel-2A-L1C",
      "title":"Sentinel-2A-L1C"
    }
  ],
  "contact":{
    "name":"openEO",
    "url":"https://www.openeo.org"
  },
  "homepage":"https://www.openeo.org"
}

http://www.openeo.org/api/v0/data/Sentinel-2A-L1C:

{
  "name":"Sentinel-2A-L1C",
  "description":"The Sentinel Level-1C product is composed of 100x100 km2 tiles (ortho-images in UTM/WGS84 projection). The Level-1C product results from using a Digital Elevation Model (DEM) to project the image in cartographic geometry. ...",
  "links":[
    {
      "rel":"self",
      "href":"http://www.openeo.org/api/v0/data/Sentinel-2A-L1C"
    },
    {
      "rel":"root",
      "href":"http://www.openeo.org/api/v0/data",
      "title":"openEO R back-end"
    },
    {
      "rel":"parent",
      "href":"http://www.openeo.org/api/v0/data",
      "title":"openEO R back-end"
    },
    {
      "rel":"collection",
      "href":"http://www.openeo.org/api/v0/data/Sentinel-2A-L1C/collection",
      "title":"Sentinel-2A-L1C"
    }
  ],
  "geometry":{
    "type":"Polygon",
    "coordinates":[
      [
        [
          -180,
          -85
        ],
        [
          180,
          -85
        ],
        [
          180,
          85
        ],
        [
          -180,
          85
        ],
        [
          -180,
          -85
        ]
      ]
    ]
  },
  "startDate":"2017-01-01T00:00:00Z",
  "endDate":"2018-01-31T00:00:00Z",
  "contact":{
    "name":"European Space Agency (ESA)",
    "url":"https://sentinel.esa.int/web/sentinel/home"
  },
  "formats":[
    "jp2"
  ],
  "keywords":[
    "sentinel"
  ],
  "homepage":"https://sentinel.esa.int/documents/247904/690755/Sentinel_Data_Legal_Notice"
}

http://www.openeo.org/api/v0/data/Sentinel-2A-L1C/collection:

{
  "properties":{
    "collection_name":"Sentinel-2A-L1C",
    "provider":"European Space Agency (ESA)",
    "license":"https://sentinel.esa.int/documents/247904/690755/Sentinel_Data_Legal_Notice",
    "eo:platform":"sentinel-2",
    "eo:instrument":"S2A",
    "eo:collection":"Sentinel-2A-L1C",
    "openeo:proprietary":"Example"
  },
  "eo:bands":{
    "1":{
      "common_name":"coastal",
      "gsd":60.0,
      "wavelength":0.4439
    },
    "2":{
      "common_name":"blue",
      "gsd":10.0,
      "wavelength":0.48
    },
    "3":{
      "common_name":"green",
      "gsd":10.0,
      "wavelength":0.56
    }
  }
}
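The mechanical part of this conversion can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the extent is assumed to be in EPSG:4326 already, and the wavelength is divided by 1000 because the examples above go from 443.9 to 0.4439, which I read as nanometres to micrometres. The hand-written STAC example above uses slightly different values for some bands, so the output will not match it exactly.

```python
def extent_to_polygon(extent: dict) -> dict:
    """Turn a left/right/bottom/top extent into a closed GeoJSON Polygon."""
    left, right = extent["left"], extent["right"]
    bottom, top = extent["bottom"], extent["top"]
    # The ring starts and ends on the same corner, as GeoJSON requires.
    ring = [[left, bottom], [right, bottom], [right, top], [left, top], [left, bottom]]
    return {"type": "Polygon", "coordinates": [ring]}


def bands_to_eo(bands: list) -> dict:
    """Key bands by band_id and rename fields to the eo:bands style used above."""
    result = {}
    for band in bands:
        entry = {
            "gsd": float(band["res_m"]),
            # Assumption: nm -> micrometres, as suggested by 443.9 -> 0.4439.
            "wavelength": round(band["wavelength_nm"] / 1000, 4),
        }
        if "name" in band:  # band 1 in the openEO example has no name
            entry["common_name"] = band["name"]
        result[band["band_id"]] = entry
    return result


openeo_doc = {
    "extent": {"srs": "EPSG:4326", "left": -180, "right": 180, "bottom": -85, "top": 85},
    "bands": [
        {"band_id": "1", "wavelength_nm": 443.9, "res_m": 60},
        {"band_id": "3", "name": "green", "wavelength_nm": 560, "res_m": 10},
    ],
}

geometry = extent_to_polygon(openeo_doc["extent"])
eo_bands = bands_to_eo(openeo_doc["bands"])
print(geometry)
print(eo_bands)
```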

My thoughts and remarks from the conversion process in random order:

  1. The structure is much more complex than before (= openEO example). Merging the catalog and the collection for Sentinel-2A-L1C would help to reduce it and avoid duplication, and would make things much more compact, as there are fewer files to link, request and parse.

  2. I don't have any items. I only use the elements from the items in the collection that are useful for us. Is that valid? Probably not...

  3. From @matthewhanson in #105 :

    I think collection_name and collection_description are outdated fields, and instead the collection that an item belongs to is included as a link. Although, we didn't really discuss this much, and I'm keen on the idea of having the collection_name at least in an item. It's much more readable to a user to see a title than a link.

Why not simply remove collection_name and collection_description? Why does the EO profile have its own eo:collection? I don't see much added benefit in them. You could link to the collection and add a title. Adding titles to links could be useful in general, as it makes things a little easier to explore. I'd encourage doing that, and I have already added titles to the examples above.

  4. We don't need the formats attribute because the user doesn't handle the files; the back-end takes care of that. The documentation for this attribute is very limited. Is it required?

  5. I think it's a bit inconsistent that the eo:bands are not part of the properties. It also makes it harder to parse the file (how to easily get all properties that are not part of the core?)

  6. The spec mentions id, geometry, datetime, license, provider and other fields. It's not really clear whether they should be used as part of the properties object or in the top-level object. For example, id and geometry are at the top level, but I found datetime and provider in the properties?! Also, in examples I found "geometry" in both the top-level object and the properties object.

  7. datetime is required by items, but it only allows a concrete date and time, which is not really available for image collections taken over years. So this usually can't be put into a collection; the same could apply to bbox/geometry. How could we define the whole spatial and temporal extent of a catalog/collection? According to the definition of collections from @matthewhanson in #105 (see quote below), it is probably not meant to be in a collection, but it could definitely be useful for catalogs. I just found that catalogs have a geometry, startDate and endDate, which are only mentioned in the schema, not in the textual description.

    So, the idea of a collection is that it doesn't have any specific metadata defined for it. It is simply any metadata fields that would apply to all items in the collection. So all EO fields are defined for items only. Any of those could be put into the collection.

  8. I probably missed it, but how can I define which image collection another image collection is derived from?

cholmes commented 6 years ago

Thanks for all the feedback @m-mohr!

Lots of good little detail here, but first the big picture:

I think we (STAC + Open-EO) should jointly define an 'EO Collection Metadata' JSON spec, ideally in collaboration with the OpenSearch EO profile as well. This should define the specific properties that describe a collection of EO data, while also recommending other standards for more generic metadata, like DCAT. It's important for STAC so that we don't have to repeat the same information over and over again in each Item (and indeed can use Matt's 'merge' concept to produce fuller metadata records). And it's clearly important for Open-EO. It should also be useful for those who might want to build something like a catalog of all STAC collections - they could crawl the collection info.
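To make the 'merge' idea a bit more concrete, here is a minimal Python sketch. The exact merge semantics are not defined in this thread, so this assumes the simplest rule: collection-level properties act as defaults and an Item's own properties win on conflict. The function name and the sample values are hypothetical.

```python
def merge_item(collection: dict, item: dict) -> dict:
    """Return a copy of `item` whose properties are filled in from `collection`."""
    merged = dict(item)  # shallow copy so the input item is left untouched
    merged["properties"] = {
        **collection.get("properties", {}),  # shared, collection-level defaults
        **item.get("properties", {}),        # item values override the defaults
    }
    return merged


collection = {
    "properties": {
        "provider": "European Space Agency (ESA)",
        "eo:platform": "sentinel-2",
    }
}
item = {
    "id": "example-scene",  # hypothetical item id
    "properties": {"datetime": "2017-06-01T10:00:00Z"},
}

full_record = merge_item(collection, item)
print(full_record["properties"])
```

With a rule like this an Item only needs to carry what is specific to it, and a client can rebuild the fuller record on demand.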

I think we should also aim to align with WFS 3.0. Talking to the editors, it sounds like the 'collections' metadata can be pretty flexible. See for example http://geo.kralidis.ca/pygeoapi/collections/lakes (specified in feature collections metadata). It has the temporal and spatial extents there, and could be expanded for the other EO features.

With collaboration at the collection level we can get to a set of core fields to describe EO collections. Open-EO can focus on treating them as one 'thing', and STAC focuses on the search + access to descriptions of the individual scenes (which a cloud scale geoprocessing engine like GEE or GeoTrellis may leverage to construct their api interface used by Open-EO). So from that perspective I think 'stac' does aim to keep its focus on search of actual data. But we both need some way to describe the collection data, which is another level of 'search' - find me a collection of data that meets my criteria.

cholmes commented 6 years ago

As for some more detailed thoughts:

1) Yes, I definitely want to somehow merge 'collection' and 'catalog' - that's bugged me since we put it out, but I still don't have a clear picture of exactly how to do it, as I also want to bring 'static stac' and 'stac api' closer together, and the latter doesn't have a 'catalog' concept. I think it should likely just be one main 'collection metadata' thing that contains most of the metadata. 'Catalogs' would then turn into mostly sets of 'links', and one could have a 'root catalog' that combines the two functions. I do think we need to think about static catalogs that are heterogeneous though - containing data from different collections.

2) As above I think we should just define the 'collection' metadata, though have things like 'license' share meaning and options/definitions between the two.

3) Makes sense to kill them, I think those were a bit lost in the transition from Matt to me. I do like the idea of using 'title' a lot. I've been thinking about making HTML pages from the stac JSON, and having link titles would be useful in some cases.

4) As you've noticed, that catalog definition is pretty poorly thought out. It needs a much stronger definition and, I think, to be combined with other stuff. I'll try to finish my thinking on catalog vs collection vs wfs feature collection metadata and write up a proposal.

5) I'd be open to having it in properties, though I think it is just the 'join' case - in general I think bands should be at the collection / catalog level.

6) Good point - we should work to explain that more. I think there's decent logic behind where each goes, but it's not well explained (a lot of it is a desire to keep 'properties' flat so it works well in existing geo software). Where is geometry in properties? That sounds like a bug to me.

7) We do have the notion that specific 'profiles' could define more times. But I think in this case we just want a temporal extent at the collection / catalog level. WFS has defined a temporal extent for feature collection metadata, and I think we should just use that: https://rawgit.com/opengeospatial/WFS_FES/master/docs/17-069.html#_feature_collections_metadata (particularly example 4: "extent": { "spatial": [ 7.01, 50.63, 7.22, 50.78 ], "temporal": [ "2010-02-15T12:34:56Z", "2018-03-18T12:11:00Z" ] }).

8) I don't think we've really specced this, but the idea was to use a link with something like 'rel=source'. Though we were thinking about it more at the item level, less at the collection level.
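The extent example quoted in point 7 can be exercised with a short Python sketch that checks whether an item falls inside a collection-level extent. It assumes the spatial array is ordered [min x, min y, max x, max y] and the temporal array is [start, end] in UTC with a trailing Z. The function names and the test bounding box are hypothetical, and extents crossing the antimeridian are not handled.

```python
from datetime import datetime, timezone


def parse_utc(stamp: str) -> datetime:
    """Parse timestamps of the form 2010-02-15T12:34:56Z."""
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def covers(extent: dict, bbox: list, stamp: str) -> bool:
    """True if bbox [minx, miny, maxx, maxy] and stamp lie inside the extent."""
    minx, miny, maxx, maxy = extent["spatial"]
    start, end = (parse_utc(s) for s in extent["temporal"])
    inside_space = (minx <= bbox[0] and miny <= bbox[1]
                    and bbox[2] <= maxx and bbox[3] <= maxy)
    return inside_space and start <= parse_utc(stamp) <= end


# Values taken from the extent quoted above.
EXTENT = {
    "spatial": [7.01, 50.63, 7.22, 50.78],
    "temporal": ["2010-02-15T12:34:56Z", "2018-03-18T12:11:00Z"],
}

print(covers(EXTENT, [7.05, 50.65, 7.10, 50.70], "2015-06-01T00:00:00Z"))
print(covers(EXTENT, [7.05, 50.65, 7.10, 50.70], "2019-01-01T00:00:00Z"))
```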
m-mohr commented 6 years ago

Sorry for the late reply, and thanks again for your answers and the willingness to work on an EO Collection Metadata JSON spec, which we can both benefit from. We've already talked about it a little on Gitter, but I'm still not completely sure how this should be incorporated into STAC. You said:

Oh, the idea is a full separate 'spec', not just a separate profile. It'd be like a 'microspec', hopefully just a couple pages with a json schema, that is just about EO collection level fields. Then the stac eo profile and open-eo would both 'use' it. So a unified approach for sure, just that we'd both reference a single document, instead of trying to shoehorn one in to the other. We can 'incubate' it in STAC EO profile - I'm just saying the goal should be a small 'eo collection metadata' spec/schema.

I think it makes sense to re-use the EO profile, but I'm not sure how a separate microspec could be bundled with STAC. Wouldn't it duplicate the catalog/collection in STAC, or would that be replaced somehow? I could also imagine keeping it in STAC. Overall I am already pretty happy with it, but it surely needs some minor adjustments to be usable for collection discovery only (i.e. make assets optional).

  1. and 2. I agree with you.
  3. Should we open a separate issue for this?
  4. Great. Can't wait to read it.
  5. It would be in line with GeoJSON, I think. Having it consistently in properties sounds reasonable to me, but it's probably also a matter of taste.
  6. Regarding the potential bug: The collection spec itself mentions the geometry at the properties level: https://github.com/radiantearth/stac-spec/blob/master/extensions/stac-collection-spec.md#using-a-collection
  7. Sounds reasonable. There was also this discussion with Matt regarding ranges, which might influence this.
  8. There is a rel type called convertedFrom, which is probably what we are looking for. Having it on the item level also makes it applicable for collections. Usually a collection of items is processed with the same processes and therefore it is shared by many items and can be at the collection level, too. Correct me if I'm wrong.
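The provenance links discussed in the last point could be looked up as in the following Python sketch. Nothing here is specified yet: the rel values `source` and `convertedFrom` are only the candidates named in this thread, the fallback from item to collection is my own assumption, and the function name and URLs are hypothetical.

```python
def derived_from(item: dict, collection: dict,
                 rels: tuple = ("source", "convertedFrom")) -> list:
    """Collect provenance hrefs from the item, falling back to the collection."""
    def hrefs(doc: dict) -> list:
        return [link["href"] for link in doc.get("links", [])
                if link.get("rel") in rels]

    # Assumption: item-level links take precedence over collection-level ones.
    return hrefs(item) or hrefs(collection)


collection = {
    "links": [
        {"rel": "self", "href": "https://example.com/collections/l2a"},
        {"rel": "convertedFrom", "href": "https://example.com/collections/l1c"},
    ]
}
item_without_links = {"links": []}
item_with_link = {
    "links": [{"rel": "source", "href": "https://example.com/items/raw-scene"}]
}

print(derived_from(item_without_links, collection))
print(derived_from(item_with_link, collection))
```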

I probably won't be able to move this forward much in the next two months, but maybe we can at least discuss the initial direction and get into the details in July/August or so. I'll have a lot of time then to work on this.

cholmes commented 6 years ago

Progress on this with https://github.com/radiantearth/stac-spec/pull/116 but I think we still need a bit more work, so moving to 0.6.0 and keeping open.

m-mohr commented 5 years ago

I think this is solved or better discussed in other issues (#194, #174, #111, ...). I'd close it as solved or duplicate.