radiantearth / stam-spec

SpatioTemporal Asset Metadata specification - defining core metadata fields for searching imagery & other geo assets
Apache License 2.0

Modified spec proposed #18

Open matthewhanson opened 7 years ago

matthewhanson commented 7 years ago

@cholmes @danlopez00 After some thought and review on this, I'd like to propose a conceptual change as well as a modified spec.

It looks like the intention of the spec currently is to provide metadata per image file. I think a better way is to adopt the concept of a 'scene' (or 'granule'). Some problems with doing it per file:

In the cases above, if metadata were provided per file there would be a large amount of duplicated metadata: most fields are the same except the URL and some band/file identifier.

If each file had its own metadata it would also make things more difficult for client applications. Typically individual files or bands within a single scene are used together. If my app retrieves metadata for 700 scenes with 10 files each then it's got 7000 pieces of metadata to associate together in order to determine which RED band goes with which NIR band, or which cloud mask goes with which RGB file. My app would be much happier if it had the metadata for a scene and it contained links to all the related bands/metadata/masks for it.
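To make the client-side burden concrete, here is a minimal sketch of the regrouping step an app would need if metadata arrived per file. It assumes a hypothetical per-file record with `scene_id`, `band`, and `url` keys; none of these names come from the current spec.

```python
from collections import defaultdict


def group_files_by_scene(file_records):
    """Collapse per-file records into one {band: url} dict per scene."""
    scenes = defaultdict(dict)
    for rec in file_records:
        scenes[rec["scene_id"]][rec["band"]] = rec["url"]
    return dict(scenes)


# Hypothetical per-file metadata (placeholder IDs and URLs).
file_records = [
    {"scene_id": "scene-a", "band": "B4", "url": "http://example.com/scene-a_B4.TIF"},
    {"scene_id": "scene-a", "band": "B5", "url": "http://example.com/scene-a_B5.TIF"},
    {"scene_id": "scene-b", "band": "B4", "url": "http://example.com/scene-b_B4.TIF"},
]

scenes = group_files_by_scene(file_records)
```

With scene-level metadata this step disappears, because each record already carries the band-to-URL mapping.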

Proposed spec

| element | type | name | description |
| --- | --- | --- | --- |
| sid | string | File unique scene ID | |
| projection | string | Projection | CRS of the datasource in full WKT format |
| footprint | string | Datasource footprint | WKT format, describing the actual footprint of the imagery in its native format |
| bbox | array | Bounding Box | Pair of min and max lon/lat coordinates (min_lon, min_lat, max_lon, max_lat) |
| date | string | Date | The nominal date of the scene, could be center time, or a single day within a window from which data is collected |
| start_date | string | Acquisition Start Date | First date of acquisition in UTC (Combined date and time representation) |
| end_date | string | Acquisition End Date | Last date of acquisition in UTC (Combined date and time representation) (optional) |
| platform | string | Unique name of platform | Specific name of the platform (e.g., landsat-8, sentinel-2A) |
| sensor | string | Sensor used | Name of sensor (e.g., MODIS, ASTER, OLI) |
| provider | string | Imagery Provider | Provider/owner/maintainer of the data |
| contact | string | Contact | Contact information of data provider (email or website) |
| license | string | Data license | Data license name, must be one of the accepted licenses (TBD) |
| version | semantic versioning number | Spec Version | The version of the imagery metadata fields, for testing/validation |
| links | dict | Download links | Dictionary of data sources (distribution endpoints), each of which is a dictionary of URLs |
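As a rough illustration of how a catalog might check records against the fields above, here is a minimal sketch. It assumes every field except `end_date` is required (the table marks only `end_date` as optional); the function name `validate_scene` and the exact checks are illustrative, not part of the proposal.

```python
# Assumption: all proposed fields except end_date are required.
REQUIRED_FIELDS = (
    "sid", "projection", "footprint", "bbox", "date", "start_date",
    "platform", "sensor", "provider", "contact", "license", "version", "links",
)


def validate_scene(record):
    """Return a list of problems found in a scene record (empty if none)."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in record]

    bbox = record.get("bbox")
    if bbox is not None and len(bbox) != 4:
        problems.append("bbox must have 4 values (min_lon, min_lat, max_lon, max_lat)")

    links = record.get("links")
    if links is not None and not (
        isinstance(links, dict) and all(isinstance(v, dict) for v in links.values())
    ):
        problems.append("links must be a dictionary of dictionaries")

    return problems
```

A record that passes returns an empty list; anything else returns one message per problem, which keeps the check usable in tests and crawlers.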

Specific Notes

sid: I replaced uuid with sid (scene ID). UUID was the URL to the dataset (handled by links in this schema). I've not yet found an image dataset that does not have some unique name describing a single collected scene (containing multiple bands).

title: I removed this as I'm not sure what the title would be, or if it would be used in the vast majority of cases. I can see a human-friendly title being automatically generated from other fields, in which case it need not be its own field.

properties: I removed this as I'm not sure what it would contain. The spec should contain the minimum expected set, but any data source may provide more fields at the top level if desired; they need not go under a specific properties key.

gsd: I removed this because if the metadata spec is per granule or scene rather than per file then there is no guarantee that the resolution is the same in all images (e.g., Sentinel-2)

platform: More useful than a generic name here is a specific name referring to the unique ID of the platform (e.g., landsat-8, aqua, terra)

sensor: Added sensor, combined with platform this gives where the data originated from. Otherwise it would have to be determined from the scene ID somehow.

provider: If metadata can possibly contain multiple sources of data

links: An example here would be helpful since it's a dictionary of dictionaries. For each scene there could be multiple sources of that data. There could be mirrors, different formats of data through different providers... The example below illustrates what we see with landsat-8 data, which is available from USGS as a single tar file, and through AWS or Google Earth Engine as individual files. Each file has a specific key associated with it, which is a set unique to each sensor and it's up to the users to know what files are available for what sensor (will post separate issue on a "sensor metadata spec").

thumbnails: I've not shown thumbnail as a top level field above because it really belongs in the links section. There may not be a thumbnail available, or there might be multiple thumbnails (e.g., MODIS product MOD09GQ, ASTER).

{
  "links": {
    "usgs": {
      "thumb": "http://earthexplorer.usgs.gov/browse/landsat_8/2016/007/029/LC08_L1TP_007029_20160827_20170321_01_T1.jpg",
      "ALL": "https://earthexplorer.usgs.gov/download/12864/LC80070292016240LGN01/STANDARD/EE"
    },
    "aws_s3": {
      "index": "http://landsat-pds.s3.amazonaws.com/L8/007/029/LC80070292016240LGN01/index.html",
      "thumb": "http://landsat-pds.s3.amazonaws.com/L8/007/029/LC80070292016240LGN01/LC80070292016240LGN01_thumb_large.jpg",
      "ANG": "http://landsat-pds.s3.amazonaws.com/L8/007/029/LC80070292016240LGN01/LC80070292016240LGN01_ANG.txt",
      "B1": "http://landsat-pds.s3.amazonaws.com/L8/007/029/LC80070292016240LGN01/LC80070292016240LGN01_B1.TIF",
      "B2": "http://landsat-pds.s3.amazonaws.com/L8/007/029/LC80070292016240LGN01/LC80070292016240LGN01_B2.TIF",
      "B3": "http://landsat-pds.s3.amazonaws.com/L8/007/029/LC80070292016240LGN01/LC80070292016240LGN01_B3.TIF",
      "B4": "http://landsat-pds.s3.amazonaws.com/L8/007/029/LC80070292016240LGN01/LC80070292016240LGN01_B4.TIF",
      "B5": "http://landsat-pds.s3.amazonaws.com/L8/007/029/LC80070292016240LGN01/LC80070292016240LGN01_B5.TIF",
      "B6": "http://landsat-pds.s3.amazonaws.com/L8/007/029/LC80070292016240LGN01/LC80070292016240LGN01_B6.TIF",
      "B7": "http://landsat-pds.s3.amazonaws.com/L8/007/029/LC80070292016240LGN01/LC80070292016240LGN01_B7.TIF",
      "B8": "http://landsat-pds.s3.amazonaws.com/L8/007/029/LC80070292016240LGN01/LC80070292016240LGN01_B8.TIF",
      "B9": "http://landsat-pds.s3.amazonaws.com/L8/007/029/LC80070292016240LGN01/LC80070292016240LGN01_B9.TIF",
      "B10": "http://landsat-pds.s3.amazonaws.com/L8/007/029/LC80070292016240LGN01/LC80070292016240LGN01_B10.TIF",
      "B11": "http://landsat-pds.s3.amazonaws.com/L8/007/029/LC80070292016240LGN01/LC80070292016240LGN01_B11.TIF",
      "BQA": "http://landsat-pds.s3.amazonaws.com/L8/007/029/LC80070292016240LGN01/LC80070292016240LGN01_BQA.TIF",
      "MTL": "http://landsat-pds.s3.amazonaws.com/L8/007/029/LC80070292016240LGN01/LC80070292016240LGN01_MTL.txt"
    },
    "google": {
      "index": "https://console.cloud.google.com/storage/browser/gcp-public-data-landsat/LC08/007/029/LC08_L1TP_007029_20160827_20170321_01_T1",
      "ANG": "https://storage.cloud.google.com/gcp-public-data-landsat/LC08/01/007/029/LC08_L1TP_007029_20160827_20170321_01_T1/LC08_L1TP_007029_20160827_20170321_01_T1_ANG.txt",
      "B1": "https://storage.cloud.google.com/gcp-public-data-landsat/LC08/01/007/029/LC08_L1TP_007029_20160827_20170321_01_T1/LC08_L1TP_007029_20160827_20170321_01_T1_B1.TIF",
      "B2": "https://storage.cloud.google.com/gcp-public-data-landsat/LC08/01/007/029/LC08_L1TP_007029_20160827_20170321_01_T1/LC08_L1TP_007029_20160827_20170321_01_T1_B2.TIF",
      "B3": "https://storage.cloud.google.com/gcp-public-data-landsat/LC08/01/007/029/LC08_L1TP_007029_20160827_20170321_01_T1/LC08_L1TP_007029_20160827_20170321_01_T1_B3.TIF",
      "B4": "https://storage.cloud.google.com/gcp-public-data-landsat/LC08/01/007/029/LC08_L1TP_007029_20160827_20170321_01_T1/LC08_L1TP_007029_20160827_20170321_01_T1_B4.TIF",
      "B5": "https://storage.cloud.google.com/gcp-public-data-landsat/LC08/01/007/029/LC08_L1TP_007029_20160827_20170321_01_T1/LC08_L1TP_007029_20160827_20170321_01_T1_B5.TIF",
      "B6": "https://storage.cloud.google.com/gcp-public-data-landsat/LC08/01/007/029/LC08_L1TP_007029_20160827_20170321_01_T1/LC08_L1TP_007029_20160827_20170321_01_T1_B6.TIF",
      "B7": "https://storage.cloud.google.com/gcp-public-data-landsat/LC08/01/007/029/LC08_L1TP_007029_20160827_20170321_01_T1/LC08_L1TP_007029_20160827_20170321_01_T1_B7.TIF",
      "B8": "https://storage.cloud.google.com/gcp-public-data-landsat/LC08/01/007/029/LC08_L1TP_007029_20160827_20170321_01_T1/LC08_L1TP_007029_20160827_20170321_01_T1_B8.TIF",
      "B9": "https://storage.cloud.google.com/gcp-public-data-landsat/LC08/01/007/029/LC08_L1TP_007029_20160827_20170321_01_T1/LC08_L1TP_007029_20160827_20170321_01_T1_B9.TIF",
      "B10": "https://storage.cloud.google.com/gcp-public-data-landsat/LC08/01/007/029/LC08_L1TP_007029_20160827_20170321_01_T1/LC08_L1TP_007029_20160827_20170321_01_T1_B10.TIF",
      "B11": "https://storage.cloud.google.com/gcp-public-data-landsat/LC08/01/007/029/LC08_L1TP_007029_20160827_20170321_01_T1/LC08_L1TP_007029_20160827_20170321_01_T1_B11.TIF",
      "BQA": "https://storage.cloud.google.com/gcp-public-data-landsat/LC08/01/007/029/LC08_L1TP_007029_20160827_20170321_01_T1/LC08_L1TP_007029_20160827_20170321_01_T1_BQA.TIF",
      "MTL": "https://storage.cloud.google.com/gcp-public-data-landsat/LC08/01/007/029/LC08_L1TP_007029_20160827_20170321_01_T1/LC08_L1TP_007029_20160827_20170321_01_T1_MTL.txt"
    }
  }
}
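For a sense of how a client might consume the keyed links, here is a minimal sketch that returns the first provider offering a given file key. The helper name `find_asset` and its `preferred` parameter are assumptions for illustration; the sample dictionary is a trimmed copy of the example above.

```python
def find_asset(links, key, preferred=()):
    """Return (provider, url) for the first provider that offers `key`, else None."""
    # Try preferred providers first, then the rest in their listed order.
    providers = list(preferred) + [p for p in links if p not in preferred]
    for provider in providers:
        url = links.get(provider, {}).get(key)
        if url:
            return provider, url
    return None


# Trimmed copy of the links example above.
links = {
    "usgs": {
        "ALL": "https://earthexplorer.usgs.gov/download/12864/LC80070292016240LGN01/STANDARD/EE",
    },
    "aws_s3": {
        "B4": "http://landsat-pds.s3.amazonaws.com/L8/007/029/LC80070292016240LGN01/LC80070292016240LGN01_B4.TIF",
    },
}

red_band = find_asset(links, "B4", preferred=("usgs",))
```

Here `usgs` only offers the full archive under `ALL`, so the lookup for `B4` falls through to `aws_s3`.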

Another nice thing about the keyed links is it provides an easy path to provide metadata for derived products. Let's say I produce an NDVI product for Landsat and distribute it. I can simply add it as a new key under a new provider in the links field with the "ndvi" key. I don't have to treat it as a new Scene entirely, which it's not.

cc @ianschuler @scisco @drewbo

cholmes commented 6 years ago

Sorry for the slow response on this. I think I saw it before but then totally forgot about responding. In general I like it a lot. I've been thinking about a sort of 'level 0' of the Imagery Catalog API - something that can be crawled and doesn't rely on a server. But ideally the json metadata records are basically the same.

The links make sense to me, with one hesitation: I don't think I'd add an NDVI product as an additional 'link', as I fear that can just get out of hand with links. We've already had this problem at Planet, with way too many 'asset types'. I'd see an NDVI derived product as a separate 'record' that would have a clear link back to the 'source'. I do think we should have some field that enables that linking back to the source record.
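One way the link back to the source could look, as a minimal sketch: the derived record carries a `source` key holding the `sid` of the record it was computed from. The `source` key, the derived `sid`, and the `resolve_source` helper are all hypothetical; nothing here has been agreed in the spec.

```python
def resolve_source(record, catalog):
    """Follow the hypothetical `source` key back to the original scene record."""
    source_id = record.get("source")
    if source_id is None:
        return None  # not a derived record
    return catalog.get(source_id)


# Catalog keyed by sid; the NDVI record and its sid are made up for illustration.
catalog = {
    "LC80070292016240LGN01": {"sid": "LC80070292016240LGN01", "platform": "landsat-8"},
    "LC80070292016240LGN01_ndvi": {
        "sid": "LC80070292016240LGN01_ndvi",
        "source": "LC80070292016240LGN01",
    },
}

parent = resolve_source(catalog["LC80070292016240LGN01_ndvi"], catalog)
```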

I'm also tempted to even hold off on lots of duplicate download links to the same thing. Like I'm thinking Google and Amazon should have their own catalogs, that specify that they are duplicated records and download links of the 'source' data.

I do think the catalogs should be flexible enough for lots of links, but also have mechanisms to represent a 'cached' record when it sits on another cloud or local storage.

matthewhanson commented 6 years ago

Those are good points @cholmes. Additional products really deserve to be separate, so I think I'd add a 'product' tag to the above, e.g., surface reflectance, toa, DN, band indices... these are all different products where the sensor and platform could be the same.

I would prefer to not have duplicated links, but I could see some data that has mirrors and it would be useful to be able to specify that. In the case of an actual mirror the folder structure would be the same, so maybe all we'd need is a list of 'endpoints', containing the base urls or any mirrors. I could go either way on this; it will be good to discuss with a larger group next week.
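To sketch what the 'endpoints' idea might look like in practice: if mirrors share a folder structure, a record could store relative file paths once and a list of base URLs, and clients could expand them on demand. The `files` mapping, the function name, and the example.com URLs are assumptions for illustration only.

```python
def expand_endpoints(endpoints, files):
    """Build {base_url: {key: full_url}} from mirror base URLs and relative paths."""
    return {
        base: {
            key: base.rstrip("/") + "/" + path.lstrip("/")
            for key, path in files.items()
        }
        for base in endpoints
    }


# Hypothetical mirrors that share the same folder structure.
endpoints = ["http://mirror-a.example.com/L8/", "http://mirror-b.example.com/L8"]
files = {
    "B4": "007/029/scene/scene_B4.TIF",
    "MTL": "007/029/scene/scene_MTL.txt",
}

mirror_links = expand_endpoints(endpoints, files)
```

This keeps each record small while still letting a client pick whichever mirror it prefers.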

cholmes commented 6 years ago

Yeah, I think there's always going to be lots of cases that break the core assumptions. I think we should aim for a core spec that handles 80% of the (non-derived) data out there, and then figure out the right extension mechanisms that let others add on. And key will be to figure out that extension mechanism - if we need to have multiple versions of the same record, or if we can have a core base with flexible additions.

But yes, will be really great to have the group discuss next week.