stac-utils / stactools

Command line utility and Python library for STAC
https://stactools.readthedocs.io/

Implement templating for easier modifications of metadata like descriptions #68

Open lossyrob opened 3 years ago

lossyrob commented 3 years ago

In https://github.com/radiantearth/stac-spec/issues/986, @schwehr outlines a technique to use jsonnet to implement templating, which would allow users who are not necessarily Python devs to be able to effectively edit metadata that is then used to generate STAC Catalogs, Collections and Items.

I think this would be an effective technique to use in stactools. If we had jsonnet templates for the collections and items that were used to generate those objects, then users could make pull requests against stactools to update those templates in case there are any metadata errors or additions. There could be a core function that would take an object, say a Collection, and a template, and then update the collection based on any template values.
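A minimal Python sketch of what such a core function might look like. The function name and the merge semantics (recursive for nested dicts, replace otherwise) are assumptions for illustration, not part of stactools:

```python
# Hypothetical "core function": take a STAC object (as a dict) and a
# template (also a dict), and update the object with any values the
# template provides. Nested dicts merge recursively; other values replace.
from copy import deepcopy


def apply_template(stac_object: dict, template: dict) -> dict:
    """Return a copy of stac_object with template values merged in."""
    result = deepcopy(stac_object)
    for key, value in template.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = apply_template(result[key], value)
        else:
            result[key] = value
    return result


collection = {
    "id": "naip",
    "description": "old text",
    "extent": {
        "spatial": {"bbox": [[0, 0, 1, 1]]},
        "temporal": {"interval": [[None, None]]},
    },
}
template = {
    "description": "National Agriculture Imagery Program",
    "extent": {"spatial": {"bbox": [[-180, -90, 180, 90]]}},
}
updated = apply_template(collection, template)
```

Because the merge is recursive, the template's corrected `spatial` bbox lands without clobbering the collection's existing `temporal` interval.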

For example, as someone who maintains a STAC API, I could:

schwehr commented 3 years ago

@lossyrob,

I'm not sure about the STAC API part. Do you mean jsonnet being returned by the STAC API? I've not looked at the STAC API yet, so that might be weird.

What we have now with Earth Engine:

email -> edit yaml/proto inside google
    -> triggers a brittle script that brute force writes STAC json -> copy that to the bucket
    -> triggers an update to https://developers.google.com/earth-engine/datasets

I believe that's basically what I was thinking. I was mostly thinking in the context of:

User does PR -> github repo containing jsonnet STAC -> triggers job to run jsonnet -> write STAC json to a bucket

Some of the use cases and concepts I was thinking about:

The functions I made for anything that has a global extent:

  extent_global(start, end):: {
    spatial: { bbox: [[-180, -90, 180, 90]] },
    temporal: { interval: [[start, end]] },
  },

That means that a collection or item would only have to put something like this in the .jsonnet:

  stac_lib.extent_global('1992-01-01T00:00:00Z', null),
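For reference, the extent object that call renders to can be sketched in Python (the helper mirrors the jsonnet above; field names follow the STAC Collection extent object):

```python
# Python equivalent of the jsonnet extent_global helper, showing the
# JSON it renders to. jsonnet's null maps to Python's None, i.e. an
# open-ended temporal interval.
def extent_global(start, end):
    return {
        "spatial": {"bbox": [[-180, -90, 180, 90]]},
        "temporal": {"interval": [[start, end]]},
    }


extent = extent_global("1992-01-01T00:00:00Z", None)
```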

Trouble spots:

Some of these items boil down to this: projects need to do careful code reviews on new jsonnet, just like they would for Python, JavaScript, or any other full language. There is no networking, and no ability to read arbitrarily named files outside of those passed in.

lossyrob commented 3 years ago

Interesting, thanks for all this context and ideation.

I like the idea that we (meaning the folks managing public open STAC datasets, as well as anyone else who'd like to use them) could all pull from a repo that is full of the common information about datasets. That way we can collaborate on making sure all the collection-level data we are capturing is complete and accurate.

It would be pretty simple to set up a repo with templates for some of the collections that AI for Earth will be hosting in our STAC API soon. It would actually be nice to seed that with the collection information that GEE has put together. The pipeline of updating those collection templates to updating our STAC API information would be manual at first, but could eventually be automated. The STAC API part of it is just updating the collections in the database to be served through the API, so the pipeline would go one step further:

User does PR -> github repo containing jsonnet STAC -> triggers job to run jsonnet -> write STAC json to a bucket -> updates the Collection in the STAC API DB.

I didn't realize jsonnet was as powerful as it is - is this overkill? Have you seen situations where it's a lot better than, say, a markdown file or YAML that can be edited by anyone, then parsed and templated into JSON via Jinja2 or something similar? That sounds like what you have now with your reference to a brittle script, and it's good to know that eventually breaks. Though even for the example of a global extent, there's a bit of requisite knowledge: how to import a library of functions like that into jsonnet and then call it in the new syntax. There's a cognitive burden to introducing more languages for folks to learn, and I want to be cognizant of that, since the folks I think we'd want to attract to write content for the templates and review for correctness might not want to learn a new syntax.

That said, you've done a lot of digging into this and I trust you've considered this from many angles - if you think jsonnet is the best way forward, I'd be very willing to try it out. That'd start with creating a repo of jsonnet templates (probably in this github org) and like I mentioned seeding it with publicly available collection information for datasets we're interested in (the current next-month-horizon list is NAIP, Landsat 8 C2, Sentinel 2, and ASTER).

For some of the trouble spots - I think if the process was based on a PR-review workflow that systems could hook into to get notifications of new updates, with a CI that ensured you could review the rendered product (even potentially with a Netlify PR preview, which would be nice), that would mitigate some of the issues around attacks. If folks wanted an extra layer, they could keep their own internal repo and review upstream changes themselves before kicking off their own processes.

> If we all go in on compatible open source licenses

This is a good point - how is the STAC metadata in GEE currently licensed? If we had a single repo with all the metadata and clear licensing that we all used, that would be ideal for sure.

schwehr commented 3 years ago

@lossyrob, lots of good material there. Some follow-on thoughts (that don't address everything)...

> pipeline would go one step further

Anything we do for the Earth Engine is going to require human review by a Google employee before triggering any update. We do automate as much of the process as possible. If there is a common library of jsonnet things, we would be motivated to contribute to it and hopefully use it.

> I didn't realize jsonnet was as powerful as it was - is this overkill?

It is a lot, and I'm trying to keep the fraction of the (Turing complete) language we use to a minimum. One thing it can't do is read files other than libsonnet files, e.g. it can't do a glob to list the files in a particular directory. Anyone feeling overloaded can actually stick to the original JSON and just rename the file. Then later, people (or better yet, scripts) can clean it up. There are lots of templating languages, and I think it's partly a subjective choice which one to use. Most templating languages have specific strengths (well, some are just too weak to add value). I like jsonnet because it is a superset of JSON; most of the other templating systems know nothing of JSON. While it's yet another dependency, it is fast and pretty compact. And it's got the weight of heavy use: Using Jsonnet With Kubernetes

> if you think jsonnet is the best way forward, I'd be very willing to try it out

It would be awesome to have others give it a go. If anyone tries it and doesn't like it, reverting to just the JSON is just running the forward transform.

> Netlify

That's a new one to me.

> how is the STAC metadata in GEE currently licensed?

There is currently no license on the metadata contained in the GEE STAC catalog and there is no copyright assertion in the JSON files.

As for the jsonnet code around that, I am pretty sure that Google (via me) will be releasing the jsonnet code around that metadata as Apache 2.0. All of the code snippets that I shared to date should have the copyright header and the SPDX Apache 2.0 tag, so folks should already be able to use the prototype examples I've shared (at least from the open source license point-of-view).

lossyrob commented 3 years ago

@schwehr on a tactical note, I tried to spin up a jsonnet for a collection I'm templating. tbh I found the documentation a bit hard to parse, so maybe this is easy and I just couldn't find a way to do it - I really want the "description" field for the collection to be a markdown file that can be edited easily, and then imported into the template as part of the render process. Do you know how to do that in jsonnet?

schwehr commented 3 years ago

I got some help getting started, so hopefully I can pass it along. Ask away so we can capture some of the parts that are confusing at startup.

Separate markdown files are not an option, as jsonnet can't bring them in. The best I can offer is the ||| accordion operator. With that, it becomes pretty easy to move the markdown in and out of the description to a markdown-aware editor. In a couple minutes of searching (tools), I didn't find any editors that know about markdown inside of jsonnet; e.g. heptio's jsonnet extension for Visual Studio Code doesn't seem to do much for |||.

I have a similar issue with things that have large CSV tables in separate files that need to be expanded into the resulting STAC json somehow.

A quick example of markdown in |||:

```jsonnet
{
  description: |||
    [GOES](https://www.goes.noaa.gov) satellites are geostationary weather satellites run by NOAA.

    The Fire (HSC) product contains four images: one in the form
    of a fire mask and the other three with pixel values identifying fire temperature, fire area,
    and fire radiative power.

    The ABI L2+ FHS metadata mask assigns a flag to every earth-navigated pixel that indicates its
    disposition with respect to the FHS algorithm. Operational users who have the lowest tolerance
    for false alarms should focus on the "processed" and "saturated" categories (mask codes 10, 11,
    30, and 31), but within these categories there can still be false alarms.

    [README](https://www.ncdc.noaa.gov/data-access/satellite-data/goes-r-series-satellites#FDC)

    NOAA provides the following scripts for suggested categories,
    color maps, and visualizations:

     - [GOES-16-17_FireDetection.js](https://github.com/google/earthengine-community/blob/master/datasets/scripts/GOES-16-17_FireDetection.js)
     - [GOES-16-17_FireReclassification.js](https://github.com/google/earthengine-community/blob/master/datasets/scripts/GOES-16-17_FireReclassification.js)
  |||,
}
```

Then running jsonnet demo.jsonnet gives demo.json:

```json
{
   "description": "[GOES](https://www.goes.noaa.gov) satellites are geostationary weather satellites run by NOAA.\n\nThe Fire (HSC) product contains four images: one in the form\nof a fire mask and the other three with pixel values identifying fire temperature, fire area,\nand fire radiative power.\n\nThe ABI L2+ FHS metadata mask assigns a flag to every earth-navigated pixel that indicates its\ndisposition with respect to the FHS algorithm. Operational users who have the lowest tolerance\nfor false alarms should focus on the \"processed\" and \"saturated\" categories (mask codes 10, 11,\n30, and 31), but within these categories there can still be false alarms.\n\n[README](https://www.ncdc.noaa.gov/data-access/satellite-data/goes-r-series-satellites#FDC)\n\nNOAA provides the following scripts for suggested categories,\ncolor maps, and visualizations:\n\n - [GOES-16-17_FireDetection.js](https://github.com/google/earthengine-community/blob/master/datasets/scripts/GOES-16-17_FireDetection.js)\n - [GOES-16-17_FireReclassification.js](https://github.com/google/earthengine-community/blob/master/datasets/scripts/GOES-16-17_FireReclassification.js)\n"
}
```
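To make that editing loop concrete, here's a small hypothetical Python helper (the function name and indentation choices are mine, not part of any prototype) that wraps the contents of a standalone markdown file in a ||| text block, ready to be spliced after a `description:` key in a .jsonnet file:

```python
import textwrap

# Hypothetical helper: turn markdown text into a jsonnet ||| text block.
# This only generates jsonnet source text; no jsonnet tooling is involved.


def markdown_to_jsonnet_block(markdown: str, indent: str = "    ") -> str:
    """Indent markdown and wrap it in |||...||| for a .jsonnet file.

    The closing ||| must be less indented than the block body, which is
    why it gets two spaces while the body gets four.
    """
    body = textwrap.indent(markdown.rstrip("\n") + "\n", indent)
    return "|||\n" + body + "  |||"


block = markdown_to_jsonnet_block("# GOES\n\nGeostationary weather satellites.")
```

The reverse direction (pulling the block back out for a markdown-aware editor) is just de-indenting the lines between the two ||| markers.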

I've been working on code to convert parts of the existing EE STAC JSON into jsonnet functions. However, it's super specific stuff, e.g.:

```python
import re

LINK_SELF_RE = re.compile(
    r"""{\s+rel:\s*'self',\s*href:\s*'https://[^']+',\s+},""", re.M)


def LinkSelf(src: str) -> str:
  link_self = """{ rel: 'self', href: self_url },"""
  result = LINK_SELF_RE.sub(link_self, src, 1)
  return result
```

This bit of Python finds the self link and cleans it up to be:

    { rel: 'self', href: self_url },

Where self_url is defined at the top of the file:

```jsonnet
local id = 'NOAA/GOES/16/FDCC';

local ee_const = import 'earthengine_const.libsonnet';
local ee = import 'earthengine.libsonnet';

local basename = std.strReplace(id, '/', '_');
local base_filename = basename + '.json';
local self_ee_catalog_url = ee_const.ee_catalog_url + basename;
local self_url = ee_const.catalog_base + base_filename;
local sample_url = ee_const.sample_url(basename);
```

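To see the conversion in action, here's a self-contained rerun of the LinkSelf helper above on a hand-written input snippet (the URL is made up for illustration):

```python
import re

# Reproducing the regex and function from above so this runs standalone.
LINK_SELF_RE = re.compile(
    r"""{\s+rel:\s*'self',\s*href:\s*'https://[^']+',\s+},""", re.M)


def LinkSelf(src: str) -> str:
    link_self = """{ rel: 'self', href: self_url },"""
    return LINK_SELF_RE.sub(link_self, src, 1)


# Hypothetical fragment of brute-force-converted JSON-as-jsonnet.
src = """links: [
  {
    rel: 'self',
    href: 'https://example.com/catalog/NOAA_GOES_16_FDCC.json',
  },
]"""
out = LinkSelf(src)
# The hardcoded URL is replaced with a reference to the self_url local.
```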
schwehr commented 3 years ago

Related to this issue of templating is how to lay out STAC catalogs. For something like jsonnet, the structure determines how hard it is to find the required libsonnet files that might be sprinkled through the tree.

I have two prototypes described here:

https://twitter.com/kurtschwehr/status/1371896201063243777 points to this google doc: https://tinyurl.com/ee-stac-layout

Things like an all.json that links to all leaf nodes (for collections small enough where that is manageable) are not going to be doable with jsonnet. But it is fairly easy to have something else that can read arbitrary files and build it from the resulting JSON files produced by running jsonnet.
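A sketch of that "something else", assuming the rendered JSON files all sit in one directory; the directory layout, link shape, and function name are illustrative, not part of either prototype:

```python
import json
from pathlib import Path


def build_all_json(rendered_dir: str) -> dict:
    """Collect every rendered STAC JSON file into an all.json-style
    document of child links. jsonnet can't glob a directory, but a
    post-processing script like this can.
    """
    links = []
    for path in sorted(Path(rendered_dir).glob("*.json")):
        if path.name == "all.json":
            continue  # don't link all.json to itself
        doc = json.loads(path.read_text())
        links.append({"rel": "child", "href": path.name, "title": doc.get("id")})
    return {"links": links}
```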

We have total flexibility, as STAC links are just URLs. Things could span multiple buckets or cloud hosting services (S3, GCS, Azure Blob Storage, etc.).

People could go nuts and make a CAS (content-addressable store), but I don't think folks are going to appreciate a structure like that, full of hashes.

gadomski commented 2 years ago

Sort-of-related, I've implemented less-than-templating for the modis package, which I called fragments. I was noticing that almost every stactools package had a constants.py containing a lot of static metadata. For MODIS, since there are so many products to support, the constants.py was getting very unwieldy. By breaking the constant information out into fragments, contributors can populate collection and item metadata in exactly the format it would take in the STAC objects, as JSON values.
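An illustrative sketch of the fragments idea. The file layout and function names here are assumptions for the sake of the example, not the actual stactools-modis implementation:

```python
import json
from pathlib import Path

# Static metadata lives in JSON files shaped exactly like the STAC
# objects; building a collection is a lookup plus a shallow merge,
# so no logic creeps into the "constants".


def load_fragment(fragments_dir: str, product: str) -> dict:
    """Read the static collection metadata fragment for one product."""
    path = Path(fragments_dir) / product / "collection.json"
    return json.loads(path.read_text())


def make_collection(fragments_dir: str, product: str, **dynamic) -> dict:
    """Combine the static fragment with dynamically computed fields."""
    collection = load_fragment(fragments_dir, product)
    collection.update(dynamic)
    return collection
```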

I'd be curious what folks think about formalizing the "fragments" concept into a stactools-supported setup. It's very intentionally less-than-templating: I was trying to ensure that logic didn't creep into the static metadata (keep constants constant).