radiantearth / stac-api-spec

SpatioTemporal Asset Catalog API specification - an API to make geospatial assets openly searchable and crawlable
http://stacspec.org
Apache License 2.0

Periodicity Extension #34

Closed wildintellect closed 3 years ago

wildintellect commented 4 years ago

Many end users of STAC endpoints have requested the ability to query STAC records for a recurring interval. A simple example, often called seasonality, is querying for records from Dec 21 to Mar 21 in each of the years 2016 to 2019.

This ticket is to discuss the options and decide on a path forward for an API extension definition. Options developed with input from @sharkinsspatial @alukach @bitner

Options

There are many possible ways to implement this. A few considered options:

1. Extending time to be an array of times

    time = ["2016-12-21T00:00:00/2017-03-21T23:59:59.999"]

    time = ["2016-12-21T00:00:00/2017-03-21T23:59:59.999", "2017-12-21T00:00:00/2018-03-21T23:59:59.999", "2018-12-21T00:00:00/2019-03-21T23:59:59.999"]


Pros:
* When a single value is passed it behaves identically to time as currently implemented.
* Allows any sequence of time ranges, even irregular intervals.
* Does not require the creation of a new parameter that would be mutually exclusive with time (see examples below; if a new parameter is used, time would need to be ignored).

Cons:
* Would change an existing part of the spec that's well implemented.
* Allows overlapping time ranges.
* Parameter could be very large.

2. Adding a new query parameter with an array of times (This has been tested as a quick patch to a private fork of [sat-api](https://github.com/sat-utils/sat-api))

    periodrange = ["2016-12-21T00:00:00/2017-03-21T23:59:59.999"]

    periodrange = ["2016-12-21T00:00:00/2017-03-21T23:59:59.999", "2017-12-21T00:00:00/2018-03-21T23:59:59.999", "2018-12-21T00:00:00/2019-03-21T23:59:59.999"]


Pros:
* When a single value is passed it behaves identically to time as currently implemented.
* Allows any sequence of time ranges, even irregular intervals.

Cons:
* Allows overlapping time ranges.
* Parameter could be very large.

3. Adding a new query parameter as a time range + a repeat interval
Repeat interval has many possible ways of being expressed:
    1. ISO Frequency [notation](https://www.loc.gov/standards/datetime/iso-tc154-wg5_n0039_iso_wd_8601-2_2016-02-16.pdf)(pg 10) `periodrange = "R3/2016-12-21T00:00:00/2019-03-21T23:59:59.999/FREQ=YR;INTR=1"`
    2. ISO DateTime [recurring time interval](https://en.wikipedia.org/wiki/ISO_8601#Repeating_intervals) (using it in an unintended, non-ISO way) `periodrange = "R3/2016-12-21T00:00:00/2019-03-21T23:59:59.999/P1Y"`
      * Schema.org [repeatFrequency](https://schema.org/repeatFrequency) does exactly this.
      * As does [cylc](https://github.com/cylc/cylc-flow/wiki/ISO-8601)
    3. [CRON](https://en.wikipedia.org/wiki/Cron#Overview) (Does not typically have years)

The best of these seems to be the Frequency notation; however, I can't find any existing uses of this syntax or any libraries that parse it.

Repeat every year, 3 times, within the duration:

    periodrange = "R3/2016-12-21T00:00:00/2019-03-21T23:59:59.999/FREQ=YR;INTR=1"

Another example, showing intervals of less than 1 year.

Repeat every 6 months, 3 times, within the duration:

    periodrange = "R3/2016-12-21T00:00:00/2019-03-21T23:59:59.999/FREQ=MO;INTR=6"


Pros:
* Parameter is a fixed size: 1 item.
* Prevents overlapping time ranges.

Cons:
* Only allows regular intervals.
* Requires more complex handling to parse durations into an array of time intervals.
* Requiring the calculation of the number of repeats seems redundant if you have the duration and repeat frequency.
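
To give a feel for the extra handling option 3 implies, here is a minimal sketch of expanding a repeat rule into the explicit array of ranges used in options 1 and 2. It makes assumptions that the notation above does not settle: the first occurrence (Dec 21 to Mar 21) is supplied explicitly, and every later occurrence is that window shifted by a whole number of months. The names `add_months` and `expand_period` are hypothetical, and the sketch does not parse the `FREQ=...;INTR=...` string itself.

```python
import calendar
from datetime import datetime


def add_months(dt, months):
    """Shift dt by a number of calendar months, clamping the day if needed."""
    month_index = dt.month - 1 + months
    year = dt.year + month_index // 12
    month = month_index % 12 + 1
    # Clamp, for example, Jan 31 plus one month to the last day of February.
    day = min(dt.day, calendar.monthrange(year, month)[1])
    return dt.replace(year=year, month=month, day=day)


def expand_period(first_start, first_end, repeats, step_months=12):
    """Return `repeats` (start, end) pairs, each shifted by step_months."""
    return [
        (add_months(first_start, i * step_months),
         add_months(first_end, i * step_months))
        for i in range(repeats)
    ]


# Assumed input: the first season, repeated yearly, 3 times.
intervals = expand_period(
    datetime(2016, 12, 21),
    datetime(2017, 3, 21, 23, 59, 59, 999000),
    repeats=3,
)
for start, end in intervals:
    print(f"{start.isoformat()}/{end.isoformat(timespec='milliseconds')}")
```

With these inputs the printed ranges are the same three `start/end` strings shown in the array examples for options 1 and 2. Passing `step_months=6` would correspond to the 6-month example.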

## Backend Implementations

In all cases the implementation on the backend side needs to translate the notation into an array of time ranges combined as an *OR* type query (`should` in Elasticsearch), where a record matches if it falls within any of the time ranges. This could add time to queries, and makes the indexing of Date more important for optimization. In some cases it could be good to remove overlapping ranges, both to avoid redundant query time and to address concern over duplicate records in a response. _Note: when periodrange is passed, time should be ignored._
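
As a rough illustration of that translation, the sketch below parses `start/end` strings, merges overlapping ranges, and builds an Elasticsearch `bool` query with one `range` clause per remaining interval under `should`. The helper names and the field name `properties.datetime` are assumptions made for this example, not part of any existing STAC backend.

```python
from datetime import datetime


def parse_interval(text):
    """Split an ISO 8601 'start/end' string into a (start, end) pair."""
    start, end = text.split("/")
    return datetime.fromisoformat(start), datetime.fromisoformat(end)


def merge_overlapping(intervals):
    """Sort intervals and fuse any that overlap or touch."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def to_should_query(intervals, field="properties.datetime"):
    """Build a bool query that matches a record inside any of the ranges."""
    return {
        "bool": {
            "should": [
                {"range": {field: {"gte": s.isoformat(), "lte": e.isoformat()}}}
                for s, e in intervals
            ],
            # Without this, "should" clauses can be optional in some contexts.
            "minimum_should_match": 1,
        }
    }


periodrange = [
    "2016-12-21T00:00:00/2017-03-21T23:59:59.999",
    "2017-02-01T00:00:00/2017-04-01T00:00:00",  # overlaps the first range
    "2017-12-21T00:00:00/2018-03-21T23:59:59.999",
]
query = to_should_query(merge_overlapping(parse_interval(p) for p in periodrange))
```

Merging before building the query means the first two ranges collapse into a single clause, so the backend evaluates two `range` filters instead of three.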

## Findings

Option 3.1 seems to be the best way forward, but that notation does not seem to be in use by any other projects. It would likely require a new library for each language in which people implement a STAC-compliant API, as none of them seem to support the syntax. The closest thing we could find was [rrule](https://dateutil.readthedocs.io/en/stable/rrule.html) for calendars in Python.

m-mohr commented 4 years ago

In openEO, we use the array notation for a similar use case. It's easier to implement on the back-end side and seems a bit more versatile.

I'm wondering how that aligns with the CQL alignment? See #32

wildintellect commented 4 years ago

I agree that the array syntax is the simplest on the back-end, and that any other syntax for the API would need to be translated upon receipt before being used. I think the reason I leaned towards the Freq option was to make it easy to do GET requests with a parameter instead of adding a non-standard body, or forcing the use of a POST.

It appears CQL would be pretty similar. This actually looks really similar to how the Elasticsearch query ended up.

CQL JSON:

{
    "or": [
        {
            "during": {
                "property": "updated",
                "value": ["2016-12-21T00:00:00", "2017-03-21T23:59:59.999"]
            }
        },
        {
            "during": {
                "property": "updated",
                "value": ["2017-12-21T00:00:00", "2018-03-21T23:59:59.999"]
            }
        }
    ]
}

philvarner commented 3 years ago

@cholmes I'm dropping this from the beta.2 milestone -- if we're going to do something beyond what CQL supports, it needs to have a lot more discussion and shouldn't block beta 2.

philvarner commented 3 years ago

In the current iteration of CQL JSON within the "Simple CQL" conformance class, it would look like this.

Assume a queryable is defined for the variable term `updated`.

{
    "or": [
        {
            "anyinteracts": [
                { "property": "updated" },
                ["2016-12-21T00:00:00Z", "2017-03-21T23:59:59.999Z"]
            ]
        },
        {
            "anyinteracts": [
                { "property": "updated" },
                ["2017-12-21T00:00:00Z", "2018-03-21T23:59:59.999Z"]
            ]
        }
    ]
}

I imagine a Python API for this would look something like this, whereby a user passes a list of 2-tuples of datetimes and that then gets converted to the correct JSON syntax:


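import iso8601  # third-party package (pip install iso8601)

# Two seasonal windows as timezone-aware datetimes.
dt1a = iso8601.parse_date("2016-12-21T00:00:00Z")
dt1b = iso8601.parse_date("2017-03-21T23:59:59.999Z")
dt2a = iso8601.parse_date("2017-12-21T00:00:00Z")
dt2b = iso8601.parse_date("2018-03-21T23:59:59.999Z")

# Query is a hypothetical client class, not an existing API. Assigning a list
# of (start, end) tuples would be converted into the "or" JSON shown above.
query = Query()
query.updated = [(dt1a, dt1b), (dt2a, dt2b)]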
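
The conversion step itself could be sketched roughly as follows, using only the standard library. The function name `to_cql_json` is hypothetical, the output follows the `anyinteracts` layout shown above, and the sketch assumes timezone-aware datetimes as input (timestamps are emitted with millisecond precision and a trailing `Z`).

```python
import json
from datetime import datetime, timezone


def to_cql_json(prop, intervals):
    """Build an "or" of "anyinteracts" clauses, one per (start, end) pair."""

    def fmt(dt):
        # Normalize to UTC and render with milliseconds and a trailing "Z".
        text = dt.astimezone(timezone.utc).isoformat(timespec="milliseconds")
        return text.replace("+00:00", "Z")

    return {
        "or": [
            {"anyinteracts": [{"property": prop}, [fmt(start), fmt(end)]]}
            for start, end in intervals
        ]
    }


utc = timezone.utc
body = to_cql_json(
    "updated",
    [
        (datetime(2016, 12, 21, tzinfo=utc),
         datetime(2017, 3, 21, 23, 59, 59, 999000, tzinfo=utc)),
        (datetime(2017, 12, 21, tzinfo=utc),
         datetime(2018, 3, 21, 23, 59, 59, 999000, tzinfo=utc)),
    ],
)
print(json.dumps(body, indent=2))
```

Keeping the conversion in one small function means the user-facing call stays a plain list of tuples, while the CQL JSON shape can change with later drafts without affecting callers.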

cholmes commented 3 years ago

Sounds good. May be ok to close it entirely. @wildintellect - we're adopting CQL, and it seems like it's expressive enough to capture what you want? And if not then we should raise this in the CQL issues (in features api ogc repo).