radiantearth / stac-spec

SpatioTemporal Asset Catalog specification - making geospatial assets openly searchable and crawlable
https://stacspec.org
Apache License 2.0
772 stars 177 forks source link

Extend summary to support list of objects / Better describe summary values #1045

Closed jjrom closed 3 years ago

jjrom commented 3 years ago

In the Collection "summaries" property, a summary for a field can be specified in two ways - either 1) an array of "all distinct values in an array", or 2) "a Stats Object".

With this definition, the summary for the platform field with two platforms (sentinel-2a and sentinel-2b) would be: "platform": [ "sentinel-2a", "sentinel-2b" ]

In resto, we have additional statistical information for each fields (including platform). In particular, the number of items for each platforms can be provided to the user. It sounds natural to provide this information through the "summaries" property. As a consequence in resto, the previous example is written this way: "platform": [ { "id": "sentinel-2a", "name": "Sentinel - 2A", "parentId": "root", "count": 211199 }, { "id": "sentinel-2b", "name": "Sentinel - 2B", "parentId": "root", "count": 208632 } ]

Does the "all distinct values in an array" mean that the "value" is a string or a number or does it also mean that the "value" can be any object ? In the latter case, the resto approach is valid against the specification. Otherwise, could we update the specification to allow this approach ?

m-mohr commented 3 years ago

The values in the summary array can be objects, but the summary values need to be compliant with the original types as defined by STAC. So your platform example is invalid, it can only be an array of strings. But a field that defines an object could be summarized with an array of objects, of course.

That Stats Object for summaries allows objects, but in this case it's a single object per field and it must contain minimum and maximum values. So that's not so helpful for your use case.

Now that 1.0.0 RC1 has been released, we also can't really change the summaries anymore (before STAC 2.0). So you need to put your additional details somewhere else to be STAC-compliant. That could be a new extension or maybe for APIs it would be covered by the queryables endpoint from OGC APIs?

jjrom commented 3 years ago

Ok but in this example (https://github.com/radiantearth/stac-spec/blob/master/examples/collection-only/collection.json), the summaries property contains a "eo:bands" entry that does not fulfill the Stats Object

m-mohr commented 3 years ago

eo:bands (defined as object in the spec) is contained in an array there. I don't understand the issue with the example?

jjrom commented 3 years ago

Sorry I misunderstood the first explanation, i.e. that the properties in the summaries "must be compliant with the original types as defined by STAC". This is not clear from the specification (https://github.com/radiantearth/stac-spec/blob/master/collection-spec/collection-spec.md#summaries). Perhaps this should be explicitly added in the summary chapter ? To conclude and check your explanation, if i name the property in the first example "dummy_platform" instead of "platform" then this would be valid since "a field that defines an object could be summarized with an array of objects" ?

m-mohr commented 3 years ago

Okay, I've planned this to be improved for RC2 and add something like "summaries must be compliant with the original types as defined by STAC".

Of course, you can come up with your own field specifications, but then tooling will likely not understand it. It's probably better to figure out a way to make this work in a general-purpose extension. Although I'm not sure myself where this could be placed best.

cholmes commented 3 years ago

I do think we should have some mechanism to report 'counts' - a couple catalogs do it, and it's definitely a valuable thing to see.

I'm very open as to how we do it, and seems to make sense to start in an extension, though I'm potentially open to an optional field as we are going to do RC2 so can put in minor changes.

cholmes commented 3 years ago

@jjrom - can you describe the use case for needing this? Counts make sense to me, and I could see an extension just for that, but want to understand your parentId and name bits, so we're sure to suggestion a mechanism that can fully meet your needs.

jjrom commented 3 years ago

To clarify my initial requirement, forget the other properties in my example (parentId, etc.) and just keep the count property. The use case is to provide the count of the number of items in a stac catalog for each property value.

For instance I want to provide the number of items for platform "sentinel-2a" and for "sentinel-2b". With the constraint that the "summaries must be compliant with the original types as defined by STAC", it will never be possible to add this information within the summary property, because under "platform" only an array of string is allowed i.e. ["sentinel-2a", "sentinel-2b"]. The additional count information for "sentinel-2a" and "sentinel-2b" should then be provided in another property (e.g. "summary_extension" - forget the dummy name :).

This extension would complexify the implementation both server side and client side.

Server side, we need to duplicate platform information: the "STAC core" compliant "summary" property with an array of platform name; and an additional "summary_extension" property with an array of platform objects indicating the count attached to each platform value.

Client side, the platform information can be retrieved in two places: the core "summary" with limited information (i.e. the array of string) or the "summary_extension" property with richer information.

So I propose to make the summary property a bit more generic in the core specification as follows (wording to be refined) :

Summaries must be compliant with the original types as defined by STAC. In the case of a basic data type (i.e. string or number), summaries can be expressed either 1) with the type value or 2) with an object where the type value is provided within a "value" property.

For instance the summary "platform" could be expressed either as [ "sentinel-2a", "sentinel-2b" ] or as [ { "value":"sentinel-2a" }, { "value":"sentinel-2b" } ]

From an implementation point of view this is not a big deal - just add a if/else case on the summary type. But with a huge advantage that we can add additional fields to the object structure through extension directly within the summaries property. For instance, adding a "count" extension as discussed previously, or even a "title" extension to provide a human readable name to the value (for instance a human readalbe title could be added to each EPSG code defined here https://github.com/radiantearth/stac-spec/blob/rc.2/examples/collection-only/collection.json#L89-L150)

m-mohr commented 3 years ago

I'm opposed to change this for 1.0 as this is a rather huge breaking change (at least for my implementations), which we shouldn't do any more in a release candidate phase (rc should be feature freeze). There are numerous implementations that depend on the existing behavior, which would likely all break (e.g., STAC Browser, STAC Index, stac-fields, openEO API spec). Also, there should be one way of doing things, not two. So if we'd break things, then we could simply forbid the array completely (and break a lot of things). On the other hand, the summaries were meant to be just a small/light construct to give potential values, not give whatever additional information. (Leaving the option open in the Stats Object to add more fields was not very consistent though.)

I understand that there is a use case for it, but it seems that right now the only non-breaking option is a new extension, unfortunately.

jjrom commented 3 years ago

I understand your view but before rc2, the "summaries must be compliant with the original types as defined by STAC" was an implicit constraint...so until making it explicit in rc2, at least my implementations were perfectly valid from the specification. And now they are broken :) However, my main concern is not to update my code - but that the summary as it is expressed cannot be extended. A new property must be defined as an extension. And this property will basically duplicate summary values with additional fields. So we will end up with two fields for the same information. ps: I completely agree to remove the array completely. My proposal to keep both is to limit breaking changes in the implementation - current server implementations would not be affected by this update since they can follow the existing array specification

m-mohr commented 3 years ago

~No, sorry, that is not correct. Although the written documentation was incomplete, the schemas (which are part of the spec) were already enforcing the arrays so your implementation would have been reported as invalid in previous versions already. Please try any of the validation tools to validate your implementation and you'll see it will complain for (I think) all previous versions that had summaries.~

Yes, currently we would end up with two fields, which is unfortunate, I agree. On the other hand at some point we need to stop introducing breaking changes (in this case only for clients, indeed). For me that is the RC stage, but there are other opinions out there.

jjrom commented 3 years ago

@m-mohr "No, sorry, that is not correct. Although the written documentation was incomplete, the schemas (which are part of the spec) were already enforcing the arrays so your implementation would have been reported as invalid in previous versions already"

My implementation is correct from the definition ( https://github.com/radiantearth/stac-spec/blob/dev/catalog-spec/json-schema/catalog-core.json#L127-L134) the content of an array is "Any data type":

      {
        "title": "Set of values",
        "type": "array",
        "minItems": 1,
        "items": {
          "description": "Any data type could occur."
        }
      }

Since an object is "Any data type", I have the right to write the platform values as an array of objects: "platform": [ { "value":"sentinel-2a" }, { "value":"sentinel-2b" } ]

Only the "summaries must be compliant with the original types as defined by STAC" sentence enforce that in the case of "platform" the array must be an array of string and not an array of object. Nowhere in the schema this is specified

m-mohr commented 3 years ago

Yes, sorry, I missed that you had encapsulated things in an array. I thought somehow it's an object similar to the Stats Object. I guess it would have been helpful to use markdown code formatting above that makes it easier to read. And I should take more time to read. Indeed, we can't enforce the "same datatype" thing in JSON Schema without a huge amount of additional/duplicated code. Still, I'd argue several things in the spec at least strongly imply that you would need to specify just the values, e.g.

Anyway, I'm against yet another breaking change in the RC stage of the spec. But that is just my opinion and others may vote for it. ;-)

jjrom commented 3 years ago

The funny thing is that i exactly did what was written in the specification i.e. "See the examples folder for Collections with summaries to get a sense of how to use them"...and I found the "eo:bands" property which is an array of objects (with the value being in a property called "name" :) So i generalized that to all my properties to be able to add the "count" property.

I'm deeply sorry to raise this issue so lately in the process, but I think that is worth a larger discussion. I understand that this would have an impact on client implementation but it would provide a huge benefit to make the summary property easily extendable in the future.

m-mohr commented 3 years ago

Hmm, eo:bands follows the specified data type (a merged array of objects), of course. So that is to be expected that it contains an object with a name for example.

If we'd have got the feedback earlier, I guess I could have been talked into the object approach although I still like that the arrays as they are are more compact for most use cases.

cholmes commented 3 years ago

This is definitely another tough one. We should probably have a specific meeting on it with everyone involved. My initial thoughts:

m-mohr commented 3 years ago

We have at least 8 implementations in openEO, there are some non-openEO implementations out in the wild, we have deeply integrated the summaries in the openEO API and it is implemented in stac-fields, browser, index and openEO clients...

cholmes commented 3 years ago

@m-mohr - do you have a clean alternative idea for the 'counts' use case?

m-mohr commented 3 years ago

What is clean? I only have the solution to add a new field, there's no other solution I think. But I haven't fully understood the use case yet. What are the counts used for? How are counts used with the ranges/stats objects? Don't they always give all items? That seems not very meaningful.

To be clear: Breaking summaries will make openEO stick to 0.9 for one or two years. We can't adopt the change during the openEO 1.0 release cycle. The type field though will also not be required there, it will be optional. 🤷‍♂️

I'm not convinced Jerome's implementation is valid. Adding random objects doesn't allow you to get the values because it's undefined where they are stored then. Implementations can only guess which property contains the value. How should that work?

"Collections are strongly recommended to provide summaries of the values of fields that they can expect from the properties of STAC Items contained in this Collection. [...] A set of all distinct values in an array...." -> the values in item properties are clearly defined and they should be listed as a set.

jjrom commented 3 years ago

@m-mohr My implementation is a valid interpretation of what is written in the specification and it conforms to the json schema. It does not add random object but gives context to the summaries values.

My proposal does not break any existing server implementations nor any validation tools. It only impacts client to check if the values in an array are a list of basic types or a list of objects. And in the the latter case, if the client only supports a list of basic types, it can easily convert the list of objects to a list of basic types values using the "value" property of that object (i.e. in javascript var propertyAsList = propertyAsObject.map(x => x.value);)

Concerning the "count" use case : I want to provide users a summary of what is stored in the catalog - not only how many items are stored, but how many items per platform, per location, per date, etc. This allows to display on the landing page of the catalog a comprehensive set of statistics to indicates the repartition of the items per property. This is a very common use case among data provider (at least space agencies) to indicates on the landing page how many products they have.

Another example of the array of basic type limitation is that it supposes an implicit understanding of the values provided. The platform property contains obvious values (i.e. "sentinel-2a" and "sentinel-2b"). But this is not the case for proj:epsg property. For instance it is far more convenient to work with named projection (e.g. for France, I use the "Lambert-93" projection - i never remember its epsg code). So I see a strong interest to be able to provide a human readable title to be used in rendered displays of the property. This would be possible with an array of object instead of an array of basic types.

As @cholmes proposes, we should discuss this with everyone involved.

m-mohr commented 3 years ago

My proposal does not break any existing server implementations nor any validation tools

But several clients.

and in the the latter case, if the client only supports a list of basic types, it can easily convert the list of objects to a list of basic types values using the "value" property of that object (i.e. in javascript var propertyAsList = propertyAsObject.map(x => x.value);)

No, not yet, because the value property is not defined anywhere yet. And thus a client currently can't find any of the values. The spec says you should provide a set of values from the item properties, but you don't provide a set of values from properties. You are providing a set of objects which contain the value somewhere, but it's not clear where it is for clients. So I still think the interpretation is not correct and the spec was specifying it without the clarification about the data types that those objects are not allowed. It's implicitly, but I don't think there's really much room for making it some kind of object. Surely, this could be possible to be implemented in the future and I see the value in your use case, but for me it just doesn't work in summaries.

Another question: Is your use case more aimed towards APIs? Like updating counts in a server is easy, but updating a collection for every new item in a static catalog would probably be problematic. For APIs, I'm thinking that you could use OGC API Queryables (see https://portal.ogc.org/files/96288#filter-queryables), which allow quite a bit more for your use case (i.e. full JSON Schema). Doesn't that seem more reasonable to re-use this existing vocabulary instead of defining a new vocabulary here? We could also backport the OGC queryables spec to a queryables/schema extension (we could also call it schemas or something else, because queryables may not fit completely) and be added as new field to collections that would just have what APIs also define for search anyway. It also works for static catalogs, of course. Your platform example there could look like this:

"platform": {
    "oneOf": [
        {
            "const": "sentinel-2a",
            "title": "Sentinel - 2A",
            "parentId": "root",
            "count": 211199
        },
        {
            "const": "sentinel-2b",
            "title": "Sentinel - 2B",
            "parentId": "root",
            "count": 208632
        }
    ]
}

For EPSG codes some of my clients have a pre-defined list of EPSG codes and their titles and then can just use those without having them in the metadata.

jjrom commented 3 years ago

But several clients.

I don't know how many clients are in the field. And I don't either know how many of them support the optional summaries option. I don't know how many client could be affected.

This is why we need a larger discussion with other implementors.

m-mohr commented 3 years ago

The following clients and libraries implement summaries as defined in RC2 and are affected:

Plus:

m-mohr commented 3 years ago

To give an example, I've migrated the collection-only example given above (https://github.com/radiantearth/stac-spec/blob/master/examples/collection-only/collection.json) to the proposed API with consistently all objects. While making this, I realized that depending on the specification, it may not work this way to get counts per values in arrays, see eo:bands and instruments.

Before:

{
  "datetime": {
    "minimum": "2015-06-23T00:00:00Z",
    "maximum": "2019-07-10T13:44:56Z"
  },
  "platform": [
    "sentinel-2a",
    "sentinel-2b"
  ],
  "constellation": [
    "sentinel-2"
  ],
  "instruments": [
    "msi"
  ],
  "view:off_nadir": {
    "minimum": 0,
    "maximum": 100
  },
  "view:sun_elevation": {
    "minimum": 6.78,
    "maximum": 89.9
  },
  "sci:citation": [
    "Copernicus Sentinel data [Year]"
  ],
  "gsd": [
    10,
    30,
    60
  ],
  "proj:epsg": [
    32601,
    32602,
    32603,
    32604,
    32605,
    32606,
    32607,
    32608,
    32609,
    32610,
    32611,
    32612,
    32613,
    32614,
    32615,
    32616,
    32617,
    32618,
    32619,
    32620,
    32621,
    32622,
    32623,
    32624,
    32625,
    32626,
    32627,
    32628,
    32629,
    32630,
    32631,
    32632,
    32633,
    32634,
    32635,
    32636,
    32637,
    32638,
    32639,
    32640,
    32641,
    32642,
    32643,
    32644,
    32645,
    32646,
    32647,
    32648,
    32649,
    32650,
    32651,
    32652,
    32653,
    32654,
    32655,
    32656,
    32657,
    32658,
    32659,
    32660
  ],
  "eo:bands": [
    {
      "name": "B1",
      "common_name": "coastal",
      "center_wavelength": 4.439
    },
    {
      "name": "B2",
      "common_name": "blue",
      "center_wavelength": 4.966
    },
    {
      "name": "B3",
      "common_name": "green",
      "center_wavelength": 5.6
    },
    {
      "name": "B4",
      "common_name": "red",
      "center_wavelength": 6.645
    },
    {
      "name": "B5",
      "center_wavelength": 7.039
    },
    {
      "name": "B6",
      "center_wavelength": 7.402
    },
    {
      "name": "B7",
      "center_wavelength": 7.825
    },
    {
      "name": "B8",
      "common_name": "nir",
      "center_wavelength": 8.351
    },
    {
      "name": "B8A",
      "center_wavelength": 8.648
    },
    {
      "name": "B9",
      "center_wavelength": 9.45
    },
    {
      "name": "B10",
      "center_wavelength": 1.3735
    },
    {
      "name": "B11",
      "common_name": "swir16",
      "center_wavelength": 1.6137
    },
    {
      "name": "B12",
      "common_name": "swir22",
      "center_wavelength": 2.2024
    }
  ]
}

After:

{
  "datetime": {
    "minimum": "2015-06-23T00:00:00Z",
    "maximum": "2019-07-10T13:44:56Z"
  },
  "platform": [
    {
      "value": "sentinel-2a"
    },
    {
      "value": "sentinel-2b"
    }
  ],
  "constellation": [
    {
      "value": "sentinel-2"
    }
  ],
  "instruments": [
    {
      "value": [
        "msi"
      ]
    }
  ],
  "view:off_nadir": {
    "minimum": 0,
    "maximum": 100
  },
  "view:sun_elevation": {
    "minimum": 6.78,
    "maximum": 89.9
  },
  "sci:citation": [
    {
      "value": "Copernicus Sentinel data [Year]"
    }
  ],
  "gsd": [
    {
      "value": 10
    },
    {
      "value": 30
    },
    {
      "value": 60
    }
  ],
  "proj:epsg": [
    {
      "value": 32601
    },
    {
      "value": 32602
    },
    {
      "value": 32603
    },
    {
      "value": 32604
    },
    {
      "value": 32605
    },
    {
      "value": 32606
    },
    {
      "value": 32607
    },
    {
      "value": 32608
    },
    {
      "value": 32609
    },
    {
      "value": 32610
    },
    {
      "value": 32611
    },
    {
      "value": 32612
    },
    {
      "value": 32613
    },
    {
      "value": 32614
    },
    {
      "value": 32615
    },
    {
      "value": 32616
    },
    {
      "value": 32617
    },
    {
      "value": 32618
    },
    {
      "value": 32619
    },
    {
      "value": 32620
    },
    {
      "value": 32621
    },
    {
      "value": 32622
    },
    {
      "value": 32623
    },
    {
      "value": 32624
    },
    {
      "value": 32625
    },
    {
      "value": 32626
    },
    {
      "value": 32627
    },
    {
      "value": 32628
    },
    {
      "value": 32629
    },
    {
      "value": 32630
    },
    {
      "value": 32631
    },
    {
      "value": 32632
    },
    {
      "value": 32633
    },
    {
      "value": 32634
    },
    {
      "value": 32635
    },
    {
      "value": 32636
    },
    {
      "value": 32637
    },
    {
      "value": 32638
    },
    {
      "value": 32639
    },
    {
      "value": 32640
    },
    {
      "value": 32641
    },
    {
      "value": 32642
    },
    {
      "value": 32643
    },
    {
      "value": 32644
    },
    {
      "value": 32645
    },
    {
      "value": 32646
    },
    {
      "value": 32647
    },
    {
      "value": 32648
    },
    {
      "value": 32649
    },
    {
      "value": 32650
    },
    {
      "value": 32651
    },
    {
      "value": 32652
    },
    {
      "value": 32653
    },
    {
      "value": 32654
    },
    {
      "value": 32655
    },
    {
      "value": 32656
    },
    {
      "value": 32657
    },
    {
      "value": 32658
    },
    {
      "value": 32659
    },
    {
      "value": 32660
    }
  ],
  "eo:bands": [
    {
      "value": [
        {
          "name": "B1",
          "common_name": "coastal",
          "center_wavelength": 4.439
        },
        {
          "name": "B2",
          "common_name": "blue",
          "center_wavelength": 4.966
        },
        {
          "name": "B3",
          "common_name": "green",
          "center_wavelength": 5.6
        },
        {
          "name": "B4",
          "common_name": "red",
          "center_wavelength": 6.645
        },
        {
          "name": "B5",
          "center_wavelength": 7.039
        },
        {
          "name": "B6",
          "center_wavelength": 7.402
        },
        {
          "name": "B7",
          "center_wavelength": 7.825
        },
        {
          "name": "B8",
          "common_name": "nir",
          "center_wavelength": 8.351
        },
        {
          "name": "B8A",
          "center_wavelength": 8.648
        },
        {
          "name": "B9",
          "center_wavelength": 9.45
        },
        {
          "name": "B10",
          "center_wavelength": 1.3735
        },
        {
          "name": "B11",
          "common_name": "swir16",
          "center_wavelength": 1.6137
        },
        {
          "name": "B12",
          "common_name": "swir22",
          "center_wavelength": 2.2024
        }
      ]
    }
  ]
}
jjrom commented 3 years ago

Matthias would you be available for a meeting in the next days ? I'm sure that this will allow us to reach a mutual understanding that we obviously are unable to achieve here.

m-mohr commented 3 years ago

Yes, sure. Monday is a public holiday for us, Wednesday has some meetings, but otherwise, I'm pretty flexible.

m-mohr commented 3 years ago

After a nice discussion with @jjrom this morning, we have come up with a potential solution that at least in my STAC implementations and openEO would be the least breaking.

It basically leaves the array structure as it is and only allows native data types, but instead it uses the extensible structure of the Stats Object and evolves it a bit further. We could make it a Schema Object instead of a Stats Object and align it with JSON Schema. As JSON Schema allows new keywords to be defined, we'd define a new keyword "values", which is an array of objects with details about every value, see variant 3. The breaking change would be that the Stats Object, would either require "minimum" and "maximum" (to specify an extent) or allow the "values", which is similar to the array, but can only be an array of objects with a "value" and probably at least one additional property (we have not discussed this requirement yet, Jerome, but I think that would help to distinguish it from the array of native typed values).

Example:

{
  "summaries": {
    // variant 1: Array with Stac-native data types, including array merging - no change to rc2
    "platform": [
      "Sentinel-1A",
      "Sentinel-1B",
      "Sentinel-1C"
    ],
    // variant 2: Schema Object - formerly known as the "Stats" Object (not very meaningful example)
    "platform": {
      "minimum": "Sentinel-1A",
      "maxmimum": "Sentinel-1C"
    }
    // variant 3: Schema Object - addition in rc3 is that either "values" (array of objects) or "minimum" and "maximum" are required
    "platform": {
      "type": "string", // You can optionally add whatever JSON Schema keywords that helps you describe things better
      "values": [ // New JSON Schema keyword
        {
          "value": "Sentinel-1A",
          "count": 321
        },
        {
          "value": "Sentinel-1B",
          "count": 123
        },
        {
          "value": "Sentinel-1C",
          "count": 222
        }
      ]
    }
  }
}

One thing to keep in mind is that this would be yet another breaking change in RC stage of the specification with potential to break things and have side effects that we are not aware of yet. The clean alternative without breaking things (but requiring a bit of duplication in collections) is to define a separate "schema" extension, e.g.

{
  // variant 4: new extension to specify JSON Schema per field
  "schemas": {
    "platform": {
      "type": "string", // You can optionally add whatever JSON Schema keywords that helps you describe things better
      "values": [ // New JSON Schema keyword
        {
          "value": "Sentinel-1A",
          "count": 321
        },
        {
          "value": "Sentinel-1B",
          "count": 123
        },
        {
          "value": "Sentinel-1C",
          "count": 222
        }
      ]
    }
  }
}

So we need feedback on whether another breaking change by adding in variant 3 would be worth it and supported by the STAC community or whether we should be conservative and evolve new things in STAC extensions first (Alternative 4).

jjrom commented 3 years ago

Thanks you again @m-mohr for the great discussion this morning !

It's true that variant 3 introduces a breaking change which is not exactly what we want in a RC release. However this change should have zero to a a limited impact to existing implementations and it will allow future extension to the "summaries" property which brings a huge benefit to the specification.

cholmes commented 3 years ago

Glad you guys had a good discussion and got to some solid ideas. I can get behind option 3 and doing a breaking change, as it is nicely minimized to just the stats object. I do hear you on it potentially introducing things we're not yet sure of. But it does feel a bit duplicative to have to list all the fields twice.

Using JSON Schema I do think is fairly elegant. Though also ironic in light of the comment I made when we first added summaries - https://github.com/radiantearth/stac-spec/issues/413#issuecomment-472565303 But I think I actually like this approach, of using a subset of JSON Schema. It seems to let us pick which parts of JSON Schema to use, instead of us re-inventing something that people want JSON Schema for.

I'm +1 on either option, and I think overall I do like option 3 better than 4, since I feel like a new field called 'schema' that is a json schema could lead to confusion and people trying to fully implement the JSON schema for their collection there. So I like that it's in a 'summary', but we're reusing JSON schema to describe a summary.

m-mohr commented 3 years ago

Using JSON Schema I do think is fairly elegant. Though also ironic in light of the comment I made when we first added summaries - #413 (comment)

I actually quoted you in the call with Jerome. You said in the SF sprint at Planet that you don't want yet another schema language in summaries... ;-)

But I think I actually like this approach, of using a subset of JSON Schema. It seems to let us pick which parts of JSON Schema to use, instead of us re-inventing something that people want JSON Schema for.

Limiting JSON Schema to a subset was not discussed yet by @jjrom and me, but probably makes sense. On the other hand, we discussed requiring (minimum and maximum) OR (values), but on the other hand, people then may say they only want to specify other things, e.g. type. I also see use cases where you only want to specify a minimum as a maximum is not defined (or vice-versa)? Then there's also minimumExclusive and maximumExclusive. So getting this right may take a couple of iterations. That's why it may not be a bad idea to evolve it in an extension despite the duplication. Also, there might be duplication, but right now we say that some values like "datetime" don't make much sense to be summarized, but someone may want to specify a "schema" for it. Lastly, we may introduce a second place where schemas can be defined for fields. There's "stac_extensions" with schemas (that currently can't really validate summaries, but that's not the point) and there's then schemas in summaries (or the extension). Those are all things that came to my mind after the call and I doubt it's easy to get this right just in a couple of days until RC3 (or even 1.0). If we don't feel confident enough, there's maybe a compromise to weaken the Stats Object a bit and evolve the semantics in there in an extension? Anyway, I was just typing out more thoughts here and we clearly need to think more about getting this to a solid state, I think.

I'm +1 on either option, and I think overall I do like option 3 better than 4, since I feel like a new field called 'schema' that is a json schema could lead to confusion and people trying to fully implement the JSON schema for their collection there.

Of course, the name was just an example. We can name it whatever we like to avoid confusion.

cholmes commented 3 years ago

Limiting JSON Schema to a subset was not discussed yet by @jjrom and me, but probably makes sense.

I'm also ok to just have it evolve through use, and just have the recommended fields to use in the spec, with examples. Key to me would just be not expecting tooling to have to fully support JSON schema parsing to be 'compliant' with the spec.

If we don't feel confident enough, there's maybe a compromise to weaken the Stats Object a bit and evolve the semantics in there in an extension?

I'm not sure exactly what you mean but I think I had a similar thought, but was hesitant to suggest it. My thought was to do something like make the stats object an extension, and have the 'core' for now just be the array, and evolve the object option in an extension. I think maybe you're saying to keep an object in core, but it minimal and extensible, and evolve the semantics in an extension? Whatever the exact path is I think I do like it. A more extreme option would be to move summaries as a whole to be an extension, in line with the same instinct to push all the extensions out of core, so the core spec is as slim and extensible as possible. But I don't need to push for it, just throwing it out there.

jisantuc commented 3 years ago

tbh I don't really understand what was wrong with the original implementation. "Any data type" means any data type, so in stac4s we just model summaries as a JsonObject since that's all that we definitely know about it (we don't know anything about its keys or the kinds of values they should contain). The summaries object having a shape depending on the items in the catalog is still true in all of the proposals above, so I don't expect that to change much. Clients who know / need something specific are responsible for narrowing that json object on their own.

Re:breaking changes, I agree with @cholmes that this is the kind of feedback that RCs are for -- the spec was unclear, which made an apparently innocuous change actually a breaking change, so that should be resolved. Refining the type of summaries to me seems like the right direction for resolving it.

m-mohr commented 3 years ago

My intention was not to move summaries to an extension! Also, it still needs a way to specify ranges, but we could loosen the requirement for having minimum and maximum as required properties and simply allow JSON Schema in there with a recommendation to use minimum/maximum for ranges? Not sure.

Anyway, if most people are fine with breaking changes, then we can simply PR variant 3 IMHO.

The PR should also fix the schema to say

"For each field only the original data type of the property can occur, but we can't validate that in JSON Schema yet."

instead of

"Any data type could occur."

m-mohr commented 3 years ago

While writing up the JSON Schema for this, I realized that we have an inconsistency with JSON Schema in our spec: We allow minimum and maximum to be strings (mostly for temporal fields). This is not supported by JSON Schema and thus not all variants can actually be compliant with JSON Schema, which makes the description of the change much harder...

m-mohr commented 3 years ago

On the other hand, we may get away without defining our own keyword. We can use "oneOf" (or "anyOf") and "const" instead of from JSON Schema. Example:

    // variant 3: Schema Object - addition in rc3 is that either "values" (array of objects) or "minimum" and "maximum" are required
    "platform": {
      "type": "string",
      "oneOf": [
        {
          "const": "Sentinel-1A",
          "count": 321
        },
        {
          "const": "Sentinel-1B",
          "count": 123
        },
        {
          "const": "Sentinel-1C",
          "count": 222
        }
      ]
    }