radiantearth / stac-spec

SpatioTemporal Asset Catalog specification - making geospatial assets openly searchable and crawlable
https://stacspec.org
Apache License 2.0

STAC Extensions Prefixes suggestion #357

Closed ghost closed 5 years ago

ghost commented 5 years ago

In the extension documentation of the STAC API, prefixes are suggested as a way to include vendor-specific properties, for example:

  "properties": {
    "datetime":"2018-01-01T13:21:30Z",

    "dtr:start_datetime":"2018-01-01T13:21:30Z",
    "dtr:end_datetime":"2018-01-01T13:31:30Z",

    "eo:off_nadir_angle": -0.001,
    "eo:cloud_cover": 10.31,
    "eo:sun_azimuth": 149.01607154,
    "eo:sun_elevation": 59.21424700,
    "eo:resolution": 30
  }

I would suggest avoiding prefixes and instead using sub-categories/dicts/maps/groups to hold the properties of each extension:

"properties": {
  "datetime":"2018-01-01T13:21:30Z",
  "dtr": {
    "start_datetime":"2018-01-01T13:21:30Z",
    "end_datetime":"2018-01-01T13:31:30Z"
  },
  "eo": {
    "off_nadir_angle": -0.001,
    "cloud_cover": 10.31,
    "sun_azimuth": 149.01607154,
    "sun_elevation": 59.21424700,
    "resolution": 30
  }
}

This will make the implementation in languages that support "objects to JSON" serialization like Python much easier, since each property can be mapped directly from a class member variable.

In addition one can identify all extensions easier, without the need to parse all parameters to check the prefixes.
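A minimal Python sketch of the mapping described above, assuming hypothetical dataclasses `Dtr`, `Eo`, and `Properties` that mirror the nested layout (these class names are illustrative only and do not come from the STAC spec):

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Dtr:
    start_datetime: str
    end_datetime: str


@dataclass
class Eo:
    off_nadir_angle: float
    cloud_cover: float
    sun_azimuth: float
    sun_elevation: float
    resolution: float


@dataclass
class Properties:
    datetime: str
    dtr: Dtr
    eo: Eo

    @classmethod
    def from_dict(cls, data: dict) -> "Properties":
        # Each nested object maps one-to-one onto a class; no key rewriting needed.
        return cls(
            datetime=data["datetime"],
            dtr=Dtr(**data["dtr"]),
            eo=Eo(**data["eo"]),
        )


doc = json.loads("""
{
  "properties": {
    "datetime": "2018-01-01T13:21:30Z",
    "dtr": {
      "start_datetime": "2018-01-01T13:21:30Z",
      "end_datetime": "2018-01-01T13:31:30Z"
    },
    "eo": {
      "off_nadir_angle": -0.001,
      "cloud_cover": 10.31,
      "sun_azimuth": 149.01607154,
      "sun_elevation": 59.21424700,
      "resolution": 30
    }
  }
}
""")

props = Properties.from_dict(doc["properties"])

# With nesting, the extensions in use are simply the keys whose values are objects.
extensions = [key for key, value in doc["properties"].items() if isinstance(value, dict)]
```

Under this layout, `asdict(props)` reproduces the original `properties` object, which is the round trip the proposal is after.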

jdries commented 5 years ago

I also ran into this issue while using an entirely different environment: we use apache NiFi to process metadata records. While trying to use STAC metadata with properties like eo:cloud_cover, I got an error from the avro library which also does not support colons in its property names. Avro is also a data serialization system, which allows json records to be represented in a binary form, for more efficient processing. I have not yet been able to implement a workaround for this, mostly because this is not really a 'programming' environment where I can easily sanitize and desanitize property names without introducing a lot of overhead for our operators.
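For environments where a scripting step is available, the sanitize/desanitize round trip mentioned above could look roughly like the following. This is only a sketch: the double-underscore stand-in for `:` is a hypothetical convention, not something STAC or Avro defines, and it assumes no property name already contains `__` or has an underscore directly next to the colon. The regular expression reflects Avro's rule that names start with a letter or underscore and continue with letters, digits, or underscores.

```python
import re

# Avro names: first character [A-Za-z_], remaining characters [A-Za-z0-9_].
AVRO_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

# Hypothetical stand-in for ':' (an assumption made for this sketch).
SEPARATOR = "__"


def sanitize(properties: dict) -> dict:
    """Rewrite 'eo:cloud_cover' style keys into Avro-compatible names."""
    out = {}
    for key, value in properties.items():
        if SEPARATOR in key or "_:" in key or ":_" in key:
            raise ValueError(f"cannot sanitize {key!r} without ambiguity")
        new_key = key.replace(":", SEPARATOR)
        if not AVRO_NAME.match(new_key):
            raise ValueError(f"{new_key!r} is still not a valid Avro name")
        out[new_key] = value
    return out


def desanitize(record: dict) -> dict:
    """Restore the original prefixed keys."""
    return {key.replace(SEPARATOR, ":"): value for key, value in record.items()}


original = {"datetime": "2018-01-01T13:21:30Z", "eo:cloud_cover": 10.31}
avro_safe = sanitize(original)
restored = desanitize(avro_safe)
```

The extra step on both sides of the pipeline is exactly the overhead described in the comment above.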

m-mohr commented 5 years ago

Thanks for pointing that out. You are referring to the extensions of the STAC specification in general, not just STAC API, I guess.

  1. I see that : can be a problem in certain environments. Nevertheless, this is an implementation issue: a : in JSON keys is valid according to the JSON specification, and some tools decided not to implement it fully.
  2. The reason to keep the properties flat and not implement it the way @mundialis-dev proposed is that some GeoJSON readers don't support objects/arrays as properties. In the end, all properties except "datetime" wouldn't be readable by some tools. This is also an implementation issue, as arrays and objects in properties are valid according to the GeoJSON specification.

That said, both issues mentioned are implementation issues. So in theory there shouldn't be a problem, but in practice there is as the specifications are not fully implemented.

As the range of allowed characters is very limited (Python for example only allows _, 0-9, a-z), the only solution I could imagine is to replace the : with a _.

Example:

  "properties": {
    "datetime":"2018-01-01T13:21:30Z",

    "dtr_start_datetime":"2018-01-01T13:21:30Z",
    "dtr_end_datetime":"2018-01-01T13:31:30Z",

    "eo_off_nadir_angle": -0.001,
    "eo_cloud_cover": 10.31,
    "eo_sun_azimuth": 149.01607154,
    "eo_sun_elevation": 59.21424700,
    "eo_resolution": 30
  }

That doesn't look as expressive as the : and is not very "sexy" in terms of style, but I don't see a very big technical issue here as I think nobody really used the : for any sort of separation of extension part and field name.

PS: That's solely my personal opinion.

cholmes commented 5 years ago

Thanks for the input @mundialis-dev and @jdries! So I was the one to introduce the prefixes, and the main reason was to try to point to desired future alignment with JSON-LD, which uses a lot of prefixes. Of course in JSON-LD those prefixes are attached to context, but I wanted to hint at that future, where a vendor or a community would actually define their fields with JSON-LD.

Note that there is no requirement for extensions to use the prefixes - they can define the fields as they want. So any extension can choose to use a nested style if they want. I do prefer the fully 'flat' style, as @m-mohr said, so that GeoJSON readers aren't thrown off - most GIS systems don't understand nested data structures, so with nested JSON they would just display the core fields and then a few blobs of json. And that display of a record in a GIS system is pretty important. And though it's a bit more annoying to write special case handling, it is still possible.
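The "special case handling" mentioned above can be fairly small. As a rough sketch, a hypothetical helper `flatten` (not part of any STAC tooling) could turn a one-level nested layout into the flat, prefixed form that GeoJSON readers display well:

```python
def flatten(properties: dict, sep: str = ":") -> dict:
    """Flatten one level of nesting into '<group><sep><field>' keys."""
    flat = {}
    for key, value in properties.items():
        if isinstance(value, dict):
            # A nested group becomes a set of prefixed top-level fields.
            for sub_key, sub_value in value.items():
                flat[f"{key}{sep}{sub_key}"] = sub_value
        else:
            flat[key] = value
    return flat


nested = {
    "datetime": "2018-01-01T13:21:30Z",
    "eo": {"cloud_cover": 10.31, "sun_elevation": 59.214247},
}
flat = flatten(nested)
```

The sketch only handles a single level of nesting; deeper structures would need a recursive variant.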

I am interested in a STAC extension that is more built for programmatic access. I was thinking about protobuf, but could just as easily be avro or others. STAC API's could then return the JSON or a binary depending on client request. I know @davidraleigh had some interest in doing a protobuf version of STAC. To me the point of the core STAC isn't ease of programming serialization, but as the core reference. Can use STAC Browser to turn it into HTML for human views. And would make sense to me to have tools that turn it into a format optimized for programmatic access.

jdries commented 5 years ago

Hi @cholmes, thanks for the response. It's clear that having extensions makes sense, and that nesting is also an issue for certain tools/libraries. One compromise could perhaps be to have all of the very common properties in an 'extension' without prefixes. I'm mostly thinking about properties such as 'eo:platform', eo:collection, eo:cloud-cover or c:description. I don't really see the value here anyway of having 'eo:' in front of them. (Other examples like the 'landsat:' prefix add more meaning to the property.) Doing so would make a lot of the examples look again like your initial proposal, which was so attractive because of its simplicity. In the meantime, the door remains open for json-ld and for more specialized extensions, which can indeed choose on their own how to organize their properties. I'm not sure we can separate the json representation from programmatic access. For instance, our current setup (at VITO) for processing and indexing stac records uses tools that are natively built for json, such as ElasticSearch. Also, Python is very often used for this kind of task because of how simple it is to parse json.

m-mohr commented 5 years ago

@jdries Who decides what is common? Without any prefix, several fields would conflict as they define the same name (e.g. SAR and EO). So the prefix is required (not saying that it must be the colon though).

tools that are natively built for json

If they are built natively for JSON, why can't the tools handle the colons? That's valid JSON according to the JSON specification.

jdries commented 5 years ago

@m-mohr To be clear, I wouldn't drop prefixes entirely, only from the 'EO' fields (assuming those are the common ones). Hence this would avoid a clash. I would also consider to what extent SAR really needs a separate extension, given that it is also rather common in EO.

As for tools, I guess the bottomline here is that we have identified a very concrete issue with Python, which is a tool you can't really ignore in the EO community. In that sense my proposal to avoid those prefixes for the 'common' cases might not even be a satisfactory solution for a lot of people. (For me personally, it would help in the sense that 90% of my use cases only require a few very common properties.)

m-mohr commented 5 years ago

@jdries The EO extension (recently renamed to "Electro-Optical" from "Earth Observation") is more of an MSI extension than a full EO extension, which is a bit confusing at first. Therefore a SAR extension is needed, as the EO extension doesn't cover the requirements to describe SAR data comprehensively. That's why there will be two extensions. And prioritizing one over the other isn't a thing the specification should do, I feel. So, I'd stay with prefixes for all extensions, but we may discuss another separator (the only one would be _, I guess).

Why isn't Python deserializing into dicts? That's also done by JS, PHP, and others, and it solves the issue; they all have the same naming restrictions as Python.

jdries commented 5 years ago

My proposal was in fact not to have all these different extensions like sar or electro-optical, just a single namespace for all these basic things. Having an extension for everything reminds me of INSPIRE, where we also have gmd, gmx, gmi, gml, and nobody knows what they mean. SAR and 'electro-optical' will be managed centrally anyway, so why would it be so hard to avoid using conflicting terms? In terms of impact, what would be the downside of using fewer prefixes? On the positive side, I would say that overall usability would increase, not only for Python programmers. You would also continue to be compatible with JSON-LD, which also does not seem to require that everything is prefixed. Note that I'm not the biggest Python supporter myself, so I can't really answer why the programming language or its users like to do things in a certain way.

m-mohr commented 5 years ago

These design decisions were made before I joined STAC, but I think the idea was to keep the "sub-specifications" loosely coupled and to not have (too many?) dependencies (e.g. dependencies between extensions). Also, not every spatio-temporal dataset is an EO dataset, for example, so having EO fields in core feels wrong. Each extension is clearly specified and there are not too many of them yet, so I think it's not as bad as something like gmd, gmx, gmi etc. (Who invented those terrible names?) Anyway, I think @cholmes is the person who can better answer this, as he laid the foundation for it.

cholmes commented 5 years ago

I think in the medium to long term I'm not opposed to dropping extensions for things that truly become core. But I believe the time to do that would be when the extension reaches a 'stable' maturity, in https://github.com/radiantearth/stac-spec/tree/dev/extensions#extension-maturity Basically once we all feel really good about the extension and truly commit to supporting it in the core. I think I'd also want to see more extensions from non-satellite imagery before we 'upgrade' the EO stuff, to make sure we didn't 'reserve' some keyword that means something else in a different domain. I don't want EO to be 'special' just because it came first. But I do think we could commit to disambiguating all fields in 'stable' extensions, and indeed then we could also aim to define 'common' fields between like pointcloud and EO, etc.

The core philosophy for me with STAC is to stick with a very small 'core' that people can use and adapt as they want, and to enable innovation in the extensions. And then to make standard what people actually use, instead of specifying ahead of time. So migrating from extension to 'core' too early I think can lock in 'more' than is ideal.

I also think it is ok for extensions to not use the prefixes if they want - I don't think the prefix idea is fully locked in. STAC is not aspiring to be all things to all people, it's aspiring to be a useful tool for people providing data and API's. So if it's not useful to use the prefixes people can make an implementation that uses the EO fields without the prefixes.

But I definitely hear the concern, and I'd like to get common names in without prefixes - just want to see if the whole STAC community, not just the EO focus, matures.

Thanks for engaging on this @jdries - I really appreciate the feedback. Would be great if you could come to the next 'sprint' (which we may try to do virtually), and convince more implementors of EO data stores of this, and I'd be happy to follow.

cholmes commented 5 years ago

I'm also curious to hear from more client side / tool implementors on the scope of this problem. Like could we 'help' this situation with more client libraries? Or tools to help convert? cc @mojodna / @matthewhanson / @jbants

60south commented 5 years ago

As someone new to STAC and still learning about GeoJSON in general, I have to say that the eo: extensions have been very confusing. I have a strong preference for simplicity, and the proliferation of colons and extensions in the key words looks messy.

More importantly, the eo: extension seems inappropriate in many cases. For instance, eo:sun_elevation ought to mean the same thing regardless of the data collection source. For example, if I am collecting, say, soil temperatures from an in situ sensor, sun_elevation still has relevance even though I'm not using an electro-optical instrument.

Adding additional extensions (e.g., "SAR") would only muddy things even more unless they were clearly relevant only to one data collection method. For instance, we're doing work with passive microwave sensors, which have some metadata overlap with SAR (polarization, receiver bandwidths, etc).

@cholmes wrote:

I'm not opposed to dropping extensions for things that truly become core. But I believe the time to do that would be when the extension reaches a 'stable' maturity

This worries me because it seems like we'd be locked-in to the extensions by then. Wouldn't it be better to decide this before things reach that maturity level? Maybe I'm not understanding how this process works.

My .02

glenn

cholmes commented 5 years ago

Thanks for sounding in @60south! Would love to hear more about your use case and how you'd use STAC. And definitely appreciate the feedback.

Fully agree with you on 'simplicity', but fear that once you get deep into a domain then simplicity is in the eye of the beholder. So what we have been attempting to do in STAC is really just define the core - space and time, and then leave it to implementors to define fields how they want. If we see a number of different domains and implementations doing the same thing then we'd elevate that core. Rough Consensus and Running Code to guide what goes in core.

I fully agree that we need to pull out the 'common' metadata fields, like your sun elevation and polarization examples. The question to me is how do we get to these? In my experience the 'typical' geo specification way has been to have a small group of people attempt to enumerate what @jdries calls the 'basic things'. We could do that at the next STAC sprint, or attempt it online, and get to a 'pretty good' list. But I'm sure we'd get a number of things wrong when we're reaching to domains we don't understand, and then STAC would 'feel' like a satellite data thing.

We've been attempting a different tack to get to those common ones, which is for each domain to focus on their extension. To look to the other extensions and to try to align with them, so there's constellation in both SAR and EO. And once we had critical mass of people doing things in the way they want we attempt to pull out the 'common' fields. So we start with the 'running code' (data in that format), and extract from there, instead of specifying up front 'this is what platform means for everyone'.

I don't anticipate this taking a long time - I'd just like us to have drafts of 5+ domain extensions before we start trying to extract. Similarly I don't think it needs to take a long time for an extension to reach maturity - the EO extension I think has at least 4 up to date implementations, and a couple more that just need to be updated. I just prefer that we start the discussion of what to make common from looking at a bunch of real world implementations, not what we imagine it could be.

I think we could aspire to do this for 0.7.0 or 0.8.0 - at the very least between SAR and EO and Point Cloud. We'd make a new extension, maybe call it 'extended core' or something, that doesn't have a prefix. I'm also open to another route to 'group' things - I really don't feel strongly about the ':' (though I do feel fairly strongly anti-nesting, due to most GIS software not understanding nested JSON).

@60south - would love to see what metadata fields are relevant to you and the passive microwave detection. Even if it's just a ticket describing the 'common' fields you need, that would help. My worry is that we'd define like polarization in a way that is too specific to SAR.

It sounds like there is consensus that we need some way to group/differentiate things, like from a particular vendor like landsat. And there is rough consensus to work towards some extended core common fields, that hopefully can 'feel' like part of core (though I think I'd still define them in an additional spec, instead of adding 20 fields to the core we have). And to try to get there sooner rather than later. I just desire us to be implementation driven on that extended core, with people trying out things in code and reporting back.

Thanks to everyone for sounding in on this thread. Feel free to add more here, and I'll try to create some issues against 0.7.0 to at least talk further about these things and hopefully address them.

60south commented 5 years ago

@m-mohr wrote:

As the range of allowed characters is very limited (Python for example only allows _, 0-9, a-z), the only solution I could imagine is to replace the : with a _.

Do you have a reference for this? I've been looking through the Python documentation but have come up blank.

m-mohr commented 5 years ago

@cholmes One worry with "unprefixing" at a later stage is that implementors actually have to change things. So if we say some fields of eo/sar will be "extended core" without prefixes, then all implementors would need to change their implementations even though nothing has really changed in semantics. It would be really great if we could avoid such changes and make implementors' lives easier, or we will face issues with outdated implementations more often.

@60south I was referring to the initial use case by @mundialis-dev, which was to deserialize data from STAC into Python classes, i.e. storing each property value in a variable and using the property name as the variable name. Python variable names are restricted to _, 0-9, a-z, which can be found in numerous resources, e.g. https://realpython.com/python-variables/#variable-names I was not speaking about JSON parsing in general. Python in general has no problem with parsing JSON property names that include special characters, e.g. a colon.
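The distinction can be checked with the standard library alone. In this small sketch (the sample item is made up for illustration), `json.loads` accepts the prefixed key without complaint, while `str.isidentifier` shows why the same string cannot be used as a variable or attribute name in source code:

```python
import json

item = json.loads(
    '{"properties": {"datetime": "2018-01-01T13:21:30Z", "eo:cloud_cover": 10.31}}'
)
props = item["properties"]

# Parsing into a dict works: any string is a valid dict key.
cloud_cover = props["eo:cloud_cover"]

# The same string is not a valid Python identifier, so it cannot be written
# as `obj.eo:cloud_cover` or used as a keyword argument name.
colon_is_identifier = "eo:cloud_cover".isidentifier()
underscore_is_identifier = "eo_cloud_cover".isidentifier()


class Bag:
    """Plain attribute container used only for this illustration."""


bag = Bag()
# setattr/getattr accept arbitrary strings, but dotted access is impossible.
setattr(bag, "eo:cloud_cover", cloud_cover)
via_getattr = getattr(bag, "eo:cloud_cover")
```

So the limitation applies to direct class (de)serialization, not to JSON handling as such.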

davidraleigh commented 5 years ago

From the perspective of someone implementing a proto file for STAC protobufs and gRPC services, I would prefer underscores _ instead of the colon : because the compact IRI concept is not available in proto file definitions. If we keep the JSON-LD compact IRI : operator, STAC protobufs would be diverging from the STAC spec. Also, since there is a way that users could define a JSON consuming/delivering gRPC service from proto files, that JSON would also be a malformed STAC result.

I guess that the compact IRI definition would be useful if you had lots of overlap between suffixes. Like if there was a different value for eo:gsd vs landsat:gsd for the same data item, then you could use the same operator for both. But I don't know what the case is where this applies, except maybe for eo:datetime vs landsat:datetime, but in that case you're just doubling down on a non-descriptive variable name.

I get that checking to see if a key starts with eo_ is ugly in code, but in the end most of these services might not need so much dynamism in their naming. The services will probably be hard coded to the keys they expect to be parsing.

joshfix commented 5 years ago

If we do want to consider using underscores as namespace delineators, I would recommend not using snake_case for field names. For example, we currently use eo:sun_elevation. If changed to eo_sun_elevation, it becomes less clear what the namespace is, unless you always deem it to be the text before the first underscore. However that introduces further problems if you wanted to have a namespace with more than one word: my_namespace_my_field.

At risk of being tarred and feathered, one option would be to use camelCase. You could then support a naming convention like myNamespace_myField. Not as pretty IMHO, but it works.

m-mohr commented 5 years ago

That looks really ugly. ;-) I think we could limit namespaces to one word and in this case I don't see a real problem with the underscore as separator. But in the end we still need to get to a consensus here, could also be influenced by the current aim to align more with LD stuff.

joshfix commented 5 years ago

Agreed, it's very ugly. But I would be opposed to using snake case and using an underscore to separate namespaces. It makes it impossible to programmatically determine if there even is a namespace as the field name might just be two words separated by an underscore.
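The ambiguity can be made concrete with a short sketch. The two helpers below are hypothetical and simply split a key at the first separator; the point is that the underscore version cannot tell a namespaced field from an ordinary two-word field:

```python
def split_colon(key: str):
    """Split 'namespace:field'; keys without ':' have no namespace."""
    namespace, sep, field = key.partition(":")
    return (namespace, field) if sep else (None, key)


def split_underscore(key: str):
    """Naive rule: everything before the first '_' is the namespace."""
    namespace, sep, field = key.partition("_")
    return (namespace, field) if sep else (None, key)


# Colon separator: unambiguous, even with snake_case field names.
colon_namespaced = split_colon("eo:sun_elevation")
colon_plain = split_colon("start_datetime")

# Underscore separator: a plain two-word field is misread as namespaced,
# and a multi-word namespace is cut short.
underscore_namespaced = split_underscore("eo_sun_elevation")
underscore_plain = split_underscore("start_datetime")
underscore_multiword = split_underscore("my_namespace_my_field")
```

Here `underscore_plain` comes back as namespace `start` with field `datetime`, which is the failure mode described above.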

m-mohr commented 5 years ago

What do you need this for? I haven't seen an implementation yet that specifically uses the extension in the field for something, still you could determine it with the full name, checking against the fields defined in JSON schemas. Of course, that's not as easy as just splitting at the colon.

joshfix commented 5 years ago

I don't have a personal use case in mind, I'm just participating in the discussion. It seems this thread is attempting to address issues with using a colon in the JSON fields. Are you saying that there is no use in parsing namespaces or field names, or determining the difference between the two, and the only way to determine if a field belongs to a namespace is to compare every field in every known schema to each field you are processing? That just doesn't seem like a good plan. It should be extremely simple and obvious when looking at a field what the field name is and what the namespace is, and you 100% should be able to write code that can grab a properties field and distinguish the namespace from the field name.

I think most here are biased against camel case (based on early discussions at the code sprints). I personally don't care and am fine with camel or snake case. The only thing I am completely against is using the same character both as the separator between namespace and field name and as the word separator within the field name. I proposed an alternative that is admittedly ugly, but allows you to distinguish both namespace from field name and between multiple words in either the namespace or field name.

All that being said, I don't like it at all and prefer the way we have it now. I have no issues with colons being used as the separator.

m-mohr commented 5 years ago

Are you saying that there is no use in parsing namespaces

No, I just don't have a use case in mind at the moment and wondering what one could be.

and the only way to determine if a field belongs to a namespace is to compare every field in every known schema to each field you are processing?

Well, if we'd switch to underscore as separator and leave everything else as it is right now, the only viable way to get the namespace is (I think) to compare against the schemas. I'm not saying this is good practice.
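A rough sketch of what that schema comparison might involve, assuming a hand-written table `KNOWN_FIELDS` in place of the real JSON schemas (the table and the `resolve_namespace` helper are hypothetical; the field names are taken from the examples in this thread):

```python
# Stand-in for field lists that would really be read from each extension's JSON schema.
KNOWN_FIELDS = {
    "eo": {"off_nadir_angle", "cloud_cover", "sun_azimuth", "sun_elevation", "resolution"},
    "dtr": {"start_datetime", "end_datetime"},
}


def resolve_namespace(key: str):
    """Return (namespace, field) if the key matches a known extension field."""
    for namespace, fields in KNOWN_FIELDS.items():
        prefix = namespace + "_"
        if key.startswith(prefix) and key[len(prefix):] in fields:
            return namespace, key[len(prefix):]
    # No schema claims this key, so treat it as an unprefixed field.
    return None, key


eo_field = resolve_namespace("eo_sun_elevation")
dtr_field = resolve_namespace("dtr_start_datetime")
plain_field = resolve_namespace("start_datetime")
```

This only works for extensions whose schemas the client already knows, which is the cost being weighed here against simply splitting at a colon.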

and you 100% should be able to write code that can grab a properties field and distinguish the namespace from the field name.

I just don't see yet what this could be useful for and so I don't know why we need this "100%" as I haven't seen anybody doing so yet. At the moment it seems everybody is just using the whole field name for whatever operations on the values.

I think most here are biased against camel case (based on early discussions at the code sprints).

We also had a discussion in one of the telcos and decided against camel case (again).

All that being said, I don't like it at all and prefer the way we have it now. I have no issues with colons being used as the separator.

Me too, but others seem to have problems with it and now we need to figure out whether we cater for it or not. For example, if we get colons into the spec again when adapting JSON-LD, it doesn't make much sense to change our extension behavior with colons.

joshfix commented 5 years ago

I don't think it matters whether anybody is currently trying to grab the set of namespaces being used in an item, or whether anybody understands the use case. Maybe you want to know what the namespaces are so you can grab those specific schemas? Or maybe you're mapping the field names in the item to some other data that originated at the source that doesn't use namespaces and you just want the field name. Maybe a front end is displaying items from multiple collections and it needs an easy way to associate any given item's field with its extension. Just shooting from the hip here, but good design involves being prepared for the unexpected and not necessarily understanding every possible future use case. If there is no use in distinguishing between namespace and field name, then there is no point in having namespaces.
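With the current colon convention, the use cases listed above reduce to a few lines. The following sketch uses a hypothetical helper `group_by_namespace` and made-up sample values; unprefixed fields are collected under the empty string:

```python
from collections import defaultdict


def group_by_namespace(properties: dict) -> dict:
    """Group flat 'namespace:field' properties by their namespace."""
    groups = defaultdict(dict)
    for key, value in properties.items():
        namespace, sep, field = key.partition(":")
        if sep:
            groups[namespace][field] = value
        else:
            # Core fields such as 'datetime' carry no prefix.
            groups[""][key] = value
    return dict(groups)


properties = {
    "datetime": "2018-01-01T13:21:30Z",
    "eo:cloud_cover": 10.31,
    "eo:sun_elevation": 59.214247,
    "landsat:path": 107,
}
grouped = group_by_namespace(properties)

# The namespaces in use, e.g. to decide which extension schemas to fetch.
namespaces = sorted(name for name in grouped if name)
```

A front end could use `grouped` to show each field under its extension, and `namespaces` to look up the matching schemas.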

But again, if the consensus is to get rid of the colon, I was just offering one possible solution. The historical decision for snake case over camel case was one of style, but now the discussion is about function. I would be the first to vote against my own suggestion as I really like what we currently have... and I don't use Python :D. It's visually appealing and everybody is familiar with the colon separator for namespaces from XML, so it's pretty clear what is going on. I also like your point about JSON-LD using colons, which makes me even more inclined to leave things the way they are.

A quick google search provided several workarounds for the colon issue in Python. I totally understand the frustrations with the colon, and I would cringe if I had to use such workarounds in my code, however I'm not totally convinced we should revamp our entire data structure for the sake of the limitations of one programming language.

joshfix commented 5 years ago

Also, is this likely to cause an issue for python as well?
https://github.com/radiantearth/stac-spec/issues/386

m-mohr commented 5 years ago

We are basically in the same boat.

I'm not totally convinced we should revamp our entire data structure for the sake of the limitations of one programming language.

Me neither, but there are several users having problems not just with Python (see tools used by @jdries and the comment about gRPC). So it should at least be considered and discussed. (But my first impression is that an issue should be opened with the author of the tool @jdries uses as it is not fully JSON compliant?!)

Also, is this likely to cause an issue for python as well? #386

In theory, potentially yes, but in practice I doubt it. The JSON schemas are usually only consumed by the validator, not by as wide a range of tools as our catalogs are. The validator is Python-based, but it uses libraries that can handle special chars, as they are widely used in JSON Schema. Python doesn't have a problem with special chars in JSON in general; only the direct class (de)serialization doesn't work, as discussed above.

jdries commented 5 years ago

Looks like we're not there yet, so let me try to add a new idea to consider: What if we add a property to the spec that contains the list of used extensions, like extensions: ['eo','landsat'] This would be similar to the 'default namespace' concept in xml, except that we would allow multiple extensions to be imported. These are the good and less good points that I see with this idea, trying to take into account previous remarks:

Note that for me, the issue is now more about using xml style prefixes in json versus tools or libraries that should or should not support colons. Fact is that a lot of serialization formats (avro, protobuf) seem to have an issue with colons for some other reason. It's actually similar to the design choice of not introducing object hierarchies: those are also perfectly valid json, but some tools (and people) do have issues dealing with them.

By the way, thanks a lot for actually taking the time to discuss and consider this issue. I've had to deal with worse standardization processes before ;-).

m-mohr commented 5 years ago

@jdries I think your idea is very related to what is discussed here: https://github.com/radiantearth/stac-spec/issues/278 (and I think it's a good idea to raise your points there, too).

cholmes commented 5 years ago

I'm going to close this one - thanks everyone for the great discussion, but most of the community seems happy with the prefixes. I do like the idea of 'default namespaces' as a way out, but let's put that into its own issue. Happy to entertain that idea before we go to 1.0-beta. And indeed I'd say if there's more specific ideas let's just open them in their own issues. I also do want to explore more JSON-LD type implementations, which naturally namespace in a more meaningful way.