Rename or Remove the Collection Extension to be less confusing

omad commented 6 years ago

At STAC Sprint 3 there's been lots of discussion about the Collection extension, mostly coming down to it having a confusing name.

I and several others found it very confusing, when first learning about STAC, what the difference is between a Catalogue and a Collection. Across the entire field of data publishing, "collection" is a common term with many different meanings.

It seems pretty universal that it's a useful extension to have, but we need to find a new name.

Proposals:

Scripture
Overlay
Inherits from
Common fields
CX

m-mohr commented 6 years ago

Commons

Everything starting with C would allow us to keep the c: prefix, so no changes to the files/implementations would be required.

matthewhanson commented 6 years ago

Of all these, I think "commons" is the best, although really I still prefer Collection because by looking at the Collection record you see immediately what are the common things that define every Item that belongs to it...that is the Collection of Items. It is descriptive. A catalog is a bunch of items, that may be split among several collections.

If the name is changed, will it really be immediately apparent what everything is, or will you still need to review the documentation? Do you think that the new name will remove the ambiguity and confusion?

Really not fond of Scripture at all, I've never heard the term used in that way and was unable to find a dictionary definition for it online that matched the idea of repeating data.

hgs-msmith commented 6 years ago

"Scripture" was used as a deliberately ridiculous name to force us to give this concept a proper name ;^)

m-mohr commented 6 years ago

Collection doesn't seem to be as intuitive as one might think. Several people at Sprint 3 explicitly mentioned that they found it confusing. They probably had a different concept in mind. When I first read through STAC some time ago, I thought collections were products/datasets, as the name is used by CNES and NASA, and it took several issues and chats on Gitter to clarify that. During the STAC sprint, we were forced several times during discussions to clarify whether the collection extension or collection as a name for datasets was meant. I support renaming it to Commons or something similar. Matthew, you also defined it as "the common things" and others were also clarifying it as "common fields" or "non-repetitive fields" during discussions, so that's why I think Commons is the best choice. Collection itself is just overloaded as a term for so many things that you still need to check the documentation. That's why we also did not use Collection as a term for collection-level metadata, but took Dataset instead.

cholmes commented 6 years ago

+1 on 'commons', or 'common fields'. And yes, I think a new name will remove ambiguity and confusion.

Could even be more explicit - it's the 'Common Fields extension'. And call the json file the 'common field definition file'.

"links": [ { "rel":"common-fields", "href": "http://landsat-pds.s3.amazonaws.com/L8/common-fields.json"} ],

is a lot more self explanatory to me than:

"links": [ { "rel":"collection", "href": "http://landsat-pds.s3.amazonaws.com/L8/L1T-collection.json"} ],

cholmes commented 6 years ago

@m-mohr - would you be up to make a PR on this one? I like the idea of keeping the c: prefix. Though I do think it'd be good to change the 'rel' name and file names in the examples.

m-mohr commented 6 years ago

@cholmes Yes, I can make a PR next week.

Which name are we going with? I like both, but I also thought that it could be more explicit than "Commons", but I had "Item Commons" in mind which sounds a bit odd to me.

fredliporace commented 6 years ago

I'd go with 'common-fields'

simonff commented 6 years ago

So what exactly is the difference between Collection and Dataset now? They both can have a description or band names.

I believe Scripture was also meant to contain the common asset layout (eg, each item has two thumbnails), but https://github.com/radiantearth/stac-spec/blob/master/extensions/stac-collection-spec.md does not mention how to do that.

m-mohr commented 6 years ago

@simonff Just to make sure we are speaking about the same thing: This issue is about common fields for a subset of items (= collection extension / Scripture). This is different to our dataset spec, which makes claims for the whole product. Indeed, this can be similar for certain datasets.

Seems like this is a good example why the name needs to be changed.

I think the asset layout was an idea that has yet to be defined and included into the spec. Maybe there should be an issue for that?

simonff commented 6 years ago

What's a concrete example of a collection of items that have common fields, but are not a dataset?

m-mohr commented 6 years ago

That leads to the question: Should STAC items repeat the content from catalogs (or datasets once specified), e.g. licensing information? /cc @cholmes

@simonff I did not mean to say that these subsets of items are not datasets. Of course, all items usually belong to a dataset, but a subset of the items in the dataset could have diverging fields. For example, all items from 2010 until 2015 have bands 1 to 8, but afterwards have only bands 1 to 7 as 8 got defective or so. Then you would define respective collections for these subsets of items to indicate this difference, but the dataset itself would comprehensively cover bands 1 to 8 as 8 is available for a subset of items.
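A rough sketch of that situation, with made-up collection ids and band names (none of these identifiers come from the spec): each collection of items lists only the bands its subset actually has, while the dataset would list the union.

```python
# Hypothetical collections of items within one dataset.
# The ids and band names are illustrative only.
collections = {
    "items-2010-2015": {"eo:bands": ["1", "2", "3", "4", "5", "6", "7", "8"]},
    "items-after-2015": {"eo:bands": ["1", "2", "3", "4", "5", "6", "7"]},
}


def dataset_bands(collections):
    """Return the union of all bands, preserving first-seen order."""
    seen = []
    for fields in collections.values():
        for band in fields["eo:bands"]:
            if band not in seen:
                seen.append(band)
    return seen


# The dataset-level list covers band 8 even though later items lack it.
bands = dataset_bands(collections)
```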

simonff commented 6 years ago

Right, see #184 that I just created independently. We probably need more examples.

matthewhanson commented 6 years ago

@m-mohr The licensing I think is a different issue. I can see a case where items in a catalog might have different licenses (@jeffnaus mentioned this during Sprint 1), but let's keep it separate.

What I was trying to say (poorly) the other day in the group meeting was that I can see Common-fields as simply part of a dataset. The dataset does include its own fields that describe the entire set of data as a whole, what you might call gestalt fields such as the spatial and temporal extent of the entire collection.

However, other fields in the dataset do in fact apply to every single item. License is one that is appropriate to each individual item. With the dataset concept I'm not sure Common-fields make sense any more. If I look at the dataset it would be very useful to be able to see the definition for all of the fields that are common across the items. For example, it's way more user friendly to look at the dataset definition and see all of the eo:bands specs, resolution, and know that it applies to all the items that are part of that dataset.

@m-mohr In your example of some assets not having all the bands, that's fine - it would still be part of the same dataset because the later assets just simply would not define assets that contain the defective band. The eo:bands field is the list of possible bands that are referred to by assets.

Take a look at sat-api's Sentinel-2 collection and tell me that it doesn't include everything a user wants to know about the Sentinel-2 dataset: https://sat-api.developmentseed.org/search/stac?c:id=sentinel-2-l1c It just doesn't include those gestalt fields that describe the dataset as a whole.

I don't think it's worth having both Datasets and the idea of common-fields. I think instead Datasets should be defined with a specific set of gestalt fields and allow the data provider to add any other fields they choose that describe the data. Otherwise users have to look through multiple records (Dataset definition, Common-field definitions) in order to find simple information such as bands, resolution, instrument name, etc.

m-mohr commented 6 years ago

The question (Should STAC items repeat the content from catalogs?) is still valid, I think. License was just an example. But that's probably just a matter of defining a best practice and is nothing that should be directly in the spec.

I agree that common fields may be superfluous once we have the dataset spec. I never really understood why the collections were in separate files and not just in the catalogs. But honestly, I have not too much experience with a wide variety of datasets to really oversee whether there may be cases where independent collections would make sense.

Though, the merging would not really work any longer, I think. That could be a problem when indexing items. At least, I think I remember @hgs-trutherford mentioned something like this.

The sat-api example has most information available, but we defined some additional fields for the dataset spec. Many of them are just adopted from the catalogs though. @matthewhanson

matthewhanson commented 6 years ago

@m-mohr I'm not sure what you mean by merging. Do you mean when you merge together an Item with the Dataset record to get a complete record?

I think that can still work. The dataset fields that apply to the dataset as a whole (the gestalt fields) should have their own prefix (such as dataset:). You could even still merge them into the record and the user would see those dataset-level fields in each Item, or they could just not include those dataset fields when merging.

I'm not sure about the separate files vs catalogs because I haven't dealt much with the flat files case. In the case of sat-api, collections is its own endpoint, and is searchable so you can search through available collections: https://sat-api.developmentseed.org/collections

m-mohr commented 6 years ago

Yes, the collection extension mentions that the collection data can be merged into items.

Maybe we talk at cross purposes. I thought you proposed common fields would be replaced by datasets and all fields would apply to the items. I don't get why you would need the dataset: prefix. Also, would fields from extensions have two prefixes, i.e. dataset:eo:epsg?

Personally, I don't need merging, but I am pretty sure there were discussions around requiring merged items in the context of indexing them.

matthewhanson commented 6 years ago

@m-mohr So if datasets are their own records, completely separate from items, and all the fields are what I've been calling 'gestalt' fields, then you don't need a prefix.

But what I'm suggesting is that a dataset definition includes two things:

Gestalt fields. These are fields that apply to the dataset as a whole. The prefix is used to indicate those fields.
Common fields. These are fields that apply to every item, but don't apply to the entire collection of data. For instance it doesn't make sense to say that the entire set of data has an instrument field. Individual items have an instrument: a scene has an eo:instrument field indicating that the scene was collected with that instrument. Likewise, dataset:geometry is the union of all geometries from all scenes. It does not apply to an individual scene.

Merging is important in the context of searching and for ease of the end user. When I retrieve a record I don't really want a partial record that excludes all the common fields, I want those in my final record as well or else it forces me to go and retrieve another object. Likewise when I search on something like eo:instrument I want it to include all the items collected with that instrument even if eo:instrument doesn't appear in the individual record and only within the common fields.

Thus common fields that appear in the dataset would not have two prefixes. The dataset: prefix indicates the gestalt fields that apply to the dataset as a whole, and all other fields are common fields.
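A minimal sketch of that merge, assuming the dataset: prefix proposed above and assuming that a field set on the item itself overrides the common field of the same name (the thread does not settle precedence). The function name and sample values are illustrative, not part of any spec.

```python
DATASET_PREFIX = "dataset:"  # assumed prefix for gestalt fields


def merge_item(item_properties, dataset_fields, include_gestalt=False):
    """Return a complete record: common fields plus the item's own fields.

    Gestalt fields (those with the dataset: prefix) are skipped unless
    include_gestalt is True. Item fields take precedence (an assumption).
    """
    merged = {
        key: value
        for key, value in dataset_fields.items()
        if include_gestalt or not key.startswith(DATASET_PREFIX)
    }
    merged.update(item_properties)
    return merged


# Illustrative records; the values are placeholders.
dataset = {
    "dataset:geometry": "union of all scene footprints",
    "eo:instrument": "OLI_TIRS",
    "license": "PDDL-1.0",
}
item = {"datetime": "2018-08-01T00:00:00Z", "eo:cloud_cover": 12}

record = merge_item(item, dataset)
```

With a record merged this way, a search on eo:instrument can match the item even though the field only appears among the common fields.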

I'm concerned that if we keep the concept of common-fields separate from datasets it's going to make implementation terribly difficult, and also conceptually harder for the user to grasp the different records that are used.

cholmes commented 6 years ago

I think it's an interesting idea, but I fear it's overloading too much conceptually.

A dataset is the fields that describe the dataset. Saying that most all of them are common fields, and then special casing some 'gestalt' fields seems complicated. Also making the dataset be the only place to specify non-repeating fields may make it so implementations artificially break things up into different datasets to be able to make use of certain non-repeating fields.

I think I do like the idea of having an item roughly 'inherit' information from the 'dataset', but I think I stop short of saying that those are 'part' of the item, which is what the common-fields (aka collection) extension means to me.

But I don't think it should be a requirement that every search implementation should need to be able to merge all the fields in the 'dataset'. The common-fields extension is the implementor saying explicitly 'you should merge these fields'. That these are things individual items should be able to search on.

Perhaps we could have it as like an 'opt-in' attribute on the dataset. But it just feels cleaner to me to have a dataset link to its 'common-fields' in a separate file, which is those that should definitely be merged.

I think it should be an extension. If we say that dataset has to be a 'merge' thing then that puts a much bigger requirement on implementations. We should enable more naive implementations, where users first just search for the dataset they want, and then they can search for the fields within it. It's harder to make that an 'extension' - or rather it's a weird extension that says 'if you use this extension then these dataset fields are treated differently'.

The core should be 'we have dataset fields to describe sets of items', and then an extension is 'items enable search of fields that they share with others'.

Then you can have a dynamic catalog that crawls a static catalog, and it adds the 'dataset fields' to the 'common-field' extension, that makes it clear you can search on them.

matthewhanson commented 6 years ago

@cholmes you bring up some good points. I think that if we can include a link and attribute on the dataset to common-fields (as an extension), then that resolves my main concern which is that as a user I want to be able to look at a dataset and be able to tell what common fields are (most importantly, the bands for EO data) without having to first go to one of the items in the dataset and then follow the link to common-fields. That's very unwieldy and makes it non-trivial for clients to be able to work generally with datasets. You can't be guaranteed that there's a common set of bands across the dataset for instance.

However, keeping them separate would allow a provider to have a single dataset that has items that use more than one set of common-fields, such as a Sentinel dataset that doesn't describe bands, but with Sentinel-2A and Sentinel-2B items that each have their own unique set of common-fields. That wouldn't be my preference implementing it, I think it makes it more difficult for a user, but it would give providers that option.

m-mohr commented 6 years ago

I got somehow a little lost, but I am wondering whether the common fields would repeat information available in the dataset, e.g. EO information such as bands?

I think we may just wait for a draft of the dataset spec, which will probably have no connection to common fields yet, and then have some examples for real-world datasets and decide based on that. Until then, I'd just stick with the collection extension as is and just rename it for now. Seems like we decided to use "Common fields" instead of "Commons"? I can make a PR for that tomorrow.

matthewhanson commented 6 years ago

Well, right now eo:bands would appear in common-fields, but if it were in the dataset (which would make sense), it would actually take care of 90% of my issues here since eo:bands was the main driver behind my implementation of Collections to begin with. Because if eo:bands is moved to a dataset then there are actually very few common fields left: https://sat-api-dev.developmentseed.org/collections/landsat-8-l2/definition https://sat-api-dev.developmentseed.org/collections/sentinel-2-l1c/definition

And even of those that are left, Description, Provider, and License also would belong in the dataset. So I think the entire Collections/common-fields is becoming moot, if it's just one or two fields left it's not worth the complexity and might as well just be duplicated across Items.

m-mohr commented 6 years ago

Have you had a brief look at the dataset spec, @matthewhanson ? We propose there to simply adapt/use the whole EO extension with some additions that may also be added to the EO spec itself. So all (or most?) EO fields would be in the dataset anyway.

Provider (in an extended form) and license are available at the dataset level, too.

simonff commented 6 years ago

@cholmes: what specific problems are you actually trying to solve by adding the notion of common fields that are not clearly dataset-level fields like 'license'? (And if assets in a dataset can have varying licenses, maybe the dataset should not have a license specified at all.)

For reference, in EE we almost never use this notion. (Granted, EE is a computation engine with a primitive catalog bolted on the side, so it's not a fully representative example.) Each asset stores its own band names, projection and so on. Earlier we tried to store the common band schema on the collection to speed up some computations, and it hurt us due to valid band mismatches in some cases more than it helped in speeding up lookups - so we stopped doing that. We still have one case where we have to refer to the collection for asset-level ACL checks (assets in a collection don't have their own ACL). This is necessary, but makes the whole ACL-related code much more complicated.

So I'd be wary of adding the notion of 'common' fields unless I clearly understand why they are needed.

cholmes commented 6 years ago

Yeah, maybe let's just see the dataset spec and then see if common fields is needed after that? I was going to say that 'bands' is one that I think maybe doesn't belong in 'dataset', but I guess if you all see it there as an EO extension then maybe it makes sense?

I think the potential reason to have 'common-fields' is to enable a specific definition of what to 'merge' against, instead of having to selectively ignore things like 'spatial extent' and 'temporal extent'. So it'd basically be a way for an Item to specify that 'I'd like to enable search on these common fields, but I don't want to repeat them in every record'. I am hesitant to say that by default items should enable search on all the fields in the dataset they reference.

I think the 'bands' example is probably the important one. I'd say putting 'bands' at the dataset level (which it sounds like you're recommending against @simonff ) as an eo extension would be one way. We could have each item have its bands. I think that's where I would see 'common-fields', where a set of items could refer to its band definition with a link to its common-fields. And it could also choose to specify its own. Or you could have a set of different 'common-fields' that different items refer to.

Alternatively we could just have an 'eo bands file', with a link rel='bands' to it.

But yes, for now, I'm +1 on a PR to change the name of the collection extension. And then to see the dataset definition, and then to decide if we need that extension at all.

cholmes commented 6 years ago

@m-mohr - I'm not sure I understand how you're 'using' the whole EO spec in dataset? I think fields like cloud_cover and the azimuth stuff don't make sense in the dataset? I think you should specifically define the relevant EO fields that are valid for the dataset. And indeed it seems like it might be different ones that are required / optional.

matthewhanson commented 6 years ago

I'm thinking now that a common-fields extension is not needed if band-level metadata is in the dataset, because there are then very few fields left that it would be used for. I think we need to finalize the dataset spec, which I see as very similar to how I'm using collections right now in sat-api.

I don't see how storing bands in the dataset can be detrimental though. An item's assets refer to the bands, but are not required to use them at all; they are just there if the asset includes the bands. An item may have assets that don't refer to eo:bands at all, such as in the case of thumbnails or non-band data such as arrays of solar-azimuth per pixel as in the case of some MODIS data.

I asked this before, but are we intending for Datasets to be required? Does every item need to belong to a Dataset?

m-mohr commented 6 years ago

@cholmes :

I was going to say that 'bands' is one that I think maybe doesn't belong in 'dataset'

Why is that? I can see that there are some cases where it is better suited in items, but having a complete list of all bands in the dataset should help to find suitable imagery. And we clearly need it for our processing. As we don't have assets in openEO (and GEE?!), we need to know at the dataset level which bands are nir and red for NDVI, for example.

Alternatively we could just have an 'eo bands file', with a link rel='bands' to it.

I really don't like splitting everything in so many files. I don't even like that the collections are in their own files. That's because we would again need more API endpoints for them. Of course, that's my API point of view. For static STAC that's probably not a good argument, but still stac-browser is already firing so many HTTP requests that even more requests don't make things better/faster.

But yes, for now, I'm +1 on a PR to change the name of the collection extension.

I'll prepare one.

I think fields like cloud_cover and the azimuth stuff don't make sense in the dataset?

Well, cloud cover has some limited use on dataset level, too. It could indicate whether a dataset was pre-processed to remove clouds, i.e. a value of 0 (maybe just in an ideal world with an ideal algorithm). Azimuth doesn't make sense on dataset level, sure. But implementers would probably know that anyway and skip it.

I think you should specifically define the relevant EO fields that are valid for the dataset. And indeed it seems like it might be different ones that are required / optional.

I just don't really like repeating stuff, that's why I started with just referencing the EO extension and proposing new fields. In the end I'd prefer to have one extension that covers both datasets and items. They probably need separate sections on datasets and items though, sure.

@matthewhanson :

I don't see how storing bands in the dataset can be detrimental though.

I could only imagine(!) one issue: Is there a use case where there are assets across a dataset with the same band identifier, but different band characteristics? That would kill references by band id.

I asked this before, but are we intending for Datasets to be required? Does every item need to belong to a Dataset?

Personally, I'd vote for: yes, required, but that's probably not really decided yet? The question is: Why should there be an item without a dataset as overlying entity? I haven't seen a single item yet that didn't have a "root catalog" referenced and we basically just replace and extend the root catalogs with the dataset spec, right?

matthewhanson commented 6 years ago

I'm having a hard time thinking that eo:bands should not be in Datasets. If it's not, then I'll revert to my previous position of needing a common-fields extension, because its main purpose was all about being able to provide users with a list of available bands without having to look at individual items. If the bands are referenced in individual items then, as a user, you would have to look through every single item to get a complete list of all the bands that were available. Either that or you look at one and assume they are all the same. Ugh.

If there is a case where there are two bands with the same band identifier but different characteristics then there's a simple solution - the provider can rename the bands. If they want to use the same band id for bands that are different they probably shouldn't be in the data providing business. :-)

I also think that a Dataset should include a list of assets. This doesn't mean that all items will provide all assets, but it should be a comprehensive list of assets: the keys, the type, the bands it includes, everything except the actual href. This allows a user to look at the dataset and know what assets are available in general.

eo:cloud_cover as it is meant now doesn't belong in the Dataset, but if we do want to indicate some level of processing (ie cloud detection), then that falls into the same realm as indicating orthorectification, atmospheric correction, pan-sharpening, etc.

m-mohr commented 6 years ago

I'm having a hard time thinking that eo:bands should not be in Datasets.

Don't get me wrong: I do want to have the bands in the datasets and will fight for it. ;-) I am just trying to understand why @cholmes is sceptical about it and the mentioned reason was the only thing I could think of. So no need to revert your position - at least for me. I think we could remove the collections.

I also think that a Dataset should include a list of assets.

Not sure about this one. Are you talking about something like an asset schema? We have added a property asset_schema to the dataset placed within the EO extension. Why's that btw? Could you elaborate more on that, @simonff ?

matthewhanson commented 6 years ago

Not an asset schema, but something that indicates all the possible keys in the assets (if it stays a dictionary), their types (e.g., geotiff-cog). That way a user can look at the Dataset and see what possible assets are available in this Dataset. For example, this will tell me immediately if there's a truecolor TIFF available, if bands are separated, the format of the thumbnail, or any other asset.

Otherwise the user has to look at an individual item to see the assets, but given a single Item there's no guarantee that it is representative of all the potential assets in the Dataset. Now just like eo:bands, there's no expectation that every item has all those assets.

Use Case: I'm a user who wants a true color TIFF file of the data. I can look at the Dataset and see what assets are available as COGs with RGB bands, and what the key is. Now I can grab a bunch of records and fetch those specific assets.
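That use case could look roughly like the sketch below. The shape of the dataset-level asset listing (keys, types and bands, but no hrefs) is an assumption, since no such structure has been drafted yet; the asset keys, band names and example.com URLs are made up.

```python
# Hypothetical dataset-level asset listing: keys, types and bands, no hrefs.
dataset_assets = {
    "thumbnail": {"type": "image/jpeg", "bands": []},
    "B4": {"type": "geotiff-cog", "bands": ["red"]},
    "visual": {"type": "geotiff-cog", "bands": ["red", "green", "blue"]},
}


def find_asset_keys(assets, asset_type, required_bands):
    """Return keys of assets of the given type that contain all required bands."""
    return [
        key
        for key, asset in assets.items()
        if asset["type"] == asset_type
        and set(required_bands) <= set(asset["bands"])
    ]


def fetch_hrefs(items, key):
    """Collect the href for `key` from each item that actually provides it."""
    return [item["assets"][key]["href"] for item in items if key in item["assets"]]


keys = find_asset_keys(dataset_assets, "geotiff-cog", ["red", "green", "blue"])

# Not every item has to provide every asset listed on the dataset.
items = [
    {"assets": {"visual": {"href": "https://example.com/scene1/visual.tif"}}},
    {"assets": {"thumbnail": {"href": "https://example.com/scene2/thumb.jpg"}}},
]
hrefs = fetch_hrefs(items, keys[0])
```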

matthewhanson commented 6 years ago

Basically what I'm saying is that Datasets should give an indication of what a user can expect from Items that belong to that Dataset, both in terms of what assets are typically available as well as what spectral bands are included (if EO).

m-mohr commented 6 years ago

That sounds like a more fleshed out and complete version of the format property we previously had in the root catalogs. ;-) I agree that users should not need to look at an individual item to see the assets. Would you mind coming up with a draft for this @matthewhanson ?

matthewhanson commented 6 years ago

@m-mohr yes, will do. Got meetings all day today but I'll find some time to post an issue with some examples first so we can discuss.

cholmes commented 6 years ago

Yeah, I'm not really sure why this gives me such an uneasy feeling. It's some sort of hunch... Something around how I fear we're overloading too much but only really looking at it from one perspective - wanting really generic core, tight pieces. But most is probably mitigated by eo being an optional extension.

I think it'd be best to actually just get the dataset spec merged in sooner rather than later, to look at it in context. Even if it's not fully 'complete'. There's lots of really good ideas there and it'd be good to just have it in dev - which is meant to be an incomplete state.

I agree a draft for improvements on 'format' in the catalog as it is now would be great - thanks @matthewhanson

m-mohr commented 6 years ago

Well, I'm feeling the same as @cholmes. It is making dataset generation more complex and needs a pretty good description. On the other hand I see the benefits explained by @matthewhanson. It's really a mixed bag of feelings for me. I personally would vote neutral on this one for now, but will wait with +1/-1/+-0 until we have an example at least.

matthewhanson commented 6 years ago

Yes, just been busy here with Team meetings and prepping for workshops/talks next week at FOSS4G. Am going to try to get example and new issue posted this weekend.

joshfix commented 6 years ago

+1 to being able to look up the band info for a dataset without having to obtain an item (or all items) to try to determine what they are from the actual data. Sounds almost like a "GetCapabilities" for a dataset. This also perfectly solves some workflows we're trying to work through for which I have had to write an external band-lookup service.

ghost commented 6 years ago

Whew, just read through most of the discussion. There is a lot to take in! Personally, I'm having trouble keeping up and I am trying to stay active. Is the Dataset concept intended to be in 0.6.0? I'm concerned it is still pretty unclear on what is being proposed and what problem we are solving.

On one hand, it sounds like there is an attempt to "normalize" the EO extension as a required part of each item? There is also a concept of metadata "inheritance"? And finally, there is a "root catalog" concept for navigating the structure of a catalog that we want to include some sort of "dataset" level metadata?

I can see some overlap on those issues and I think it would be great to solve all at once, but I don't want to get too far into the weeds.

STAC for me has always been about standardizing geospatial catalogs. I have non-raster data that I would like to catalog. If a text document can be pinned to a map and has a datetime attribute, is it part of a "dataset"?

I'm concerned that we are even talking about which fields are valid to put at each level. Is every extension going to have to define two schemas? One for datasets and one for items? I'm much more of a fan of the "inheritance" model. I also think the "normalized" model of moving common metadata to a special location should be optional. Every item should stand on its own. Not everything I want to index will be part of a dataset (in my mind). If I process two items together into a datafusion product, which dataset would it be part of?

I would really like to be able to perform a query (a static crawler could implement query just as easily as an indexed catalog) that returns all related items.

Take this query for instance:

?query={"eo:bands.center_wavelength":{"gt": 700,"lt":800}}

Does that apply to both the dataset definition and the item definition? One of the tenets of STAC was Developer first.
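To make the question concrete, here is one possible reading sketched in code: an item's own eo:bands is used if present, otherwise the dataset's eo:bands is inherited. That fallback rule is an assumption for illustration, not something the spec states, and the band records (with wavelengths in the same units as the query) are made up.

```python
def band_matches(band, conditions):
    """Check one band's center_wavelength against gt/lt conditions."""
    value = band.get("center_wavelength")
    if value is None:
        return False
    if "gt" in conditions and not value > conditions["gt"]:
        return False
    if "lt" in conditions and not value < conditions["lt"]:
        return False
    return True


def item_matches(item_properties, dataset_fields, conditions):
    """Use the item's own eo:bands if present, else fall back to the dataset's."""
    bands = item_properties.get("eo:bands", dataset_fields.get("eo:bands", []))
    return any(band_matches(band, conditions) for band in bands)


# Illustrative data only.
conditions = {"gt": 700, "lt": 800}
dataset = {
    "eo:bands": [
        {"name": "red", "center_wavelength": 655},
        {"name": "rededge", "center_wavelength": 740},
    ]
}
item_inheriting = {}  # no eo:bands of its own
item_with_own_bands = {"eo:bands": [{"name": "red", "center_wavelength": 655}]}

inherits_match = item_matches(item_inheriting, dataset, conditions)
own_bands_match = item_matches(item_with_own_bands, dataset, conditions)
```

Under this reading the item without its own bands matches through the dataset, while the item that declares only a red band does not.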

I'm not sure I'm helping the discussion here, but I'm also feeling much like @cholmes in that something doesn't feel right as is.

matthewhanson commented 6 years ago

We did decide that use of a Dataset should not be required.

I get where you are coming from @hgs-trutherford , and that a query against wavelengths could be useful, but is it in the Dataset or the Item? If in the Dataset does that mean I can't put eo:bands in the Item? What if I don't want to use Datasets, does that mean the spec doesn't allow me to include eo:bands in an Item, because it's part of a Dataset?

Overall I like the concept of Datasets, on the other hand I think it's a step backward to the idea of Collections and I'm not convinced the concept of Collections has really been appreciated here, because it is in fact quite flexible, fixes the issues @hgs-trutherford has brought up, and solves the same problems Datasets aim to solve.

In essence, Collections are datasets; however, in the Collection spec nothing actually has to belong to the Collection. The STAC spec only adds fields to Items, so there's none of this confusion with eo:bands being part of the Dataset spec but not the Item spec. The EO extension simply adds new fields to an Item. A provider can then decide which of these actually apply to the collection or dataset as a whole and can normalize the data by moving those fields to the Collection, but they do not have to, and the spec stands on its own as simply a bunch of Items.

Once we start defining a Dataset and creating extensions that add to both Datasets and Items, it does create the case where Datasets become required if I want to use those fields. If instead we allow, e.g., eo:bands to be defined in either, then we're back to how Collections work, leaving it up to the provider to decide how to define them.

The flip side of all of this, leaving Datasets out completely and putting all the fields in Items leaves us with a spec that I personally don't find very useful at all. For our use cases with our sat-utils projects, it's vital that we know what bands and assets are available at the Dataset/Collection level and know that they apply to all the Items within that Dataset/Collection.

I'm fine with using the name Datasets, but as time goes on Datasets are really looking more and more like how Collections already work. Check out sat-api as an example if you haven't already: https://sat-api.developmentseed.org/search/stac https://sat-api.developmentseed.org/collections

What I want to make sure that we definitely want to avoid is dropping the concept of Collections and not adding Datasets to the next release because then I'll be left with a spec that I can no longer implement as a replacement for what I already have. I also want to avoid pushing off Datasets and end up with a large conceptual shift in the next release. We should be minimizing large changes as time goes on to avoid disruptions to existing implementations.

m-mohr commented 6 years ago

I'd like to clarify / comment on some questions from @hgs-trutherford:

Is the Dataset concept intended to be in 0.6.0?

Yes, it is.

I'm concerned it is still pretty unclear on what is being proposed and what problem we are solving.

Basically, we are trying to evolve the root catalog and describe datasets better. The removal of collections was proposed later, as the dataset could cause the collections extension to be used less frequently, and the two potentially share concepts / overlap in functionality. But it was never clear to me and others (#81) why there were collections and how they relate to catalogs / why they were not somehow merged.

On one hand, it sounds like there is an attempt to "normalize" the EO extension as a required part of each item?

Don't think so, but could be a misunderstanding.

There is also a concept of metadata "inheritence"?

Not sure whether I got your thinking here. What is inheriting from what?

And finally, there is a "root catalog" concept for navigating the structure of a catalog that we want to include some sort of "dataset" level metadata?

The basic idea was to replace the root catalog with the dataset, but it could also be just an improved version of the root catalog. Basically, the dataset is a catalog with more fleshed out fields than the root catalog. The root catalog hasn't received much love since being introduced and had several flaws. Overall, the dataset spec originates from a need to describe collection level metadata better.

I have non-raster data that I would like to catalog. If a text document can be pinned to a map and has a datetime attribute, is it part of a "dataset"?

This question is not really specific to the dataset and could also be asked for the recent version of STAC (0.5.2). Just replace "dataset" with "(root) catalog".

Is every extension going to have to define 2 schema?

That's a valid question we need to think about.

I also think the "normalized" model of moving common metadata to a special location should be optional.

It is.

Not everything I want to index will be part of a dataset (in my mind).

That's where different use cases are crashing together. Some are centered around items, some are more centered around datasets. Both are valid and should probably be catered for. The question is: How do we do that? What are the compromises?

If I process two items together into a datafusion product, which dataset would it be part of?

Maybe a new one?

Does that apply to both the dataset definition and the item definition?

Both?

What I want to make sure that we definitely want to avoid is dropping the concept of Collections and not adding Datasets to the next release

I'm pretty sure we won't do that. It would make absolutely no sense to drop collections and not have datasets.

We should be minimizing large changes as time goes on to avoid disruptions to existing implementations.

Sure, but we can only do that if everything is in good shape and well thought out. The current catalog spec seems to be just a very rough draft from the beginning, far from really being finished and very usable.

Overall, this issue is more and more shifting towards a general architecture discussion rather than just being about collections. It's good to discuss that, but probably not really the best place?

cholmes commented 6 years ago

Ok, first a recap on where I believe we are at with 'dataset'.

We released the first version of STAC with a set of optional fields in the 'catalog' to describe the items contained. In the second sprint we realized that it's useful to point at the catalog that actually uses those optional fields (and called it the 'root catalog', as it was usually at the root). In sprint 3 it became clear that there are cases where a catalog might have many 'root catalogs', so that name doesn't work so well. And we had a group work on a bunch of well thought out fields to describe a set of related items - the 'dataset'.

The core of what we need to do is clean up the 'catalog' definition and concepts. We've got lots of great pieces to work with - the fields in the dataset spec, the new /stac endpoint definition, and more actual catalog implementations. I see that as top priority. I think there's still some work on core concepts and then exactly how to explain things (do we rename the root catalog the 'dataset'? Does dataset 'extend' catalog? Or does catalog 'add' the dataset fields? etc.)

And then we're also discussing 'extensions' to the core description of fields, which I think makes sense. We define core fields that are useful, and others can do whatever they want in that same catalog file.

Then the core of this issue is whether the 'common field merging' (collection extension) wants to make an assumption that you'll always want to 'merge' the fields specified in the 'dataset'. This has to be an 'extension', as it's adding a lot of complexity that we shouldn't expect everyone to do. Either we have people explicitly set the merge fields (and thus possibly duplicate some of the dataset fields), or we implicitly assume everything in 'dataset' gets merged (which I find makes dataset a bit too overloaded - it not only describes the items it contains, it is the set of additional fields that gets merged in). Or have some system that lets providers specify they want some fields merged.

Since bands seem to be the main driving use case for much of this, this decision determines where bands will live.

m-mohr commented 6 years ago

I took some time to think about this again and I have a concrete proposal for this. All of this is just my personal opinion.

Advantages

First of all, I'd like to start with a few words about the benefits collections offer at the moment:

1. Fields can be extracted from Items and stored at a separate/central location
2. File size reduction (based on 1)
3. Less work updating items when things change (based on 1)
4. Multiple collections could potentially influence one Item (multi inheritance)
5. Collections can be merged again into Items later (e.g. for indexing)
6. Items can reference any collection across datasets so collections can be shared between multiple catalogs or datasets.

Disadvantages

The fourth point from above (multi inheritance) can be a problem, as it is not clear what happens in case of conflicts. Also, splitting into too many files can increase loading time.
Collections are somehow in conflict with the concept of datasets and catalogs and thus could confuse users.

Proposal

Let's add a new extension 'Common fields' with a prefix (common or c) for datasets and catalogs. It has one field, properties, which can hold the non-core fields (i.e. all fields with a prefix). Core fields can't be put in a collection and always need to stay in the Item, for the following reasons:

The core fields seem important and should always be readable without requesting another file
Naming conflicts, e.g. for the provider field, which has different definitions
Core fields are usually very simple and just a few

In extensions, we need to make sure that fields whose definitions differ between datasets and items have different names.

The collections extension would be removed.

Example (Simplified CBERS)

Dataset:

{
  "name":"CBERS4 MUX",
  "description":"Catalog of CBERS4 MUX camera imagery",
  "eo:bands":{
    "5":{
      "common_name":"blue"
    },
    "6":{
      "common_name":"green"
    },
    "7":{
      "common_name":"red"
    },
    "8":{
      "common_name":"nir"
    }
  },
  "eo:platform":"CBERS-4",
  "eo:instrument":"MUX",
  "eo:gsd":20.0,
  "eo:epsg":32645,
  "common:properties":["eo:platform","eo:instrument","eo:gsd","eo:bands","eo:epsg"],
  "links":[
    {
      "rel":"self",
      "href":"https://s3.amazonaws.com/cbers-stac/mux/catalog.json"
    },
    {
      "rel":"root",
      "href":"https://s3.amazonaws.com/cbers-stac/mux/catalog.json"
    },
    {
      "rel":"item",
      "href":"CBERS_4_MUX_20180709_067_101_L2.json"
    }
  ]
}

Item:

{
  "id":"CBERS_4_MUX_20180709_067_101_L2",
  "type":"Feature",
  "bbox":[42.377858,-1.508583,43.688396,-0.252353],
  "geometry":{
    "type":"MultiPolygon",
    "coordinates":[[[[42.379693,-1.336524],[43.454584,-1.499568],[43.685338,-0.435148],[42.61097,-0.272202],[42.379693,-1.336524]]]]
  },
  "properties":{
    "provider":"INPE",
    "datetime":"2018-07-09T07:23:45Z",
    "eo:sun_azimuth":46.8345,
    "eo:sun_elevation":55.0046,
    "eo:off_nadir":-0.00810879
  },
  "links":{
    "self":{
      "rel":"self",
      "href":"https://s3.amazonaws.com/cbers-stac/mux/CBERS_4_MUX_20180709_067_101_L2.json"
    },
    "catalog":{
      "rel":"catalog",
      "href":"https://s3.amazonaws.com/cbers-stac/mux/catalog.json"
    }
  },
  "assets":{
    "thumbnail":{
      "href":"https://s3.amazonaws.com/cbers-meta-pds/CBERS4/MUX/067/101/CBERS_4_MUX_20180709_067_101_L2/CBERS_4_MUX_20180709_067_101.jpg",
      "type":"jpeg"
    },
    "metadata":{
      "href":"s3://cbers-pds/CBERS4/MUX/067/101/CBERS_4_MUX_20180709_067_101_L2/CBERS_4_MUX_20180709_067_101_L2_BAND6.xml",
      "type":"xml"
    },
    "B5":{
      "href":"s3://cbers-pds/CBERS4/MUX/067/101/CBERS_4_MUX_20180709_067_101_L2/CBERS_4_MUX_20180709_067_101_L2_BAND5.tif",
      "type":"GeoTIFF",
      "format":"COG",
      "eo:bands":[
        "5"
      ]
    },
    "B6":{
      "href":"s3://cbers-pds/CBERS4/MUX/067/101/CBERS_4_MUX_20180709_067_101_L2/CBERS_4_MUX_20180709_067_101_L2_BAND6.tif",
      "type":"GeoTIFF",
      "format":"COG",
      "eo:bands":[
        "6"
      ]
    },
    "B7":{
      "href":"s3://cbers-pds/CBERS4/MUX/067/101/CBERS_4_MUX_20180709_067_101_L2/CBERS_4_MUX_20180709_067_101_L2_BAND7.tif",
      "type":"GeoTIFF",
      "format":"COG",
      "eo:bands":[
        "7"
      ]
    },
    "B8":{
      "href":"s3://cbers-pds/CBERS4/MUX/067/101/CBERS_4_MUX_20180709_067_101_L2/CBERS_4_MUX_20180709_067_101_L2_BAND8.tif",
      "type":"GeoTIFF",
      "format":"COG",
      "eo:bands":[
        "8"
      ]
    }
  }
}
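
For illustration, a client could rebuild a complete Item from the two documents above roughly as follows. This is a sketch under the assumptions of this proposal: common:properties is an array of field names whose values sit at the top level of the dataset, and assets refer to bands by key. The function names expandItem and bandsForAsset are made up for this example, and the sample objects are trimmed versions of the CBERS documents.

```javascript
// Copy the fields listed in "common:properties" from the dataset into the
// item's properties. Values already present on the item take precedence.
function expandItem(dataset, item) {
  const inherited = {};
  for (const name of dataset["common:properties"] || []) {
    if (name in dataset) inherited[name] = dataset[name];
  }
  return { ...item, properties: { ...inherited, ...item.properties } };
}

// Hypothetical helper: resolve the band keys listed on an asset to the
// band definitions that were inherited from the dataset.
function bandsForAsset(expanded, assetKey) {
  const bands = expanded.properties["eo:bands"] || {};
  const keys = expanded.assets[assetKey]["eo:bands"] || [];
  return keys.map((key) => bands[key]);
}

const dataset = {
  name: "CBERS4 MUX",
  "eo:bands": { "5": { common_name: "blue" }, "8": { common_name: "nir" } },
  "eo:platform": "CBERS-4",
  "common:properties": ["eo:platform", "eo:bands"],
};
const item = {
  id: "CBERS_4_MUX_20180709_067_101_L2",
  properties: { datetime: "2018-07-09T07:23:45Z" },
  assets: { B8: { "eo:bands": ["8"] } },
};

const expanded = expandItem(dataset, item);
console.log(expanded.properties["eo:platform"]); // CBERS-4
console.log(bandsForAsset(expanded, "B8"));      // [ { common_name: 'nir' } ]
```

Fields that are not listed in common:properties, such as name, stay on the dataset and are not copied into the Item.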

Discussion

Advantages 1, 2, 3 and 5 are still covered by this. Multi-inheritance (4) would be dropped (if you can have just one parent, which is to be discussed), but it doesn't seem to be used or well-defined (merging conflicts?) anyway. 5 will probably be a bit harder, but still manageable. We won't really get 6 completely integrated, but that also depends a little on whether we want items/catalogs to be able to have multiple parents or not.

In the end that's the only way to get the best of both worlds, I think. I thought about many options, but none of them (including this one) made me completely happy. This one seemed to be the one with the fewest drawbacks. There are probably things that I missed ( @matthewhanson ) - let's figure that out!

ghost commented 6 years ago

@m-mohr I think I would be completely on board with this proposal if we just made common:properties an object and stored the values in there.

{
  "name":"CBERS4 MUX",
  "description":"Catalog of CBERS4 MUX camera imagery",
  "common:properties":{
    "eo:bands":{
      "5":{
        "common_name":"blue"
      },
      "6":{
        "common_name":"green"
      },
      "7":{
        "common_name":"red"
      },
      "8":{
        "common_name":"nir"
      }
    },
    "eo:platform":"CBERS-4",
    "eo:instrument":"MUX",
    "eo:gsd":20.0,
    "eo:epsg":32645
  },
  "links":[
    {
      "rel":"self",
      "href":"https://s3.amazonaws.com/cbers-stac/mux/catalog.json"
    },
    {
      "rel":"root",
      "href":"https://s3.amazonaws.com/cbers-stac/mux/catalog.json"
    },
    {
      "rel":"item",
      "href":"CBERS_4_MUX_20180709_067_101_L2.json"
    }
  ]
}

This also helps with implementation. It is easier to do something like:

let merged = Object.assign({}, dataset["common:properties"], item.properties)

I agree that multiple inheritance is good to get rid of, but we still have multiple levels of "polymorphism". It seems obvious to me, but it would be good to make sure we address the child catalog concept.

I would expect the root catalog common props to be overridden by child catalogs, which are in turn overridden by item props. Ideally we wouldn't have things defined in more than one location, but I think it is important to indicate how to resolve that conflict.

let merged = Object.assign({}, rootCatalogProps, childCatalogProps, itemProps)
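
Generalized to any depth of catalogs, that precedence rule might be sketched like this. The function name resolveProperties and the sample values are hypothetical. One thing worth knowing is that Object.assign merges shallowly, so a nested object such as eo:bands is replaced as a whole rather than combined band by band.

```javascript
// catalogChain is ordered from the root catalog down to the item's direct parent.
// Later sources override earlier ones, so the item always has the last word.
function resolveProperties(catalogChain, item) {
  const layers = catalogChain.map((catalog) => catalog["common:properties"] || {});
  return Object.assign({}, ...layers, item.properties);
}

// Illustrative values only.
const rootCatalog = {
  "common:properties": {
    "eo:platform": "CBERS-4",
    "eo:gsd": 20.0,
    "eo:bands": { "5": { common_name: "blue" }, "6": { common_name: "green" } },
  },
};
const childCatalog = { "common:properties": { "eo:gsd": 10.0 } };
const item = {
  properties: {
    datetime: "2018-07-09T07:23:45Z",
    "eo:bands": { "5": { common_name: "blue" } },
  },
};

const resolved = resolveProperties([rootCatalog, childCatalog], item);
console.log(resolved["eo:platform"]);           // CBERS-4 (from the root)
console.log(resolved["eo:gsd"]);                // 10 (child overrides root)
console.log(Object.keys(resolved["eo:bands"])); // [ '5' ] (item replaces the whole object)
```

If per-band merging were wanted instead, the spec would have to say so explicitly, since a shallow merge does not provide it.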

m-mohr commented 6 years ago

@hgs-truthe01 I decided not to go this route because it duplicates the places where this data can be added. Implementors that don't have items (data cube based, such as openEO or GEE) don't store this information in common:properties. So this could end up like this:

{
  "name":"CBERS4 MUX",
  "description":"Catalog of CBERS4 MUX camera imagery",
  "eo:bands":{
    "5":{
      "common_name":"blue"
    },
    "6":{
      "common_name":"green"
    },
    "7":{
      "common_name":"red"
    },
    "8":{
      "common_name":"nir"
    }
  },
  "eo:platform":"CBERS-4",
  "eo:instrument":"MUX",
  "eo:gsd":20.0,
  "eo:epsg":32645,
  "common:properties":{
    "eo:bands":{
      "5":{
        "common_name":"blue"
      },
      "6":{
        "common_name":"green"
      },
      "7":{
        "common_name":"red"
      },
      "8":{
        "common_name":"nir"
      }
    },
    "eo:platform":"CBERS-4",
    "eo:instrument":"MUX",
    "eo:gsd":20.0,
    "eo:epsg":32645
  },
  "links":[
    {
      "rel":"self",
      "href":"https://s3.amazonaws.com/cbers-stac/mux/catalog.json"
    },
    {
      "rel":"root",
      "href":"https://s3.amazonaws.com/cbers-stac/mux/catalog.json"
    },
    {
      "rel":"item",
      "href":"CBERS_4_MUX_20180709_067_101_L2.json"
    }
  ]
}

I agree that we need to resolve the conflicts somehow and need a way to do that, but I think the most concrete one (the "smallest piece") always wins. If we have multi-inheritance from multiple catalogs on the same level, conflicting entries should just not be taken into account, I think, but we had better get rid of that.
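
As a rough sketch of that rule, assuming several parents on the same level can contribute properties: entries that the parents disagree on are dropped, and the Item, being the most concrete piece, still wins over whatever remains. The names mergeSiblings and resolve and the sample values are made up for this example.

```javascript
// Merge property objects from parents on the same level.
// Keys whose values differ between parents are left out entirely.
function mergeSiblings(parents) {
  const seen = new Map();       // key -> JSON-encoded first value
  const conflicted = new Set(); // keys with disagreeing values
  const merged = {};
  for (const props of parents) {
    for (const [key, value] of Object.entries(props)) {
      const encoded = JSON.stringify(value);
      if (!seen.has(key)) {
        seen.set(key, encoded);
        merged[key] = value;
      } else if (seen.get(key) !== encoded) {
        conflicted.add(key);
      }
    }
  }
  for (const key of conflicted) delete merged[key];
  return merged;
}

// The item is the most concrete level, so its properties are applied last.
function resolve(parents, item) {
  return { ...mergeSiblings(parents), ...item.properties };
}

// Illustrative values only: the two parents disagree on eo:gsd.
const parents = [
  { "eo:platform": "CBERS-4", "eo:gsd": 20.0 },
  { "eo:platform": "CBERS-4", "eo:gsd": 10.0, "eo:instrument": "MUX" },
];
const item = { properties: { "eo:off_nadir": -0.00810879 } };

console.log(mergeSiblings(parents));
// { 'eo:platform': 'CBERS-4', 'eo:instrument': 'MUX' }
console.log(resolve(parents, item));
// { 'eo:platform': 'CBERS-4', 'eo:instrument': 'MUX', 'eo:off_nadir': -0.00810879 }
```

Silently dropping a field is easy to implement but hard to debug, which is one more argument for allowing only a single parent.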

cholmes commented 5 years ago

Just an update on this - we're going to rename 'dataset' to 'collection', and then the extension will be called 'collection properties', and the functionality of the common fields will be in a properties object in a dataset/collection.

matthewhanson commented 5 years ago

The new Commons extension replaces the functionality that collections provided before; see PR #275.

m-mohr commented 5 years ago

Finally, solved! :-)

radiantearth / stac-spec