Make assets optional

m-mohr commented 6 years ago

As a follow-up for discussions in #81 and #148: We should consider making assets optional.

As a growing amount of EO data processing moves into the cloud, it is not necessary to download the data at all. Some providers only offer data after executing certain functionality or have some other access restrictions in place. Still, metadata descriptions are needed for the data. Therefore I propose to remove the requirement to have at least one asset in a catalog/item.

Precisely, I am speaking about this definition from the catalog:

All static catalogs must contain at least 1 Asset, as the point of the SpatioTemporal Asset Catalog is to link to actual data, not to just reference metadata (though it is not required that all users have permissions to access the asset).

And this definition from the Item spec:

Dict of asset objects that can be downloaded (at least one required, thumbnail strongly recommended), each with a unique key.

This change would be helpful to several providers/services, e.g. Google Earth Engine, openEO and GBDX.

matthewhanson commented 6 years ago

+1 on this, but we should still strongly encourage a thumbnail be supplied if it makes sense. Thumbnails are very useful for users in the case of EO data where the data can be screened for quality and cloud coverage prior to processing/downloading. However, for some data sources a thumbnail isn't as useful (e.g., SAR data).

cholmes commented 6 years ago

I'm hesitant on this one. Mostly because I think the problem with lots of catalogs in the past was that they didn't actually link to any data, and that was ok. This led to a lot of useless catalogs, as you couldn't actually get much data from them. I like that we encourage providers to actually put something up. And like Matt said it can just be a thumbnail - I had previously wanted an asset plus a required thumbnail.

I do believe it is ok to link to asset data that is behind auth - it's an asset with a link, but not everyone can access it. And I also believe it's ok for the assets to be instantiated after an operation is run. We do this in Planet (though we're not STAC compliant, but we provide a link before it is activated). I think there are several patterns that could be employed to give a 'link' to an asset, even if it's instantiated on the fly. Indeed I think I'd even be ok if the asset link is just an 'activation asset' that makes some other link.

But I'd love to understand the GEE / OpenEO / GBDX use case. My understanding is that GEE and OpenEO are just operating at the 'dataset' level, and that STAC doesn't really make sense there? Like there's no search of individual scenes / assets, it's just the higher level layer.

For my understanding of GBDX it generates products based on a set of operations. But I'd see you using STAC for the output of GBDX, not cataloging all the things that could be produced?

I guess in line with STAC changes in general, I'd like to see the full 'non-compliant' catalog of valuable information that can't provide any assets, that requires us to drop the requirement. And to understand why it'd be so hard for them to have some sort of link to an asset. But I'm probably inclined to say 'use the dataset spec to describe it', and the rest of STAC may just not apply.

m-mohr commented 6 years ago

Fair enough, these are certainly valid points. As the dataset spec is somehow the successor of the root static catalog, I would like to know whether you want to have this restriction also in the dataset spec. If it is fine to just operate the dataset spec in the scope of STAC and only the introduction of sub-catalogs introduces the requirement to include an asset, I am probably fine with it. Just want to clarify that for openEO. Not sure about GEE and GBDX though.

matthewhanson commented 6 years ago

GBDX requires the use of function calls with the gbdxtools library to first activate a scene (ordering), then you can perform some operations on it and the final output of that is put in an s3 bucket where it can be downloaded.

I've currently got a new library that we'll be making open-source in the next couple months called sat-gbdx which is a STAC wrapper around gbdxtools and works the same way as sat-search. It basically makes the entire GBDX catalog queryable as if it were a dynamic STAC catalog. Working with some DG folks to review it before we make it public.

But the way it works is that you query their API and it creates a FeatureCollection of STAC Items, which you can save as GeoJSON, but it contains no assets other than the thumbnail. You can then later load that saved file and download specific bands in those Items: sat-gbdx will order the scenes, request whatever processing is desired (e.g., atmospheric correction, pan-sharpening), clipped to the requested AOI, and then, when the scene is ready, populate the asset with the s3 link.
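As a rough illustration of that flow (not sat-gbdx's actual API, which is not shown in this thread), here is a minimal sketch using plain dictionaries. The Item id, the asset key `B04`, the URLs, and the helper `add_asset` are all made-up placeholders, and the field layout assumes a GeoJSON Feature with an `assets` dict keyed by name.

```python
import copy

# Hypothetical Item as it might look right after a search: a GeoJSON Feature
# whose only asset is a thumbnail. The id and URLs are made-up placeholders.
item = {
    "type": "Feature",
    "id": "example-scene-001",
    "geometry": None,
    "properties": {"datetime": "2018-06-01T00:00:00Z"},
    "links": [],
    "assets": {
        "thumbnail": {"href": "https://example.com/thumbs/example-scene-001.jpg"},
    },
}


def add_asset(item, key, href):
    """Return a copy of the item with one more asset stored under `key`."""
    updated = copy.deepcopy(item)
    updated["assets"][key] = {"href": href}
    return updated


# Later, once ordering and processing have finished, the new link is attached.
# The bucket path is an assumption made for this example.
ready = add_asset(item, "B04", "s3://example-bucket/example-scene-001/B04.tif")
print(sorted(ready["assets"]))
```

The saved Item stays valid under the "at least one asset" rule the whole time, since the thumbnail is present from the start and further assets are only added.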

cholmes commented 6 years ago

@m-mohr - yes, in my mind you can definitely just use the dataset spec without an asset. I see the dataset spec as almost 'below' the STAC spec - it describes datasets, and then you can add on core STAC to describe items within the dataset. So you can have a catalog without items, you just can't have an item without an asset.

@matthewhanson - that sounds super cool. No asset other than thumbnail is fine in the current spec, it's the truly 'no asset' (not even thumbnail) that I'm opposed to. So I think your use case is good - starts with a single asset, and when more assets are populated they get added.

m-mohr commented 6 years ago

Great, so we basically agree?!

Datasets can be standalone
Catalogs require at least one link to an item (the reference to assets seems a little misleading in the current spec) or sub-catalog
Items require at least one asset, e.g. a thumbnail
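The rules listed above could be checked with something like the sketch below. This is only an illustration, not part of the spec or of any STAC tooling: the function names are hypothetical, and it assumes links carry a `rel` of `item` or `child` (for sub-catalogs) and that each asset has an `href`.

```python
def check_item(item):
    """Return a list of problems; an empty list means the item passes."""
    problems = []
    assets = item.get("assets") or {}
    if not assets:
        problems.append("item needs at least one asset, e.g. a thumbnail")
    for key, asset in assets.items():
        if not isinstance(asset, dict) or not asset.get("href"):
            problems.append(f"asset '{key}' has no href")
    return problems


def check_catalog(catalog):
    """Return a list of problems; an empty list means the catalog passes."""
    links = catalog.get("links") or []
    # Assumed rel values: "item" for items, "child" for sub-catalogs.
    if any(link.get("rel") in ("item", "child") for link in links):
        return []
    return ["catalog needs at least one link to an item or sub-catalog"]


# A thumbnail-only item passes; an item with no assets does not.
print(check_item({"assets": {"thumbnail": {"href": "https://example.com/t.jpg"}}}))
print(check_item({"assets": {}}))
```

Datasets are deliberately not checked here, since the agreement is that they can stand alone.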

Do Items require a dataset? (That's basically one of our primary questions in the Dataset gitter)

This can be closed from my side if we basically agree on that.

cholmes commented 6 years ago

I'm good with what you say @m-mohr

I'm still back and forth on whether items should require a dataset. I definitely want to say that, but am not sure if there are use cases I'm missing. I suppose someone can make pretty lame dataset definitions, as there are few required fields.

m-mohr commented 6 years ago

Lame definitions can also be made for Items and people get very creative once they need to fill required fields they don't want to fill for whatever reason.

Regarding requiring datasets for items: I am also okay with not requiring them for now, but maybe put a strong recommendation. But I only see use cases for standalone items in case you have just like one or two items and nothing more. And in the end, for one item you could just copy over all the required data for the dataset from the item. That could be a lame dataset, depending on how you define lame... ;-)

matthewhanson commented 6 years ago

I think Datasets should always be used, but I'm not sure if the spec should actually require them. So for now we should aim for STAC being able to work without them.

cholmes commented 6 years ago

Cool. Sounds good on strongly recommended - let's remember to get that language into the spec.

I'll close this issue, as it sounds like we're aligned that an asset is required, but it's cool if it's just a thumbnail.

m-mohr commented 6 years ago

Have the "strongly recommended" on my dataset to-do list.

radiantearth / stac-spec

Make assets optional #187