radiantearth / stac-api-spec

SpatioTemporal Asset Catalog API specification - an API to make geospatial assets openly searchable and crawlable
http://stacspec.org
Apache License 2.0

Collection Search #145

Closed TomAugspurger closed 1 year ago

TomAugspurger commented 3 years ago

This is a feature-request / planning issue for adding collection-level search to the STAC API spec, if it's deemed in scope. This would mirror item-level search.

I'll leave detailed description of what this would look like to people who are more experienced with STAC and OGC API - Features, but I want to share a couple use-cases to help drive the discussion.

  1. Some datasets might have collection-level assets and no items (e.g. Zarr / NetCDF datasets) and might wish to expose some kind of search over many related datasets. For example, https://esgf-index1.ceda.ac.uk/search/cmip6-ceda/ shows a page where users can search over various properties of a collection of models (Institution ID, experiment ID, resolution, etc.)
  2. You might have a collection of related products (e.g. MODIS on GEE), and want to find the collection with some specific property (e.g. snow cover).
  3. In the chat @duckontheweb mentioned cataloging ML training datasets, and the ability to find collections by keyword and spatial extent. For ML training, users likely want the whole training dataset rather than individual items.

cholmes commented 3 years ago

Yeah, this is definitely in scope, and has been talked about a number of times. I think the current thought is to try to get the API spec to 1.0.0 as the priority. But if people are able to work on it and there's a solid proposal then we could try to squeeze it in for 1.0.0.

My main thinking on it has always been that 'collection search' is clearly the domain of OGC CSW, and its latest incarnation is OGC API - Records, so I've personally been waiting on that, and trying to nudge it in the right direction.

I think they've been aiming for a full standard with every detail specified. I've been interested in making a stripped-down version that just takes the core fields they've articulated in https://github.com/opengeospatial/ogcapi-records/blob/master/core/standard/clause_7_core.adoc#response-5 (scroll down to table 8), makes the GeoJSON representation clear, and also adds an 'OGC Collection Metadata' mini-spec that just says how those fields go into an OGC Collection. Then the STAC endpoint for collection search would just be an endpoint (either a special Features API collection, collections/records/, or a collection-search/ to mirror item-search) that has all the same params as the other endpoints.

I think I may have some scope to write up the 'simple' OGC content spec as part of my OGC Fellowship, and it'd be great if people started experimenting with it in STAC.

philvarner commented 3 years ago

To record a case from the biweekly STAC mtg this morning:

For ML use cases, it is useful to record which type of data a collection holds, e.g., source, training, testing, production, etc. A UI may then want to select only one of these types to display, e.g., the production runs.

m-mohr commented 2 years ago

This is requested quite frequently and we discuss it often, but no one has had the time yet to spec it out. I'm wondering whether we should spend some PSC money to have someone work on this specifically. cc @cholmes and other PSC members

Related issue in OGC API - Commons: https://github.com/opengeospatial/ogcapi-common/issues/69

chiarch84 commented 2 years ago

I totally agree on the need to have a search functionality for Collections. I think that the use cases described by @TomAugspurger are pretty common.

In particular from my side I would want to be able to search collections by:

The USGS search page shows, through its GUI, some nice possibilities for searching their catalogue (mainly made up of collections). Once the collection APIs are defined, they could be used in the STAC Browser for filtering existing collections. In our specific case, for example, we have about 400 collections and users have no way to find the ones most relevant to them. They can only browse the structure, trying to find the catalogues most likely to contain the collections they're looking for, and hope to find them.

In general, when writing the specification for the collection APIs, I would follow the same approach used for the Item Search APIs, with query parameters that closely resemble the queries done through the Elasticsearch APIs (with sortFields, filter, fields).

mkincyan commented 2 years ago

What is the status of this issue? Are there any extensions being worked on for this? We are planning to implement something of this nature.

m-mohr commented 2 years ago

No one has had the time yet to really work on this. It would be nice to get a proposal out, so I'd be happy if you could start the process.

chiarch84 commented 2 years ago

I'm not really sure what you need as a proposal. I can write here an idea of what I would expect from this kind of search. Copying from the Item Search spec, I would expect something like the following:

GET /collections/search with possible fields for searching:

The result would be a list of Collections.

Of course, if we have two searches, one for collections and one for items, we will probably have to either differentiate the URLs (/collections/search and /items/search) or keep a single /search method with a mandatory parameter type that takes the value "items" or "collections" to indicate which type of object to search.

Let me know if you need something different to start from.

m-mohr commented 1 year ago

I don't think it needs a separate endpoint such as GET /collections/search. I think GET /collections can simply be extended to support the additional query parameters, etc.
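To illustrate what extending GET /collections with query parameters could look like from a client's side, here is a minimal sketch. The host name and the parameter names (bbox, datetime, limit) are assumptions borrowed from Item Search for the sake of the example; nothing in this thread fixes them for collections.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_collections_query(base_url, bbox=None, datetime=None, limit=None):
    """Build a GET /collections URL with optional search parameters.

    The parameter names mirror Item Search and are assumptions here,
    not something a collection search spec has settled on.
    """
    params = {}
    if bbox is not None:
        # bbox is sent as a comma-separated list of numbers
        params["bbox"] = ",".join(str(v) for v in bbox)
    if datetime is not None:
        params["datetime"] = datetime
    if limit is not None:
        params["limit"] = str(limit)

    url = base_url.rstrip("/") + "/collections"
    if params:
        url += "?" + urlencode(params)
    return url


# Hypothetical API root, used for illustration only
example_url = build_collections_query(
    "https://stac.example.com", bbox=[-10, 35, 5, 45], limit=10
)
example_params = parse_qs(urlsplit(example_url).query)
```

Without any parameters the function returns the plain /collections URL, so existing clients that only list collections would be unaffected.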

m-mohr commented 1 year ago

I've just created a repo for Collection Search so that we can create and discuss issues there: https://github.com/stac-api-extensions/collection-search

chiarch84 commented 1 year ago

I just thought it should be similar to Items search.

It is not very intuitive to have /search meaning to search in items and /collections with additional query parameters for searching in collections. I think the user would get confused.

m-mohr commented 1 year ago

I guess that's the legacy we need to live with; a top-level /items might have been better. But I don't see a good reason why we should add a /collections/search. It doesn't resolve the ambiguity of /search and it would clash with the /collections/:id path. We should also ask OGC what they would use...
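The path clash can be shown with a toy route matcher. This is only a sketch of generic path-template matching, not how any particular STAC server routes requests; the template names are assumptions made for the example.

```python
import re


def compile_template(template):
    """Turn a template such as "/collections/{collection_id}" into a regex."""
    # Each {name} placeholder matches exactly one path segment
    pattern = re.sub(r"\{(\w+)\}", r"(?P<\1>[^/]+)", template)
    return re.compile("^" + pattern + "$")


def matching_routes(path, templates):
    """Return every template that matches the given path."""
    return [t for t in templates if compile_template(t).match(path)]


# Hypothetical route table containing both the existing collection route
# and the proposed search route
ROUTES = ["/collections/{collection_id}", "/collections/search"]

ambiguous = matching_routes("/collections/search", ROUTES)
unambiguous = matching_routes("/collections/sentinel-2", ROUTES)
```

Here `/collections/search` matches both templates, so a server would have to rely on route ordering, and a collection whose id is literally "search" could no longer be addressed.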

philvarner commented 1 year ago

One downside of only supporting GET /collections with parameters, like the /collections/{c_id}/items endpoint has now, is that POST with a large GeoJSON intersects would not be allowed.
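A rough sketch of why a large intersects geometry is awkward over GET: the geometry has to be URL-encoded into the query string, which grows quickly with the number of vertices, whereas POST carries it in the request body. The host name and the use of an intersects parameter on /collections are assumptions for illustration.

```python
import json
import math
from urllib.parse import urlencode


def circle_polygon(n_vertices, radius=10.0):
    """Return a GeoJSON Polygon approximating a circle around (0, 0)."""
    ring = []
    for i in range(n_vertices):
        angle = 2 * math.pi * i / n_vertices
        ring.append([round(radius * math.cos(angle), 6),
                     round(radius * math.sin(angle), 6)])
    ring.append(ring[0])  # GeoJSON rings must be closed
    return {"type": "Polygon", "coordinates": [ring]}


def get_url(base_url, geometry):
    """Encode the geometry as a URL-encoded JSON query parameter."""
    query = urlencode({"intersects": json.dumps(geometry, separators=(",", ":"))})
    return f"{base_url}/collections?{query}"


def post_body(geometry):
    """The same search expressed as a JSON request body."""
    return json.dumps({"intersects": geometry})


# Hypothetical API root, used for illustration only
small_url = get_url("https://stac.example.com", circle_polygon(4))
large_url = get_url("https://stac.example.com", circle_polygon(1000))
```

With a thousand vertices the URL runs to many thousands of characters, which is the kind of request that servers and proxies may reject on length alone.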

FWIW, the actual path of the endpoint shouldn't matter, since clients should be picking it up from the Landing Page links via a link relation anyway. I'd be in favor of /collection-search with a custom link rel of something like https://api.stacspec.org/v1.0.0-rc.1/extensions/collection-search/rel/search.
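As a sketch of that discovery step, a client could look up the search endpoint from the Landing Page links instead of hard-coding a path. The landing page document and host below are made up for the example, and the link relation is the one floated in this comment, not a registered value.

```python
def find_link_href(landing_page, rel):
    """Return the href of the first link with the given relation, or None."""
    for link in landing_page.get("links", []):
        if link.get("rel") == rel:
            return link.get("href")
    return None


# Relation suggested in the comment above; treat it as a placeholder
COLLECTION_SEARCH_REL = (
    "https://api.stacspec.org/v1.0.0-rc.1/extensions/collection-search/rel/search"
)

# Hypothetical Landing Page, trimmed to the links relevant here
landing_page = {
    "links": [
        {"rel": "self", "href": "https://stac.example.com/"},
        {"rel": "data", "href": "https://stac.example.com/collections"},
        {"rel": COLLECTION_SEARCH_REL,
         "href": "https://stac.example.com/collection-search"},
    ]
}

search_href = find_link_href(landing_page, COLLECTION_SEARCH_REL)
```

Because the client only depends on the relation, the server is free to expose the endpoint at /collection-search, /search-collections, or anywhere else.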

m-mohr commented 1 year ago

Hmm, why is it not allowed? It conflicts with Transaction, but on the other hand, you can (in theory) do content negotiation to avoid the conflict.

Why do we require Item Search to be at /search when it is available via links anyway? @philvarner

m-mohr commented 1 year ago

I pushed up a very lightweight and high-level description of a potential Collection Search README. Written in like 30 mins, so feel free to discuss any changes and things that don't make sense. PRs welcome. There's likely a lot. For now, I just used /search/collections as the endpoint, but also asked @pvretano for his thoughts because I still think GET /collections would naturally be the best choice.

https://github.com/stac-api-extensions/collection-search/blob/main/README.md

philvarner commented 1 year ago

> Hmm, why is it not allowed? It conflicts with Transaction, but on the other hand, you can (in theory) do content negotiation to avoid the conflicts.

It would conflict with a Collections Transaction extension (though not the Item Transaction extension), which I think we want to do. I think requiring content negotiation makes this too complex.

> Why do we require Item Search to be at /search when it is available via links anyway? @philvarner

I brought this up in the past, and I think the resolution was that explicitly defining it to be /search makes the OpenAPI definition feasible. But we could say that that endpoint name is just an example of what could be used.

philvarner commented 1 year ago

I think we could support GET /collections and GET & POST /search-collections (as indicated specifically by a link rel).

m-mohr commented 1 year ago

So OGC API is using GET /collections only, no POST for larger payloads.

So if we want to inherit from them, we need to do GET /collections and the POST equivalent for searching would be an issue that we need to solve in STAC ourselves, e.g. via content negotiation.

This also means you re-use the "data" relation type and you can use conformance classes to detect whether it supports additional queries etc.
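A minimal sketch of that detection, assuming the Landing Page advertises its conformance classes in a conformsTo array. The conformance URI and the sample documents are hypothetical, since no class had been defined at this point in the discussion.

```python
# Hypothetical conformance class URI, used for illustration only
COLLECTION_SEARCH_CONFORMANCE = "https://example.com/conformance/collection-search"


def supports_collection_search(landing_page,
                               conformance_uri=COLLECTION_SEARCH_CONFORMANCE):
    """Check whether the API declares the given conformance class."""
    return conformance_uri in landing_page.get("conformsTo", [])


def collections_href(landing_page):
    """Return the href of the existing "data" link, i.e. /collections."""
    for link in landing_page.get("links", []):
        if link.get("rel") == "data":
            return link.get("href")
    return None


# Made-up Landing Pages: one that declares the class, one that does not
searchable_api = {
    "conformsTo": [COLLECTION_SEARCH_CONFORMANCE],
    "links": [{"rel": "data", "href": "https://stac.example.com/collections"}],
}
plain_api = {
    "conformsTo": [],
    "links": [{"rel": "data", "href": "https://stac.example.com/collections"}],
}
```

Both APIs expose the same "data" link; only the conformance declaration tells a client whether it may append search parameters to that URL.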

See https://github.com/stac-api-extensions/collection-search/issues/2 for details...

m-mohr commented 1 year ago

I'd propose continuing discussions about Collection Search in https://github.com/stac-api-extensions/collection-search/issues to streamline the discussion around more specific issues.