opendatacube / datacube-explorer

Web-based exploration of Open Data Cube collections
Apache License 2.0

Explorer STAC API search issue: only returning max of 20 items #575

Closed robbibt closed 4 months ago

robbibt commented 4 months ago

A user on LinkedIn and @alexgleith have encountered a possible bug in our Explorer STAC search API (see link here).

If you run a simple query of DEA's Sentinel-2 data from December 2023 to February 2024, you only get back data up to 17 January, even though later data definitely exists:

import pystac_client, odc.stac

client = pystac_client.Client.open("https://explorer.sandbox.dea.ga.gov.au/stac")

# Search for items in the collection
collections = ["ga_s2am_ard_3", "ga_s2bm_ard_3"]
query = client.search(
    collections=collections,
    bbox=[146.04, -34.30, 146.05, -34.28],
    datetime="2023-12-01/2024-02-28",
)

# Search the STAC catalog for all items matching the query
[i.properties["datetime"] for i in query.items()]


It seems that, by default, the search returns only the first 20 matching items. To get anything beyond that, the user has to manually provide a high limit, e.g.:

query = client.search(
    collections=collections,
    bbox=[146.04, -34.30, 146.05, -34.28],
    datetime="2023-12-01/2024-02-28",
    limit=1000,
)


This isn't typical behavior for STAC searches: normally pystac_client automatically follows "next" page links to give the user every dataset matching their query, so the user definitely isn't limited to a tiny number like 20.
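To illustrate the paging behavior described above, here is a minimal offline sketch of what a client does when it follows "next" links. Everything in it is a hypothetical stand-in: the `fake://` URLs, the in-memory `ALL_ITEMS` catalogue and the `fetch_page` and `iter_items` helpers are invented for illustration and are not pystac_client internals or the Explorer API.

```python
from typing import Iterator

PAGE_SIZE = 20  # assumed server page size, mirroring the 20 seen in this issue
ALL_ITEMS = [{"id": f"item-{n}"} for n in range(45)]  # made-up catalogue of 45 items


def fetch_page(url: str) -> dict:
    """Hypothetical stand-in for an HTTP GET against a STAC search endpoint."""
    offset = int(url.rsplit("=", 1)[1])
    features = ALL_ITEMS[offset:offset + PAGE_SIZE]
    links = []
    # Advertise a "next" link only while more items remain after this page
    if offset + PAGE_SIZE < len(ALL_ITEMS):
        links.append(
            {"rel": "next", "href": f"fake://search?offset={offset + PAGE_SIZE}"}
        )
    return {"type": "FeatureCollection", "features": features, "links": links}


def iter_items(start_url: str) -> Iterator[dict]:
    """Yield every item by following "next" links until none is left."""
    url = start_url
    while url is not None:
        page = fetch_page(url)
        yield from page["features"]
        url = next(
            (link["href"] for link in page["links"] if link["rel"] == "next"),
            None,
        )


first_page = fetch_page("fake://search?offset=0")["features"]
everything = list(iter_items("fake://search?offset=0"))
print(len(first_page), len(everything))
```

If the first response carries no "next" link, the loop stops after one page, which would look exactly like the truncated result reported here.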

It looks to me like Explorer might be using the DEFAULT_PAGE_SIZE of 20 to define the absolute limit of datasets returned. This doesn't appear to follow the correct STAC API approach (see Slack conversation here and STAC API docs here). I can see this line, which seems like it might be the source of the issue - it seems to use DEFAULT_PAGE_SIZE if no limit is provided: https://github.com/opendatacube/datacube-explorer/blob/3cdcf98a7394eb85566609a4f9cbf6f22009b722/cubedash/_stac.py#L433
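The difference between a page size and an overall cap can be sketched from the server side as follows. This is not Explorer's code and not a description of any eventual fix: `build_page`, `MAX_PAGE_SIZE` and the `_o` offset parameter are hypothetical, and only the name `DEFAULT_PAGE_SIZE` and its value of 20 come from the discussion above. The idea is that `limit` controls how many items go into one response, while a "next" link lets the client keep going.

```python
from typing import Optional

DEFAULT_PAGE_SIZE = 20  # value discussed in this issue
MAX_PAGE_SIZE = 1000    # hypothetical upper bound on a single page


def build_page(matches: list, limit: Optional[int] = None, offset: int = 0) -> dict:
    """Return one page of results, plus a "next" link when more remain."""
    # Treat limit as a per-page size, not as a cap on the whole result set
    page_size = min(limit or DEFAULT_PAGE_SIZE, MAX_PAGE_SIZE)
    features = matches[offset:offset + page_size]
    next_offset = offset + page_size
    links = []
    if next_offset < len(matches):
        links.append(
            {
                "rel": "next",
                # Hypothetical URL layout; a real API defines its own paging token
                "href": f"/stac/search?limit={page_size}&_o={next_offset}",
            }
        )
    return {
        "type": "FeatureCollection",
        "features": features,
        "numberMatched": len(matches),
        "numberReturned": len(features),
        "links": links,
    }


matches = [{"id": f"dataset-{n}"} for n in range(33)]  # made-up search result
page1 = build_page(matches)             # no limit given: 20 items and a next link
page2 = build_page(matches, offset=20)  # remaining 13 items, no next link
print(page1["numberReturned"], page2["numberReturned"])
```

With this shape, a client that asks for no limit still receives only 20 items per response, but can reach all 33 by following the link.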

As it is, I think the current functionality is confusing to our users - they will naturally expect to get back all items matching their query (at least up to some sensibly high limit, definitely not 20), and only getting back half the time series is pretty unexpected.

robbibt commented 4 months ago

For reference, doing a similar search on either Element 84's Earth Search or Microsoft Planetary Computer's STAC APIs successfully returns all relevant datasets with no restrictive limit:

import pystac_client, odc.stac

# Either catalogue works; swap the comment to try the other one
# catalogue = "https://planetarycomputer.microsoft.com/api/stac/v1"
catalogue = "https://earth-search.aws.element84.com/v1"

client = pystac_client.Client.open(catalogue)

# Search for items in the collection
collections = ["sentinel-2-l2a"]
query = client.search(
    collections=collections,
    bbox=[146.04, -34.30, 146.05, -34.28],
    datetime="2023-12-01/2024-02-28",
)

# Search the STAC catalog for all items matching the query
[i.properties["datetime"] for i in query.items()]
