microsoft / PlanetaryComputer

Issues, discussions, and information about the Microsoft Planetary Computer
https://planetarycomputer.microsoft.com/
MIT License

Duplicated items from pystac_client search if n_items > paging limit #163

Closed scottyhq closed 1 year ago

scottyhq commented 1 year ago

It seems that a Planetary Computer STAC API search returns duplicated items when paging is necessary. The default search setting is 100 items per page, and it appears that subsequent pages include repeated items:

import pystac_client
import planetary_computer
import geopandas as gpd

stac_client = pystac_client.Client.open('https://planetarycomputer.microsoft.com/api/stac/v1',
                                        modifier = planetary_computer.sign_inplace, #new!
                                       )
search = stac_client.search(
    collections = ['sentinel-1-rtc'],
    bbox = [-121.68811311,   46.83262832, -120.19715332,   47.84592105],
    datetime = '2019-01-01/2019-12-31',
    limit=100, #default=100, max=1000
)

items = search.get_all_items()
print(f'Returned {len(items)} items')
# Returned 341 items

gf = gpd.GeoDataFrame.from_features( items.to_dict(), crs='EPSG:4326')
gf['stac_ids'] = [item.id for item in items]
with gpd.pd.option_context('display.max_colwidth', None):
    display(gf[gf.stac_ids.duplicated()]['stac_ids'])

#200    S1B_IW_GRDH_1SDV_20190602T140414_20190602T140439_016518_01F17B_rtc
#300    S1B_IW_GRDH_1SDV_20190212T142035_20190212T142100_014914_01BD7D_rtc
#301    S1B_IW_GRDH_1SDV_20190211T014511_20190211T014536_014892_01BCC5_rtc

The paging limit can only be increased to 1000, so a workaround is to filter the duplicates out of the search results.
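As a sketch of that workaround (an illustrative helper, not part of pystac_client), duplicates can be dropped by keeping only the first item seen for each STAC id:

```python
def deduplicate_items(items):
    """Drop repeated STAC items, keeping the first occurrence of each id.

    Works on any iterable of objects with an .id attribute, e.g. the
    ItemCollection returned by search.get_all_items().
    """
    seen = set()
    unique = []
    for item in items:
        if item.id not in seen:
            seen.add(item.id)
            unique.append(item)
    return unique
```

Applied to the search above, this would drop the three repeated ids shown in the output.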

TomAugspurger commented 1 year ago

Thanks for the report. Confirmed that I'm also seeing duplicates. We'll take a look.

scottyhq commented 1 year ago

I'll just add that perhaps I got the issue title wrong and this is restricted to certain STAC items in 2019, because we also noticed that if you omit the datetime keyword altogether, no duplicates are returned.

mmcfarland commented 1 year ago

This should also be resolved by https://github.com/stac-utils/pgstac/pull/154, and rolled out to the Planetary Computer in the next week.

mmcfarland commented 1 year ago

This has also been resolved in the latest release:

import pystac_client
import planetary_computer
import geopandas as gpd

stac_client = pystac_client.Client.open('https://planetarycomputer.microsoft.com/api/stac/v1',
                                        modifier = planetary_computer.sign_inplace, #new!
                                       )
search = stac_client.search(
    collections = ['sentinel-1-rtc'],
    bbox = [-121.68811311,   46.83262832, -120.19715332,   47.84592105],
    datetime = '2019-01-01/2019-12-31',
    limit=100, #default=100, max=1000
)

items = search.get_all_items()
print(f'Returned {len(items)} items')

gf = gpd.GeoDataFrame.from_features( items.to_dict(), crs='EPSG:4326')
gf['stac_ids'] = [item.id for item in items]
with gpd.pd.option_context('display.max_colwidth', None):
    display(gf[gf.stac_ids.duplicated()]['stac_ids'])

# Returned 348 items
# Series([], Name: stac_ids, dtype: object)

floriandeboissieu commented 8 months ago

Is it possible that the issue is still there?

The following is a reproducible example (tested several times with the same result).

There are duplicates with this search:

import pystac_client
import planetary_computer
import pandas as pd

catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1",
    modifier=planetary_computer.sign_inplace,
)
bbox = [6.542560884238387, 47.90437963762056, 6.554888159664198, 47.909116654035785]
datetime='2020-01-01/2025-12-31'
cloud_nb=30
search = catalog.search(
        collections=["sentinel-2-l2a"],
        bbox=bbox,
        datetime=datetime,
        query={"eo:cloud_cover": {"lt": cloud_nb}},
        sortby="datetime",
)
coll = search.item_collection()
print(pd.Series([i.id for i in coll]).duplicated().any())
# True

But not with:

search = catalog.search(
        collections=["sentinel-2-l2a"],
        bbox=bbox,
        datetime=datetime,
        query={"eo:cloud_cover": {"lt": cloud_nb}},
        sortby="datetime",
        limit=1000
)
coll = search.item_collection()
print(pd.Series([i.id for i in coll]).duplicated().any())
# False

TomAugspurger commented 8 months ago

The bug causing this specific issue should have been fixed. But it's possible you're running into a similar issue, which was reported at https://github.com/microsoft/PlanetaryComputer/issues/301 and is still unresolved.

floriandeboissieu commented 8 months ago

You are right: commenting out the sortby="datetime" argument returns no duplicates.
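So until that sortby issue is resolved, one workaround (a sketch, not an official fix) is to omit sortby from the request and sort the returned items client-side. Assuming the items carry uniformly formatted UTC timestamps, the ISO 8601 strings in the 'datetime' property compare correctly as plain strings:

```python
def sort_items_by_datetime(items):
    """Sort STAC items client-side by their 'datetime' property.

    ISO 8601 timestamps in a consistent UTC format ('...Z') compare
    correctly as plain strings, so no date parsing is needed.
    """
    return sorted(items, key=lambda item: item.properties["datetime"])
```

This lets you drop sortby from catalog.search(...) entirely and still get chronologically ordered results.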