scottyhq closed this issue 1 year ago
Thanks for the report. Confirmed that I'm also seeing duplicates. We'll take a look.
I'll just add that perhaps I got the issue title wrong and this is restricted to certain STAC items in 2019, because we also noticed that if you omit the datetime keyword altogether, no duplicates are returned.
This should also be resolved by https://github.com/stac-utils/pgstac/pull/154, and rolled out to the Planetary Computer in the next week.
This has also been resolved in the latest release:
import pystac_client
import planetary_computer
import geopandas as gpd

stac_client = pystac_client.Client.open(
    'https://planetarycomputer.microsoft.com/api/stac/v1',
    modifier=planetary_computer.sign_inplace,  # new!
)

search = stac_client.search(
    collections=['sentinel-1-rtc'],
    bbox=[-121.68811311, 46.83262832, -120.19715332, 47.84592105],
    datetime='2019-01-01/2019-12-31',
    limit=100,  # default=100, max=1000
)

items = search.get_all_items()
print(f'Returned {len(items)} items')

gf = gpd.GeoDataFrame.from_features(items.to_dict(), crs='EPSG:4326')
gf['stac_ids'] = [item.id for item in items]
with gpd.pd.option_context('display.max_colwidth', None):
    display(gf[gf.stac_ids.duplicated()]['stac_ids'])

# Returned 348 items
# Series([], Name: stac_ids, dtype: object)
Is it possible that the issue is still there?
The following is a reproducible example (tested several times with the same result). There are duplicates with:
import pystac_client
import planetary_computer
import pandas as pd

catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1",
    modifier=planetary_computer.sign_inplace,
)

bbox = [6.542560884238387, 47.90437963762056, 6.554888159664198, 47.909116654035785]
datetime = '2020-01-01/2025-12-31'
cloud_nb = 30

search = catalog.search(
    collections=["sentinel-2-l2a"],
    bbox=bbox,
    datetime=datetime,
    query={"eo:cloud_cover": {"lt": cloud_nb}},
    sortby="datetime",
)
coll = search.item_collection()
print(pd.Series([i.id for i in coll]).duplicated().any())
# True
But not with:
search = catalog.search(
    collections=["sentinel-2-l2a"],
    bbox=bbox,
    datetime=datetime,
    query={"eo:cloud_cover": {"lt": cloud_nb}},
    sortby="datetime",
    limit=1000,
)
coll = search.item_collection()
print(pd.Series([i.id for i in coll]).duplicated().any())
# False
The bug causing this specific issue should have been fixed. But it's possible you're running into a similar issue, which was reported at https://github.com/microsoft/PlanetaryComputer/issues/301 and is still unresolved.
You are right: commenting out the sortby="datetime" argument returns no duplicates.
It seems that a Planetary Computer STAC API search returns duplicated items when paging is necessary. The default search setting is 100 items per page, and it appears that subsequent pages include a repeat item. Since the paging limit can only be increased to 1000, a workaround is to filter the duplicates out of the search results.
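A minimal sketch of that workaround: deduplicate the returned items by STAC id, keeping the first occurrence and preserving order. The `Item` namedtuple below is a hypothetical stand-in for the pystac Items you would actually get from `search.item_collection()`.

```python
from collections import namedtuple

# Hypothetical stand-in for pystac.Item; in practice, pass the
# items returned by search.item_collection() to dedupe_by_id().
Item = namedtuple("Item", ["id"])

def dedupe_by_id(items):
    """Drop repeated STAC items, keeping the first occurrence in order."""
    seen = set()
    unique = []
    for item in items:
        if item.id not in seen:  # only keep ids we haven't seen yet
            seen.add(item.id)
            unique.append(item)
    return unique

# Simulated search results with a repeated item across pages.
items = [Item("S2A_001"), Item("S2B_002"), Item("S2A_001"), Item("S2B_003")]
unique_items = dedupe_by_id(items)
print([i.id for i in unique_items])  # ['S2A_001', 'S2B_002', 'S2B_003']
```

Because insertion order is preserved, this keeps whatever sort order the API returned while dropping the page-boundary repeats.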