microsoft / PlanetaryComputer

Issues, discussions, and information about the Microsoft Planetary Computer
https://planetarycomputer.microsoft.com/
MIT License

modis-43A4-061 search not honoring datetime #335

Open scottyhq opened 6 months ago

scottyhq commented 6 months ago

I expect this search to only return acquisitions from 2001-01-01 (299 according to https://e4ftl01.cr.usgs.gov/MOTA/MCD43A4.061/2001.01.01/)

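import geopandas as gpd
import planetary_computer
import pystac_client

# Open the Planetary Computer STAC API
catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1",
    modifier=planetary_computer.sign_inplace,
)

search = catalog.search(
    collections=["modis-43A4-061"],
    datetime='2001-01-01'
)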
items = search.item_collection()
len(items) # 4847
gf = gpd.GeoDataFrame.from_features(items.to_dict(), crs="epsg:4326")
print(gf.datetime.unique())
 array(['2001-01-09T00:00:00Z', '2001-01-08T00:00:00Z',
       '2001-01-03T23:59:59.999500Z', '2001-01-07T00:00:00Z',
       '2001-01-02T23:59:59.999500Z', '2001-01-06T00:00:00Z',
       '2001-01-01T23:59:59.999500Z', '2001-01-05T00:00:00Z',
       '2000-12-31T23:59:59.999500Z', '2001-01-04T00:00:00Z',
       '2000-12-30T23:59:59.999500Z', '2001-01-03T00:00:00Z',
       '2000-12-29T23:59:59.999500Z', '2001-01-02T00:00:00Z',
       '2000-12-28T23:59:59.999500Z', '2001-01-01T00:00:00Z'],
      dtype=object)

It's strange that such a wide range of dates is returned. I'm guessing there might be duplicate items from different collection updates for a single date ('2001-01-01T23:59:59.999500Z' vs '2001-01-01T00:00:00Z'), but I also don't know why this search appears to cover about +/- 1 week around the specified date...

TomAugspurger commented 6 months ago

I wonder if this is because the items have start and end datetimes?

I think the API is picking up any items whose time range covers your datetime (https://github.com/stac-utils/pgstac/issues/5), while https://e4ftl01.cr.usgs.gov/MOTA/MCD43A4.061/2001.01.01/ is giving just the items whose end_datetime equals 2001.01.01 (or maybe it's the start_datetime)?

In [18]: search = catalog.search(
    ...:     collections=["modis-43A4-061"],
    ...:     datetime='2001-01-01',
    ...:     query={"end_datetime": {"eq": "2001-01-16T23:59:59.999999Z"}}
    ...: )
    ...: items2 = search.item_collection()

In [19]: len(items2)
Out[19]: 311

I'm not sure why it's 311 instead of 299, but that's at least much closer.
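
To make the suspected matching rule concrete, here is a minimal sketch of an interval-overlap test. It is not the pgstac implementation; the function names are hypothetical, and it assumes the date '2001-01-01' is expanded by the client to the full day 2001-01-01T00:00:00Z/2001-01-01T23:59:59Z and that each item spans a 16-day start_datetime/end_datetime window.

```python
from datetime import datetime


def parse_utc(ts):
    # Replace the trailing "Z" so datetime.fromisoformat accepts it on older Pythons.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def overlaps(item_start, item_end, query_start, query_end):
    # Two closed intervals overlap when each one starts before the other ends.
    return (parse_utc(item_start) <= parse_utc(query_end)
            and parse_utc(item_end) >= parse_utc(query_start))


# Assumed expansion of datetime='2001-01-01' to a full-day interval.
QUERY = ("2001-01-01T00:00:00Z", "2001-01-01T23:59:59Z")

# Hypothetical 16-day windows: one covering the query day, one starting on it,
# and one starting the day after.
centered = overlaps("2000-12-24T00:00:00.000000Z", "2001-01-08T23:59:59.999999Z", *QUERY)
later = overlaps("2001-01-01T00:00:00.000000Z", "2001-01-16T23:59:59.999999Z", *QUERY)
outside = overlaps("2001-01-02T00:00:00.000000Z", "2001-01-17T23:59:59.999999Z", *QUERY)
print(centered, later, outside)
```

Under that rule, every item whose window touches the requested day would match, which would explain results spread over roughly two weeks of nominal dates.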

scottyhq commented 6 months ago

Thanks for looking @TomAugspurger! In my haste I didn't fully consult the user guide https://www.umb.edu/spectralmass/v006/mcd43a4-nbar-product/ which does clearly state:

Unlike the earlier reprocessed versions (where the date of the product signifies the first day of the retrieval period), and the Direct Broadcast version (where the date signifies the last day of the retrieval period), the date associated with each daily V006 and V006.1 retrieval is the center of the moving 16 day input window.

But I still find the API behavior counterintuitive. If datetime, start_datetime, and end_datetime all exist in the metadata, I'd expect a query on datetime to consider only that field. A workaround to home in on a nominal date is to fully specify a +/- 8 day window and query on both start and end:

import pandas as pd

date = pd.to_datetime('2001-01-01')
start = (date - pd.Timedelta(days=8)).isoformat(timespec='microseconds')+'Z'
end = (date + pd.Timedelta(days=8) - pd.Timedelta(seconds=1)).isoformat()+'.999999Z'
print(start, end) 
# 2000-12-24T00:00:00.000000Z 2001-01-08T23:59:59.999999Z

search = catalog.search(
    collections=["modis-43A4-061"],
    query={"start_datetime": {"eq": start},
           "end_datetime":   {"eq": end},
          },
)

items = search.item_collection()
print(len(items))
# 299

gf = gpd.GeoDataFrame.from_features(items.to_dict(), crs="epsg:4326")
print(gf.datetime.unique())
# ['2001-01-01T00:00:00Z']

TomAugspurger commented 5 months ago

It does make it a bit awkward to specify the exact search you want :/ I believe this behavior comes from the STAC API spec though, so not much we can do about it.
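
If the server-side behavior can't change, another option is to filter on the nominal datetime property client-side after the search. The following is only a sketch: filter_by_nominal_date is a hypothetical helper, the sample features are made up for illustration, and it assumes a plain string-prefix match on properties.datetime is good enough for picking out one calendar day.

```python
def filter_by_nominal_date(features, date):
    """Keep GeoJSON features whose properties.datetime starts with YYYY-MM-DD."""
    return [f for f in features if f["properties"]["datetime"].startswith(date)]


# Made-up features mimicking the datetime values seen earlier in this thread.
sample_features = [
    {"id": "a", "properties": {"datetime": "2001-01-01T00:00:00Z"}},
    {"id": "b", "properties": {"datetime": "2001-01-09T00:00:00Z"}},
    {"id": "c", "properties": {"datetime": "2000-12-31T23:59:59.999500Z"}},
    {"id": "d", "properties": {"datetime": "2001-01-01T23:59:59.999500Z"}},
]

selected = filter_by_nominal_date(sample_features, "2001-01-01")
print([f["id"] for f in selected])
```

With real results, the features list would come from something like items.to_dict()["features"]. This still downloads every item the API matched (4847 in the original search), so the start_datetime/end_datetime query above is likely cheaper when the window is known.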