gjoseph92 / stackstac

Turn a STAC catalog into a dask-based xarray
https://stackstac.readthedocs.io
MIT License
232 stars 49 forks source link

Stack Contains NaT in Time Field #235

Open J6767 opened 7 months ago

J6767 commented 7 months ago

I am having a problem with missing time values in my stack. Reproduce as follows:

from pystac_client import Client
import stackstac

def pystac_query(max_items, time_range, lon, lat, max_cloud):
    client = Client.open("https://earth-search.aws.element84.com/v1")

    query = client.search(
        max_items=max_items,
        collections=['sentinel-2-l1c'],
        query={"eo:cloud_cover": {"lt": max_cloud}},
        datetime = time_range,
        intersects=dict(type="Point", coordinates=[lon, lat]),
    )
    items = query.item_collection()
    return query, items

lat = 31.77877
lon = 5.99571
time_range =  '2022-08-04/2022-09-06'
query, items = pystac_query(99, time_range, lon, lat, 99)

The query contains 28 items.

for i in range(len(items)):
    t = items[i].properties["datetime"]

All of the items have times. All times are 27 char strings, apart from "2022-09-03T10:32:05Z" - is this the source of the bug?

However, when I stack them:

stack = stackstac.stack(items, epsg=4326)

...one of the items has its time replaced with a NaT.

stack.time()

array(['2022-08-04T10:32:05.634000000', '2022-08-04T10:32:08.099000000',
       '2022-08-06T10:22:16.627000000', '2022-08-06T10:22:18.455000000',
       '2022-08-09T10:32:12.888000000', '2022-08-09T10:32:15.351000000',
       '2022-08-11T10:22:08.959000000', '2022-08-11T10:22:10.785000000',
       '2022-08-14T10:32:04.138000000', '2022-08-14T10:32:06.617000000',
       '2022-08-16T10:22:17.889000000', '2022-08-16T10:22:19.720000000',
       '2022-08-19T10:32:13.767000000', '2022-08-19T10:32:16.217000000',
       '2022-08-21T10:22:06.588000000', '2022-08-21T10:22:08.412000000',
       '2022-08-24T10:32:01.411000000', '2022-08-24T10:32:03.910000000',
       '2022-08-26T10:22:20.430000000', '2022-08-26T10:22:21.453000000',
       '2022-08-29T10:32:13.577000000', '2022-08-29T10:32:16.024000000',
       '2022-08-31T10:22:06.328000000', '2022-08-31T10:22:08.153000000',
       '2022-09-03T10:32:02.517000000',                           'NaT',
       '2022-09-05T10:22:19.490000000', '2022-09-05T10:22:20.519000000'],
      dtype='datetime64[ns]')

This is causing an error when I try to crop the stack to my AOI. What is a suitable workaround for this? Many thanks.

clausmichele commented 7 months ago

You can drop them before trying to crop to the AOI like this:

https://github.com/Open-EO/openeo-processes-dask/blob/8d9e3fb009676a39c3ce5009bc1c09909e5bbfb3/openeo_processes_dask/process_implementations/cubes/_filter.py#L82

gjoseph92 commented 7 months ago

@clausmichele is right; a workaround for now is to drop the missing item. Or you could manually fix the input datetime properties to all be in the same format.

However, I consider this a stackstac bug. The problem indeed is with that one datetime near the end that doesn't have fractional seconds:

 '2022-09-03T10:32:02.517000Z',
 '2022-09-03T10:32:05Z',
 '2022-09-05T10:22:19.490000Z',
 '2022-09-05T10:22:20.519000Z']

The STAC spec just says that datetimes should be formatted according to RFC 3339, section 5.6. We can see there that time-secfrac is an optional part of partial-time, so this is still a valid datetime and should be parsed correctly.

I found that the issue seems to be related to the errors="coerce" argument to pd.to_datetime here, along with infer_datetime_format: https://github.com/gjoseph92/stackstac/blob/694c68646a685abc8f3455d6b28fe121e18ae775/stackstac/prepare.py#L408-L412

Weirdly, with pandas 1.3.5, using errors='raise' doesn't raise any error, but seems to parse correctly:

>>> ts = sorted(item.properties["datetime"] for item in items)

>>> pd.to_datetime(ts, infer_datetime_format=True, errors='raise')
DatetimeIndex(['2022-08-04 10:32:05.634000+00:00',
               ...
               '2022-09-03 10:32:02.517000+00:00',
                      '2022-09-03 10:32:05+00:00',
               '2022-09-05 10:22:19.490000+00:00',
               '2022-09-05 10:22:20.519000+00:00'],
              dtype='datetime64[ns, UTC]', freq=None)

>>> pd.to_datetime(ts, infer_datetime_format=True, errors='coerce')
DatetimeIndex(['2022-08-04 10:32:05.634000', '2022-08-04 10:32:08.099000',
               '2022-08-06 10:22:16.627000', '2022-08-06 10:22:18.455000',
               '2022-08-09 10:32:12.888000', '2022-08-09 10:32:15.351000',
               '2022-08-11 10:22:08.959000', '2022-08-11 10:22:10.785000',
               '2022-08-14 10:32:04.138000', '2022-08-14 10:32:06.617000',
               '2022-08-16 10:22:17.889000', '2022-08-16 10:22:19.720000',
               '2022-08-19 10:32:13.767000', '2022-08-19 10:32:16.217000',
               '2022-08-21 10:22:06.588000', '2022-08-21 10:22:08.412000',
               '2022-08-24 10:32:01.411000', '2022-08-24 10:32:03.910000',
               '2022-08-26 10:22:20.430000', '2022-08-26 10:22:21.453000',
               '2022-08-29 10:32:13.577000', '2022-08-29 10:32:16.024000',
               '2022-08-31 10:22:06.328000', '2022-08-31 10:22:08.153000',
               '2022-09-03 10:32:02.517000',                        'NaT',
               '2022-09-05 10:22:19.490000', '2022-09-05 10:22:20.519000'],
              dtype='datetime64[ns]', freq=None)

However, after switching to pandas 2, we do get a helpful and informative error:

>>> pd.__version__
2.0.3

>>> In [7]: pd.to_datetime(ts, errors='raise')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-7-0ac50347e0b9> in <cell line: 1>()
----> 1 pd.to_datetime(ts, errors='raise')
...
ValueError: time data "2022-09-03T10:32:05Z" doesn't match format "%Y-%m-%dT%H:%M:%S.%f%z", at position 25. You might want to try:
    - passing `format` if your strings have a consistent format;
    - passing `format='ISO8601'` if your strings are all ISO8601 but not necessarily in exactly the same format;
    - passing `format='mixed'`, and the format will be inferred for each element individually. You might want to use `dayfirst` alongside this.

This makes sense: there are multiple datetime formats present, so infer_datetime_format (which became True by default in pandas 2) rightfully complains about this. The fact that it didn't raise an error in older versions was probably a bug.

Given that the STAC spec indicates all datetimes should be a subset of ISO 8601, format='ISO8601' seems like just what we want:

>>> pd.to_datetime(ts, errors='raise', format='ISO8601')
DatetimeIndex(['2022-08-04 10:32:05.634000+00:00',
               ...
               '2022-08-31 10:22:08.153000+00:00',
               '2022-09-03 10:32:02.517000+00:00',
                      '2022-09-03 10:32:05+00:00',
               '2022-09-05 10:22:19.490000+00:00',
               '2022-09-05 10:22:20.519000+00:00'],
              dtype='datetime64[ns, UTC]', freq=None)

format='ISO8601' is only available in pandas 2, so that's a compelling reason to require pandas 2, xref https://github.com/gjoseph92/stackstac/issues/213. (https://github.com/pandas-dev/pandas/issues/41047 is also fixed in newer versions of pandas, allowing us to remove some old code for https://github.com/gjoseph92/stackstac/issues/2).

I think I'll probably fix this by using format='ISO8601' and requiring pandas 2.

J6767 commented 7 months ago

Thanks guys, that worked great. stack = stack.where(~np.isnat(stack.time), drop=True)

gjoseph92 commented 7 months ago

Glad that works for you, but dropping data doesn't seem like an ideal workaround to me, so I'll reopen this until the underlying problem is fixed.

ZZMitch commented 2 months ago

To provide a bit more context to this bug, I have been using stackstac to document HLS (Landsat 8/9 + Sentinel-2) clear observation availability over Canada and came across these NaT time-steps.

My "fix" has been to drop the NaT time-steps in a similar manner as demonstrated here, and I have documenting how prevalent they are across the HLS archive. Originally, I thought this was a metadata issue within the HLS archive, and only recently realized this was a stackstac bug, otherwise I would have fixed the missing dates, rather than removing them.

For HLS, this impacts Sentinel-2 more than Landsat. Across the couple years of imagery I have documented so far (2018 - 2020), it impacts <0.1% of Landsat images and 0.2 - 1% of Sentinel-2 images depending on the year. It seems to be more prevalent in 2018 for S30 (1%), dropping to 0.2% by 2020.

Most of the time, the NaT time-steps are a couple images here and there (e.g., <5 in a year looking at all available imagery). However, I have noticed a few cases where almost every time-step in a year is NaT.

Here is a basic reproducible example of this case:

lon, lat = -91.4888563, 48.8885808 # All but first S30 will be NaT datetime
#lon, lat = -89.4888563, 48.8885808 # No NaT datetime

start = '2019-01-01' # In 2020 or starting on 01-03, no issue
end = '2019-12-31'

import os
import pystac_client as pc

gdalEnv = ss.DEFAULT_GDAL_ENV.updated(dict(
    GDAL_DISABLE_READDIR_ON_OPEN = 'TRUE',
    GDAL_HTTP_COOKIEFILE = os.path.expanduser('~/cookies.txt'),
    GDAL_HTTP_COOKIEJAR = os.path.expanduser('~/cookies.txt'),
    GDAL_HTTP_MAX_RETRY = 10,
    GDAL_HTTP_RETRY_DELAY = 15))

catalog = pc.Client.open('https://cmr.earthdata.nasa.gov/stac/LPCLOUD')

itemsS30 = catalog.search(intersects = dict(type = 'Point', coordinates = [lon, lat]), datetime = f'{start}/{end}', collections = ['HLSS30.v2.0'], limit = 100).item_collection()
#itemsS30 # Can see datetime values in all images

import stackstac as ss

s30 = ss.stack(itemsS30, assets = ['Fmask'], epsg = 32615, resolution = 30, gdal_env = gdalEnv)
#s30 # Will see #NaT date-times here

s30.time.values # All but first are NaT

In this case, the first datetime in the stack is '2019-01-02T17:10:02.000000000'. It appears that since this date time does not have fractional seconds, and it is the first datetime in the stack, all the other datetimes are considered invalid and become NaT. Obviously a much more problematic result than having just one time here and there becoming NaT! These situations are pretty rare (I have noticed it only 3-4 times across the few years of work I have done over Canada so far) - but wanted to mention here now that I know it is a stackstac issue rather than an HLS metadata issue.

When doing NRT work or phenology, for-example, every image counts! So good to document and hopefully fix these edge cases.

martlaf commented 3 weeks ago

Agreed that it should be stackstac's responsibility to parse correctly, in stackstac.prepare.to_coords using e.g. pd.to_datetime(_, format="mixed") rather than pd.to_datetime(_, infer_datetime_format=True) (which raises a deprecation warning anyway).

In the meantime, using pystac.ItemCollection and pystac.Item, when stackstac.stac_types.items_to_plain calls item.to_dict() on it, the item will do self.properties["datetime"] = datetime_to_str(self.datetime). Since this function passes through the datetime.isoformat(timespec=) argument, I have temporarily patched it in my code with the following line before calling stackstac.stack():

 pystac.item.datetime_to_str = functools.partial(pystac.item.datetime_to_str, timespec="microseconds")

This does ensure all dates are consistently formatted and I get my expected xr.DataArray