gjoseph92 / stackstac

Turn a STAC catalog into a dask-based xarray
https://stackstac.readthedocs.io
MIT License
232 stars 49 forks source link

Issues with numpy 1.24.0 #194

Closed julianblue closed 1 year ago

julianblue commented 1 year ago

Hi @gjoseph92 , I am getting an error when trying to stack data when the latest numpy version 1.24.0 is installed. All works fine with 1.23.5.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[48], line 1
----> 1 stack = stackstac.stack(itms.get_all_items(), resolution=10)

File ~/anaconda3/envs/ag-ml/lib/python3.8/site-packages/stackstac/stack.py:311, in stack(items, assets, epsg, resolution, bounds, bounds_latlon, snap_bounds, resampling, chunksize, dtype, fill_value, rescale, sortby_date, xy_coords, properties, band_coords, gdal_env, errors_as_nodata, reader)
    287 asset_table, spec, asset_ids, plain_items = prepare_items(
    288     plain_items,
    289     assets=assets,
   (...)
    294     snap_bounds=snap_bounds,
    295 )
    296 arr = items_to_dask(
    297     asset_table,
    298     spec,
   (...)
    306     errors_as_nodata=errors_as_nodata,
    307 )
    309 return xr.DataArray(
    310     arr,
--> 311     *to_coords(
    312         plain_items,
    313         asset_ids,
    314         spec,
    315         xy_coords=xy_coords,
    316         properties=properties,
    317         band_coords=band_coords,
    318     ),
    319     attrs=to_attrs(spec),
    320     name="stackstac-" + dask.base.tokenize(arr),
    321 )

File ~/anaconda3/envs/ag-ml/lib/python3.8/site-packages/stackstac/prepare.py:500, in to_coords(items, asset_ids, spec, xy_coords, properties, band_coords)
    496     except KeyError:
    497         pass
    499 coords.update(
--> 500     accumulate_metadata.metadata_to_coords(
    501         flattened_metadata_by_asset,
    502         "band",
    503         skip_fields={"href"},
    504         # skip_fields={"href", "title", "description", "type", "roles"},
    505     )
    506 )
    507 if any(eo_by_asset):
    508     coords.update(
    509         accumulate_metadata.metadata_to_coords(
    510             eo_by_asset,
   (...)
    513         )
    514     )

File ~/anaconda3/envs/ag-ml/lib/python3.8/site-packages/stackstac/accumulate_metadata.py:29, in metadata_to_coords(items, dim_name, fields, skip_fields)
     23 def metadata_to_coords(
     24     items: Iterable[Mapping[str, object]],
     25     dim_name: str,
     26     fields: Union[str, Sequence[str], Literal[True]] = True,
     27     skip_fields: Container[str] = (),
     28 ) -> Dict[str, xr.Variable]:
---> 29     return dict_to_coords(
     30         accumulate_metadata(
     31             items,
     32             fields=[fields] if isinstance(fields, str) else fields,
     33             skip_fields=skip_fields,
     34         ),
     35         dim_name,
     36     )

File ~/anaconda3/envs/ag-ml/lib/python3.8/site-packages/stackstac/accumulate_metadata.py:168, in dict_to_coords(metadata, dim_name)
    163     except TypeError:
    164         # if it's not set-able, just give up
    165         break
    167 props_arr = np.squeeze(
--> 168     np.array(
    169         props,
    170         # Avoid DeprecationWarning creating ragged arrays when elements are lists/tuples of different lengths
    171         dtype="object"
    172         if (
    173             isinstance(props, _ourlist)
    174             and len(set(len(x) for x in props if isinstance(x, (list, tuple))))
    175             > 1
    176         )
    177         else None,
    178     )
    179 )
    181 if (
    182     props_arr.ndim > 1
    183     or props_arr.ndim == 1
   (...)
    187     # our "bands", "y", and "x" dimensions, and xarray won't let us use unrelated
    188     # dimensions. so just skip it for now.
    189     continue

ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (17,) + inhomogeneous part.
gjoseph92 commented 1 year ago

Thanks for reporting @julianblue. This maybe seems like https://github.com/numpy/numpy/pull/22004:

Ragged array creation will now always raise a ValueError unless dtype=object is passed. This includes very deeply nested sequences. https://numpy.org/doc/stable/release/1.24.0-notes.html

However I'm confused, since we do already pass dtype='object' (note quotes; I hope removing them doesn't fix this, but I guess it's worth trying). I added that to avoid a DeprecationWarning from NumPy, so I'm a little surprised this has suddenly broken with no warning.

I won't be able to work on this for a few weeks. For now, pinning NumPy is a fine workaround; ideally we'd make a release doing that. If anyone wants to take a look at this sooner, please post what you find here.

TomAugspurger commented 1 year ago

Do you have a reproducible example @julianblue? I can't reproduce using stackstac main and this simple example:

In [17]: import stackstac, pystac_client

In [18]: client = pystac_client.Client.open("http://planetarycomputer.microsoft.com/api/stac/v1/")

In [19]: ic = client.search(collections="sentinel-2-l2a", max_items=2).item_collection()

In [20]: stackstac.stack(ic, assets=["B02", "B04"])
Out[20]:
<xarray.DataArray 'stackstac-eb5fcb5108bcbbdc7f4efca13682f66b' (time: 2,
                                                                band: 2,
                                                                y: 30978,
                                                                x: 20982)>
dask.array<fetch_raster_window, shape=(2, 2, 30978, 20982), dtype=float64, chunksize=(1, 1, 1024, 1024), chunktype=numpy.ndarray>
Coordinates: (12/44)
  * time                                     (time) datetime64[ns] 2022-12-24...
    id                                       (time) <U54 'S2B_MSIL2A_20221224...
  * band                                     (band) <U3 'B02' 'B04'
  * x                                        (x) float64 5e+05 ... 7.098e+05
  * y                                        (y) float64 2e+06 ... 1.69e+06
    s2:not_vegetated_percentage              (time) float64 0.3278 0.0
    ...                                       ...
    gsd                                      float64 10.0
    proj:shape                               object {10980}
    common_name                              (band) <U4 'blue' 'red'
    center_wavelength                        (band) float64 0.49 0.665
    full_width_half_max                      (band) float64 0.098 0.038
    epsg                                     int64 32719
Attributes:
    spec:        RasterSpec(epsg=32719, bounds=(499980.0, 1690240.0, 709800.0...
    crs:         epsg:32719
    transform:   | 10.00, 0.00, 499980.00|\n| 0.00,-10.00, 2000020.00|\n| 0.0...
    resolution:  10.0
julianblue commented 1 year ago

@TomAugspurger I did not specify the assets to pull all available bands. Specifying the assets does NOT cause the error to be raised @gjoseph92 .

stack = stackstac.stack(itm_coll, resolution=10)
julianblue commented 1 year ago

@TomAugspurger , @gjoseph92 ,

the code fails in the metadata conversion for proj:transform (not present in the preview band)

 [[10.0, 0.0, 399960.0, 0.0, -10.0, 5700000.0], [60.0, 0.0, 399960.0, 0.0, -60.0, 5700000.0], [10.0, 0.0, 399960.0, 0.0, -10.0, 5700000.0], [10.0, 0.0, 399960.0, 0.0, -10.0, 5700000.0], [10.0, 0.0, 399960.0, 0.0, -10.0, 5700000.0], [20.0, 0.0, 399960.0, 0.0, -20.0, 5700000.0], [20.0, 0.0, 399960.0, 0.0, -20.0, 5700000.0], [20.0, 0.0, 399960.0, 0.0, -20.0, 5700000.0], [10.0, 0.0, 399960.0, 0.0, -10.0, 5700000.0], [60.0, 0.0, 399960.0, 0.0, -60.0, 5700000.0], [20.0, 0.0, 399960.0, 0.0, -20.0, 5700000.0], [20.0, 0.0, 399960.0, 0.0, -20.0, 5700000.0], [20.0, 0.0, 399960.0, 0.0, -20.0, 5700000.0], [20.0, 0.0, 399960.0, 0.0, -20.0, 5700000.0], [10.0, 0.0, 399960.0, 0.0, -10.0, 5700000.0], [10.0, 0.0, 399960.0, 0.0, -10.0, 5700000.0], None]

The none in the list should ideally not exist for raster band data and should be a transform like the other bands?

This causes

props_arr = np.squeeze(
                np.array(
                    props,
                    # Avoid DeprecationWarning creating ragged arrays when elements are lists/tuples of different lengths
                    dtype="object"
                    if (
                        isinstance(props, _ourlist)
                        and len(set(len(x) for x in props if isinstance(x, (list, tuple))))
                        > 1
                    )
                    else None,
                )
            )

to set dtype=None and raises the error. Shouldn't the dtype in that case be set to "object" as well instead of none?

gjoseph92 commented 1 year ago

@julianblue can you give https://github.com/gjoseph92/stackstac/pull/195 a try? I don't have a computer available, so just did this on my phone and can't test it out. I feel like something along these lines might work (not sure if this is exactly the thing to do, though).

julianblue commented 1 year ago

Hi @gjoseph92 ,

195 fixed the issue for me.

hrodmn commented 1 year ago

I hit this issue today on stackstac==0.4.3. It is happening when I use the query extension to filter out Landsat 7 items from the landsat-c2-l2 collection in the Planetary Computer STAC. Other operations that use the query arg (e.g. filtering on eo:cloud_cover) do not cause the failure.

Here is a reprex:

import planetary_computer
import pystac_client
import stackstac

COLLECTION = "landsat-c2-l2"
BBOX = (-91.67481115470213, 47.86318699029263, -91.52471085746416, 47.93118792159696)
DATE_RANGE = ["2022-07-01", "2022-07-31"]
EPSG = 5070
RESOLUTION = 30
ASSETS = ["red", "green", "blue", "qa_pixel"]
BOUNDS_5070 = (326000, 2771000, 337000, 2778000)

planetary_computer_catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1",
    modifier=planetary_computer.sign_inplace,
)

# include all items
stac_items = planetary_computer_catalog.search(
    collections=[COLLECTION],
    bbox=BBOX,
    datetime=DATE_RANGE,
).item_collection()

print(f"there are {len(stac_items)} items if we do not apply any filters")

# works
landsat_stack = stackstac.stack(
    items=stac_items,
    assets=ASSETS,
    epsg=EPSG,
    resolution=RESOLUTION,
    bounds=BOUNDS_5070,
    xy_coords="center",
)
print(landsat_stack)

# filter out cloudy items
stac_items_less_cloudy = planetary_computer_catalog.search(
    collections=[COLLECTION],
    bbox=BBOX,
    datetime=DATE_RANGE,
    query={"eo:cloud_cover": {"lte": 50}},
).item_collection()

print(f"there are {len(stac_items_less_cloudy)} items if we exclude cloudy images")

# works
landsat_stack_filt = stackstac.stack(
    items=stac_items_less_cloudy,
    assets=ASSETS,
    epsg=EPSG,
    resolution=RESOLUTION,
    bounds=BOUNDS_5070,
    xy_coords="center",
)
print(landsat_stack_filt)

# exclude Landsat 7
stac_items_no_landsat_7 = planetary_computer_catalog.search(
    collections=[COLLECTION],
    bbox=BBOX,
    datetime=DATE_RANGE,
    query={"platform": {"neq": "landsat-7"}},
).item_collection()

print(f"there are {len(stac_items_no_landsat_7)} items if we exclude Landsat 7")

# fails with inhomogenous error
landsat_stack_filt = stackstac.stack(
    items=stac_items_no_landsat_7,
    assets=ASSETS,
    epsg=EPSG,
    resolution=RESOLUTION,
    bounds=BOUNDS_5070,
    xy_coords="center",
)
print(landsat_stack_filt)

Output:

there are 10 items if we include Landsat 7
<xarray.DataArray 'stackstac-5060345c8d58f7556b298868d9999b98' (time: 10,
                                                                band: 4,
                                                                y: 234, x: 368)>
dask.array<fetch_raster_window, shape=(10, 4, 234, 368), dtype=float64, chunksize=(1, 1, 234, 368), chunktype=numpy.ndarray>
Coordinates: (12/27)
  * time                         (time) datetime64[ns] 2022-07-01T15:28:21.11...
    id                           (time) <U31 'LE07_L2SP_027027_20220701_02_T1...
  * band                         (band) <U8 'red' 'green' 'blue' 'qa_pixel'
  * x                            (x) float64 3.26e+05 3.26e+05 ... 3.37e+05
  * y                            (y) float64 2.778e+06 2.778e+06 ... 2.771e+06
    platform                     (time) <U9 'landsat-7' ... 'landsat-9'
    ...                           ...
    view:off_nadir               int64 0
    landsat:cloud_cover_land     (time) float64 14.0 69.06 40.75 ... 100.0 97.19
    gsd                          int64 30
    raster:bands                 (band) object {'scale': 2.75e-05, 'nodata': ...
    title                        (band) <U29 'Red Band' ... 'Pixel Quality As...
    epsg                         int64 5070
Attributes:
    spec:        RasterSpec(epsg=5070, bounds=(325980, 2770980, 337020, 27780...
    crs:         epsg:5070
    transform:   | 30.00, 0.00, 325980.00|\n| 0.00,-30.00, 2778000.00|\n| 0.0...
    resolution:  30
there are 6 items if we exclude cloudy images
<xarray.DataArray 'stackstac-b859f39f5ea1c12c247ef26512f01dbb' (time: 6,
                                                                band: 4,
                                                                y: 234, x: 368)>
dask.array<fetch_raster_window, shape=(6, 4, 234, 368), dtype=float64, chunksize=(1, 1, 234, 368), chunktype=numpy.ndarray>
Coordinates: (12/27)
  * time                         (time) datetime64[ns] 2022-07-01T15:28:21.11...
    id                           (time) <U31 'LE07_L2SP_027027_20220701_02_T1...
  * band                         (band) <U8 'red' 'green' 'blue' 'qa_pixel'
  * x                            (x) float64 3.26e+05 3.26e+05 ... 3.37e+05
  * y                            (y) float64 2.778e+06 2.778e+06 ... 2.771e+06
    platform                     (time) <U9 'landsat-7' ... 'landsat-9'
    ...                           ...
    view:off_nadir               int64 0
    landsat:cloud_cover_land     (time) float64 14.0 40.75 15.43 14.0 2.41 3.36
    gsd                          int64 30
    raster:bands                 (band) object {'scale': 2.75e-05, 'nodata': ...
    title                        (band) <U29 'Red Band' ... 'Pixel Quality As...
    epsg                         int64 5070
Attributes:
    spec:        RasterSpec(epsg=5070, bounds=(325980, 2770980, 337020, 27780...
    crs:         epsg:5070
    transform:   | 30.00, 0.00, 325980.00|\n| 0.00,-30.00, 2778000.00|\n| 0.0...
    resolution:  30
there are 8 items if we exclude Landsat 7
Traceback (most recent call last):
  File "/home/henry/workspace/hrodmn.dev/posts/hls/stackstac_inhomogeneous.py", line 72, in <module>
    landsat_stack_filt = stackstac.stack(
  File "/home/henry/workspace/hrodmn.dev/env/lib/python3.10/site-packages/stackstac/stack.py", line 311, in stack
    *to_coords(
  File "/home/henry/workspace/hrodmn.dev/env/lib/python3.10/site-packages/stackstac/prepare.py", line 500, in to_coords
    accumulate_metadata.metadata_to_coords(
  File "/home/henry/workspace/hrodmn.dev/env/lib/python3.10/site-packages/stackstac/accumulate_metadata.py", line 29, in metadata_to_coords
    return dict_to_coords(
  File "/home/henry/workspace/hrodmn.dev/env/lib/python3.10/site-packages/stackstac/accumulate_metadata.py", line 168, in dict_to_coords
    np.array(
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (4,) + inhomogeneous part.
gjoseph92 commented 1 year ago

FYI @julianblue @hrodmn, I just released a version with this fixed (0.4.4). Sorry for the long wait.

hrodmn commented 1 year ago

Thanks @gjoseph92!