stac-utils / stac-geoparquet

Convert STAC items between JSON, GeoParquet, pgstac, and Delta Lake.
https://stac-utils.github.io/stac-geoparquet/
MIT License

Items with heterogeneous Asset keys are parsed incorrectly #82

Open scottyhq opened 1 week ago

scottyhq commented 1 week ago
import pystac_client  # 0.8.5
import stac_geoparquet  # 0.6.0
import geopandas as gpd

client = pystac_client.Client.open(url='https://cmr.earthdata.nasa.gov/stac/NSIDC_ECS')

results = client.search(
    collections=['ATL03_006'],
    bbox='-108.34, 38.823, -107.728, 39.19',
    datetime='2023',
    method='GET',
    max_items=5,
)
items = results.item_collection()
record_batch_reader = stac_geoparquet.arrow.parse_stac_items_to_arrow(items)
gf = gpd.GeoDataFrame.from_arrow(record_batch_reader)  
gf.assets.iloc[0]

The 'data' asset keys are different for these 5 items, and every item gets a copy of the other keys with None as a value:

{'03/ATL03_20230103090928_02111806_006_02': {'href': 'https://n5eil01u.ecs.nsidc.org/DP9/ATLAS/ATL03.006/2023.01.03/ATL03_20230103090928_02111806_006_02.h5',
  'roles': array(['data'], dtype=object),
  'title': 'Direct Download'},
 '05/ATL03_20230205073720_07141806_006_02': None,
 '06/ATL03_20230206192127_07371802_006_02': None,
 '06/ATL03_20230306061322_11561806_006_02': None,
 '08/ATL03_20230108204519_02951802_006_02': None,
 'browse': {'href': 'https://n5eil01u.ecs.nsidc.org/DP0/BRWS/Browse.001/2024.04.08/ATL03_20230103090928_02111806_006_02_BRW.h5.images.tide_pole.jpg',
  'roles': array(['browse'], dtype=object),
  'title': 'Download ATL03_20230103090928_02111806_006_02_BRW.h5.images.tide_pole.jpg',
  'type': 'image/jpeg'},

These None entries prevent going back from a dataframe to pystac items:

import pystac
batch = stac_geoparquet.arrow.stac_table_to_items(gf.to_arrow())
items = [pystac.Item.from_dict(x) for x in batch]

File ~/GitHub/uw-cryo/coincident/.pixi/envs/dev/lib/python3.12/site-packages/pystac/asset.py:199, in Asset.from_dict(cls, d)
    193 """Constructs an Asset from a dict.
    194 
    195 Returns:
    196     Asset: The Asset deserialized from the JSON dict.
    197 """
    198 d = copy(d)
--> 199 href = d.pop("href")
    200 media_type = d.pop("type", None)
    201 title = d.pop("title", None)

AttributeError: 'NoneType' object has no attribute 'pop'

kylebarron commented 1 week ago

In general this is a limitation of Parquet.

JSON has three states: a valid value, null, and a missing/undefined key. Because Parquet is columnar, the third option does not exist here. If one key exists for any item, the entire column for that key name is provisioned.
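
To make the loss of the "missing key" state concrete, here is a minimal pure-Python sketch. It only imitates the idea of a columnar layout; it is not how Arrow or Parquet are implemented, and the helper names `to_columns` and `to_records` as well as the shortened asset keys are made up for illustration.

```python
def to_columns(records):
    """Imitate columnar storage: one column per key seen in any record."""
    # Collect the union of keys in first-seen order.
    keys = []
    for record in records:
        for key in record:
            if key not in keys:
                keys.append(key)
    # Every record gets a slot in every column; absent keys become None.
    return {key: [record.get(key) for record in records] for key in keys}


def to_records(columns):
    """Read rows back out; 'missing' is no longer distinguishable from None."""
    n_rows = len(next(iter(columns.values()), []))
    return [
        {key: values[i] for key, values in columns.items()}
        for i in range(n_rows)
    ]


# Two items whose 'data' asset keys differ (hypothetical, shortened keys).
assets = [
    {"03/ATL03_a": {"href": "a.h5"}, "browse": {"href": "a.jpg"}},
    {"05/ATL03_b": {"href": "b.h5"}, "browse": {"href": "b.jpg"}},
]
round_tripped = to_records(to_columns(assets))
```

After the round trip, each row carries every key that appeared in any row, so the first row gains a `'05/ATL03_b': None` entry it never had.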

The default Arrow serialization emits None for null Arrow values. There was some discussion about this on a previous issue. Perhaps we could add a keyword parameter to stac_table_to_items to remove keys with None. But this would be difficult to do reliably, especially because None is required for some other keys: datetime, for example, must be None to signal that the item has a start/end datetime instead.
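
The datetime concern can be illustrated with a small sketch. The STAC Item spec requires a `datetime` key in `properties`, and allows it to be null when `start_datetime` and `end_datetime` are given. The `strip_none` helper and the example item below are hypothetical, written only to show what a blanket removal would do.

```python
def strip_none(value):
    """Recursively drop every dict key whose value is None."""
    if isinstance(value, dict):
        return {k: strip_none(v) for k, v in value.items() if v is not None}
    if isinstance(value, list):
        return [strip_none(v) for v in value]
    return value


# Hypothetical item dict: a null datetime paired with a start/end range,
# plus one placeholder asset of the kind the Arrow round trip introduces.
item = {
    "id": "example-item",
    "properties": {
        "datetime": None,
        "start_datetime": "2023-01-03T09:09:28Z",
        "end_datetime": "2023-01-03T09:17:58Z",
    },
    "assets": {
        "data": {"href": "https://example.com/data.h5"},
        "thumbnail": None,
    },
}

cleaned = strip_none(item)
# The placeholder asset is gone, but so is the required 'datetime' key.
```

A blanket filter cannot tell a meaningful null from a placeholder null, which is why any removal would need to be scoped to specific fields.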

scottyhq commented 1 week ago

Right, there is a similar issue for the case where an outlier Item in a collection is missing an asset common to the rest of the group, like 'thumbnail': https://github.com/stac-utils/stac-geoparquet/issues/77

> add a keyword parameter to stac_table_to_items to remove keys with None

Seems convenient. I wonder whether it would be feasible to apply this only to the 'assets' column, which would avoid the complications with datetime?
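
A minimal sketch of what an assets-only filter might look like when applied to plain item dicts. `drop_null_assets` is a hypothetical helper, not part of stac_geoparquet, and it assumes each item is an ordinary dict with an optional `assets` mapping.

```python
def drop_null_assets(item_dict):
    """Return a copy of a STAC item dict without None-valued assets.

    Hypothetical helper, not part of stac_geoparquet. Only the top-level
    'assets' mapping is filtered; 'properties' (including a null
    'datetime') is left exactly as it was.
    """
    cleaned = dict(item_dict)  # shallow copy, so the input is not mutated
    assets = item_dict.get("assets")
    if assets is not None:
        cleaned["assets"] = {
            key: asset for key, asset in assets.items() if asset is not None
        }
    return cleaned


# Hypothetical item dict resembling one row after the Arrow round trip.
example = {
    "id": "example-item",
    "properties": {"datetime": None, "start_datetime": "2023-01-03T09:09:28Z"},
    "assets": {
        "data": {"href": "https://example.com/data.h5"},
        "browse": None,
    },
}
fixed = drop_null_assets(example)
```

Under these assumptions, the helper could be applied to each dict yielded by stac_table_to_items before passing it to pystac.Item.from_dict, so that Asset.from_dict never receives None.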

My current quick workaround is to filter the assets column after loading into geopandas:

def filter_assets(assets):
    """Remove key: None pairs from an item's assets dict."""
    return {key: value for key, value in assets.items() if value is not None}

# Apply the filter row by row to the 'assets' column
gf['assets'] = gf['assets'].apply(filter_assets)