Open scottyhq opened 1 week ago
In general this is a limitation of Parquet.
JSON has three states: a valid value, null, and a missing/undefined key. Because Parquet is columnar, the third option does not exist here. If one key exists for any item, the entire column for that key name is provisioned.
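To make the effect concrete, here is a small pure-Python sketch of what happens when row-oriented records are pivoted into columns and back. It only illustrates the idea and is not how Parquet or Arrow are implemented; `to_columns` and `to_rows` are hypothetical helpers, and the asset names are made up.

```python
def to_columns(records):
    """Pivot a list of dicts into {key: [one value per record]}.

    Every key seen in any record becomes a full-length column, so a
    record that lacks the key gets None in that slot.
    """
    keys = []
    for record in records:
        for key in record:
            if key not in keys:
                keys.append(key)
    return {key: [record.get(key) for record in records] for key in keys}


def to_rows(columns):
    """Pivot the columns back into one dict per record."""
    length = len(next(iter(columns.values()), []))
    return [{key: values[i] for key, values in columns.items()} for i in range(length)]


# The first record has no "B02" key; the second has an explicit null.
records = [{"B01": "a.tif"}, {"B02": "b.tif", "thumbnail": None}]
round_tripped = to_rows(to_columns(records))
print(round_tripped)
```

After the round trip both records carry all three keys, and the key that was missing can no longer be told apart from the one that was explicitly null.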
The default Arrow serialization emits None for null Arrow values. There was some discussion about this on an issue previously. Perhaps we could add a keyword parameter to stac_table_to_items to remove keys with None. But this would be difficult to do reliably, especially when None is required for some other keys, like datetime, to mean that it has start/end datetime instead.
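As a rough sketch of what such an option might do, the following post-processes a plain item dict and drops None-valued keys while protecting a list of key names. This is not part of stac-geoparquet: `drop_none_keys` and its `keep` parameter are hypothetical, and the sample item is made up. It assumes the items are ordinary Python dicts at the point where it runs.

```python
def drop_none_keys(value, keep=("datetime",)):
    """Recursively drop dict keys whose value is None.

    Keys named in `keep` are preserved even when None, so a null
    "datetime" survives. None entries inside lists are left alone.
    """
    if isinstance(value, dict):
        return {
            key: drop_none_keys(child, keep)
            for key, child in value.items()
            if child is not None or key in keep
        }
    if isinstance(value, list):
        return [drop_none_keys(child, keep) for child in value]
    return value


item = {
    "id": "example",
    "properties": {
        "datetime": None,
        "start_datetime": "2020-01-01T00:00:00Z",
        "end_datetime": "2020-01-02T00:00:00Z",
    },
    "assets": {"data": {"href": "a.tif", "title": None}, "thumbnail": None},
}
cleaned = drop_none_keys(item)
```

The name-based `keep` list shows why this is hard to do reliably: it protects any key called "datetime" at any depth, and every other key where null is meaningful would have to be listed by hand.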
Right, there is a similar issue for when an outlier Item in a collection is missing an asset common to the group, like 'thumbnail': https://github.com/stac-utils/stac-geoparquet/issues/77.
> add a keyword parameter to stac_table_to_items to remove keys with None
Seems convenient. I wonder whether it would be too tricky to apply it only to the 'assets' column, to avoid complications with datetime?
My quick workaround currently is to just filter the assets column after loading to geopandas:
def filter_assets(assets):
    """Remove key: None pairs from assets."""
    return {key: value for key, value in assets.items() if value is not None}

gf['assets'] = gf['assets'].apply(filter_assets)
The 'data' asset keys are different for these 5 items, and every item gets a copy of the other keys with None as a value.
These None entries prevent going back from a dataframe to pystac items: