STAC Interoperability with Arrow

kylebarron commented 5 months ago

This is a cleanup of https://github.com/stac-utils/stac-geoparquet/pull/27, which implemented a work-in-progress converter to and from Arrow memory and was originally written during the STAC sprint.

Change list

- Adds new functions that parse STAC Items from dicts or from a newline-delimited JSON file into an Arrow table.
  - Supports an optional schema argument for advanced users who know the schema of their STAC items. Note that this schema is applied after conversion to WKB but before any other conversions.
- Adds new functions that convert the Arrow table back to dicts or to a newline-delimited JSON file.
- The Arrow table stores geometries as WKB, which makes it easy to hold STAC Items with differing geometry types.
- Converts the bbox column to a struct-type column to align with GeoParquet 1.1.

This approach may be preferred in some cases. It should be more memory efficient than the existing pandas approach, it's minimally manual (basically we offload all schema inference into the pa.array constructor), and it enforces strict schemas via the inferred Arrow schema. In future work, we could also save memory with dictionary-encoded columns.

This also should be interoperable with the Arrow support in Pandas v2, which GeoPandas also supports.

This mostly supersedes https://github.com/stac-utils/stac-geoparquet/pull/27 but is created as a separate PR as it deletes the WIP streaming.py implementation from that PR.

TomAugspurger commented 5 months ago

Thanks!

I'm trying to round trip some NAIP items from the PC:

import pystac_client
import stac_geoparquet.to_parquet
import stac_geoparquet.from_arrow
import stac_geoparquet.to_arrow

items = list(
    pystac_client.Client.open("https://planetarycomputer.microsoft.com/api/stac/v1")
    .search(collections="naip", max_items=4)
    .items_as_dicts()
)
table = stac_geoparquet.to_arrow.parse_stac_items_to_arrow(items)
items2 = list(stac_geoparquet.from_arrow.stac_table_to_items(table))

and am hitting

TypeError                                 Traceback (most recent call last)
Cell In[11], line 12
      6 items = list(
      7     pystac_client.Client.open("https://planetarycomputer.microsoft.com/api/stac/v1")
      8     .search(collections="naip", max_items=4)
      9     .items_as_dicts()
     10 )
     11 table = stac_geoparquet.to_arrow.parse_stac_items_to_arrow(items)
---> 12 items2 = list(stac_geoparquet.from_arrow.stac_table_to_items(table))

File ~/src/stac-utils/stac-geoparquet/stac_geoparquet/from_arrow.py:27, in stac_table_to_items(table)
     24 # Convert WKB geometry column to GeoJSON, and then assign the geojson geometry when
     25 # converting each row to a dictionary.
     26 for batch in table.to_batches():
---> 27     geoms = shapely.from_wkb(batch["geometry"])
     28     geojson_strings = shapely.to_geojson(geoms)
     30     # RecordBatch is missing a `drop()` method, so we keep all columns other than
     31     # geometry instead

File ~/src/stac-utils/stac-geoparquet/.direnv/python-3.10.10/lib/python3.10/site-packages/shapely/io.py:320, in from_wkb(geometry, on_invalid, **kwargs)
    316 # ensure the input has object dtype, to avoid numpy inferring it as a
    317 # fixed-length string dtype (which removes trailing null bytes upon access
    318 # of array elements)
    319 geometry = np.asarray(geometry, dtype=object)
--> 320 return lib.from_wkb(geometry, invalid_handler, **kwargs)

TypeError: Expected bytes or string, got dict

Seems like the geometry column is GeoJSON-like, but should be WKB?

In [18]: table["geometry"].to_pylist()
Out[18]: 
[{'coordinates': [[[-65.683663, 18.184851],
    [-65.684718, 18.253643],
    [-65.75386, 18.25266],
    [-65.752778, 18.183872],
    [-65.683663, 18.184851]]],
  'type': 'Polygon'},
 {'coordinates': [[[-65.746142, 18.184853],
    [-65.747222, 18.253666],
    [-65.816382, 18.25266],
    [-65.815275, 18.183852],
    [-65.746142, 18.184853]]],
  'type': 'Polygon'},
 {'coordinates': [[[-65.558704, 18.309849],
    [-65.559716, 18.378606],
    [-65.628821, 18.377663],
    [-65.627781, 18.30891],
    [-65.558704, 18.309849]]],
  'type': 'Polygon'},
 {'coordinates': [[[-65.496227, 18.309844],
    [-65.497215, 18.378583],
    [-65.566297, 18.377663],
    [-65.565282, 18.308929],
    [-65.496227, 18.309844]]],
  'type': 'Polygon'}]

kylebarron commented 5 months ago

> Speaking of tests, thoughts on adding some basic ones, mainly making sure that round-trip between list[Item] <-> Table works?

Definitely. The nice part about this Arrow work is that it's a direct in-memory counterpart to the Parquet schema. So we can mainly test the Arrow interop and get the Parquet functionality for free, without having to test that step as rigorously.

> Do you want to wait for https://github.com/stac-utils/stac-geoparquet/issues/39 to tackle tests?

Yeah that sounds good.

TomAugspurger commented 5 months ago

Great, thanks!

stac-utils / stac-geoparquet

STAC Interoperability with Arrow #37
