Closed kylebarron closed 5 months ago
Thanks!
I'm trying to round trip some NAIP items from the PC:
import pystac_client
import stac_geoparquet.to_parquet
import stac_geoparquet.from_arrow
import stac_geoparquet.to_arrow
items = list(
    pystac_client.Client.open("https://planetarycomputer.microsoft.com/api/stac/v1")
    .search(collections="naip", max_items=4)
    .items_as_dicts()
)
table = stac_geoparquet.to_arrow.parse_stac_items_to_arrow(items)
items2 = list(stac_geoparquet.from_arrow.stac_table_to_items(table))
and am hitting
TypeError Traceback (most recent call last)
Cell In[11], line 12
6 items = list(
7 pystac_client.Client.open("https://planetarycomputer.microsoft.com/api/stac/v1")
8 .search(collections="naip", max_items=4)
9 .items_as_dicts()
10 )
11 table = stac_geoparquet.to_arrow.parse_stac_items_to_arrow(items)
---> 12 items2 = list(stac_geoparquet.from_arrow.stac_table_to_items(table))
File ~/src/stac-utils/stac-geoparquet/stac_geoparquet/from_arrow.py:27, in stac_table_to_items(table)
24 # Convert WKB geometry column to GeoJSON, and then assign the geojson geometry when
25 # converting each row to a dictionary.
26 for batch in table.to_batches():
---> 27 geoms = shapely.from_wkb(batch["geometry"])
28 geojson_strings = shapely.to_geojson(geoms)
30 # RecordBatch is missing a `drop()` method, so we keep all columns other than
31 # geometry instead
File ~/src/stac-utils/stac-geoparquet/.direnv/python-3.10.10/lib/python3.10/site-packages/shapely/io.py:320, in from_wkb(geometry, on_invalid, **kwargs)
316 # ensure the input has object dtype, to avoid numpy inferring it as a
317 # fixed-length string dtype (which removes trailing null bytes upon access
318 # of array elements)
319 geometry = np.asarray(geometry, dtype=object)
--> 320 return lib.from_wkb(geometry, invalid_handler, **kwargs)
TypeError: Expected bytes or string, got dict
Seems like the geometry column is GeoJSON-like, but should be WKB?
In [18]: table["geometry"].to_pylist()
Out[18]:
[{'coordinates': [[[-65.683663, 18.184851],
[-65.684718, 18.253643],
[-65.75386, 18.25266],
[-65.752778, 18.183872],
[-65.683663, 18.184851]]],
'type': 'Polygon'},
{'coordinates': [[[-65.746142, 18.184853],
[-65.747222, 18.253666],
[-65.816382, 18.25266],
[-65.815275, 18.183852],
[-65.746142, 18.184853]]],
'type': 'Polygon'},
{'coordinates': [[[-65.558704, 18.309849],
[-65.559716, 18.378606],
[-65.628821, 18.377663],
[-65.627781, 18.30891],
[-65.558704, 18.309849]]],
'type': 'Polygon'},
{'coordinates': [[[-65.496227, 18.309844],
[-65.497215, 18.378583],
[-65.566297, 18.377663],
[-65.565282, 18.308929],
[-65.496227, 18.309844]]],
'type': 'Polygon'}]
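For reference, here is a minimal stdlib-only sketch of what the WKB form of one of these GeoJSON-like Polygon dicts would look like. It is only meant to show why `shapely.from_wkb` expects bytes here rather than a dict. The helper name `polygon_to_wkb` is hypothetical and is not part of stac-geoparquet; it handles only 2D Polygons, and in practice something like `shapely.geometry.shape` followed by `shapely.to_wkb` would be the more likely route.

```python
import struct


def polygon_to_wkb(geometry: dict) -> bytes:
    """Encode a 2D GeoJSON-like Polygon dict as little-endian WKB (illustrative only)."""
    if geometry.get("type") != "Polygon":
        raise ValueError("this sketch only handles Polygon geometries")
    rings = geometry["coordinates"]
    # Header: byte order (1 = little endian), geometry type (3 = Polygon), ring count.
    parts = [struct.pack("<BII", 1, 3, len(rings))]
    for ring in rings:
        # Each ring is a point count followed by (x, y) pairs of float64.
        parts.append(struct.pack("<I", len(ring)))
        for x, y in ring:
            parts.append(struct.pack("<dd", x, y))
    return b"".join(parts)


# First geometry from the output above.
geom = {
    "type": "Polygon",
    "coordinates": [
        [
            [-65.683663, 18.184851],
            [-65.684718, 18.253643],
            [-65.75386, 18.25266],
            [-65.752778, 18.183872],
            [-65.683663, 18.184851],
        ]
    ],
}
wkb = polygon_to_wkb(geom)
```

The result is a `bytes` object, which is the kind of value the `geometry` column would need to hold for the `from_wkb` call in the traceback to succeed.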
Speaking of tests, thoughts on adding some basic ones, mainly making sure that round-trip between list[Item] <-> Table works?
Definitely. The nice part about this Arrow work is that it's a direct in-memory counterpart to the Parquet schema. So we can mainly test the Arrow interop and get the Parquet functionality for free, without having to test that step as rigorously.
Do you want to wait for https://github.com/stac-utils/stac-geoparquet/issues/39 to tackle tests?
Yeah that sounds good.
Great, thanks!
This is a clean up to https://github.com/stac-utils/stac-geoparquet/pull/27, which implemented a work-in-progress converter to and from Arrow memory, originally done during the STAC sprint.
Change list
- Add a `schema` argument for advanced users who know the schema of their STAC items. Note that this schema is applied after conversion to WKB but before any other conversions.
- Convert the `bbox` column to a struct-type column to align with GeoParquet 1.1.
This approach may be preferred in some cases. It should be more memory efficient than the existing pandas approach, it's minimally manual (basically we offload all schema inference into the `pa.array` constructor), and it enforces strict schemas via the inferred Arrow schema. In future work, we could also save memory with dictionary-encoded columns.
This should also be interoperable with the Arrow support in pandas v2, which GeoPandas also supports.
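To make the `bbox` change concrete, here is a rough sketch of mapping a STAC `bbox` list to a struct-shaped record with `xmin`/`ymin`/`xmax`/`ymax` fields, the field names GeoParquet 1.1 uses for its bounding-box struct. The helper `bbox_to_struct` is hypothetical and is not the code in this PR. The 3D branch is an assumption based on the GeoJSON ordering `[xmin, ymin, zmin, xmax, ymax, zmax]`, and the sample values are simply the extent of the first NAIP polygon quoted earlier in the thread.

```python
def bbox_to_struct(bbox):
    """Map a STAC bbox list to a dict with named fields (illustrative only)."""
    if len(bbox) == 4:
        xmin, ymin, xmax, ymax = bbox
        return {"xmin": xmin, "ymin": ymin, "xmax": xmax, "ymax": ymax}
    if len(bbox) == 6:
        # Assumed 3D ordering, following GeoJSON: mins first, then maxes.
        xmin, ymin, zmin, xmax, ymax, zmax = bbox
        return {
            "xmin": xmin, "ymin": ymin, "zmin": zmin,
            "xmax": xmax, "ymax": ymax, "zmax": zmax,
        }
    raise ValueError(f"expected 4 or 6 bbox values, got {len(bbox)}")


# Illustrative 2D bbox, taken from the extent of the first polygon above.
record = bbox_to_struct([-65.75386, 18.183872, -65.683663, 18.253643])
```

A column of such records has named fields rather than positional list entries, which is what lets readers filter on, say, `xmin` without unpacking a list.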
This mostly supersedes https://github.com/stac-utils/stac-geoparquet/pull/27 but is created as a separate PR as it deletes the WIP `streaming.py` implementation from that PR.