stac-utils / stac-geoparquet

Convert STAC items between JSON, GeoParquet, pgstac, and Delta Lake.
https://stac-utils.github.io/stac-geoparquet/
MIT License

Write to Delta Lake #58

Closed kylebarron closed 3 months ago

kylebarron commented 3 months ago

This PR adds a new function parse_stac_ndjson_to_delta_lake to convert a JSON source to a Delta Lake table. It is based on https://github.com/stac-utils/stac-geoparquet/pull/57, so only look at the most recent commits, and that PR should be merged first.

There's a complication here: Delta Lake refuses to write any column whose inferred data type is null, raising:

_internal.SchemaMismatchError: Invalid data type for Delta Lake: Null

This is a problem because if a JSON key is null in every item of a STAC Collection, that column gets inferred as an Arrow null type. For example, the 3dep-lidar-copc collection in the tests has start_datetime and end_datetime fields, so according to the spec its datetime is always null. This means we cannot write this collection to Delta Lake with automatic schema inference alone.

In the latest commit I started to implement some manual schema modifications for datetime and proj:epsg, which fixed the error for 3dep-lidar-copc. But 3dep-lidar-dsm has more fields that are inferred as null. In particular the schema paths:

properties.raster:bands.pdal_pipeline.[].filename
properties.raster:bands.pdal_pipeline.[].resolution

are both null. It's not ideal to hard-code manual overrides for every extension, so we should discuss how to handle this.

Possible options:

kylebarron commented 3 months ago

This PR should be ready to go. It doesn't yet solve the null-type issue; for now, in those cases the user must handle schema resolution manually.

In a follow-up PR we may want to consider defaulting null types to string, but that could complicate schema evolution if later data has non-null values for those STAC keys.