stac-utils / stac-geoparquet

Convert STAC items between JSON, GeoParquet, pgstac, and Delta Lake.
https://stac-utils.github.io/stac-geoparquet/
MIT License

Scalable schema-naive ingestion #49

Open kylebarron opened 4 months ago

kylebarron commented 4 months ago

In general, converting STAC to GeoParquet runs into schema inference issues, because GeoParquet needs a strict schema while STAC can have a much looser schema, or a schema that changes per row.

The current Arrow-based conversion approach uses two alternate methods:

Instead, in chatting with @bitner, we realized that we could improve on these two approaches by leveraging the knowledge that we're working with STAC spec objects. As long as the user knows which extensions are included in a collection, stac-geoparquet can construct up front the maximal Arrow schema allowed by the STAC Item specification and those extensions. This requires minimal work from the end user while enabling streaming conversion of JSON data into GeoParquet.

To avoid the user needing to know the full set of asset names, we define assets under a Map type, which has pros and cons as noted in #48. In particular, it's not possible to statically infer the asset key names from the Parquet schema using a Map type, and it's also not possible to access data from only a single asset without downloading data for every asset. E.g. if you wanted to know the red asset's href, you'd have to download the hrefs for all assets, while a struct type would allow you to access only the red href column.

But converting first into a Map-based GeoParquet file, as we do in this PR, could make for an efficient ingestion process, because it would allow us to quickly find the full set of asset names.

So this scalable STAC ingestion would become a two-step process:

  1. Convert STAC to a "flexible schema GeoParquet"
  2. Convert this intermediate Parquet format into STAC-GeoParquet spec-compliant files. This step could also exclude any columns that are defined by the spec but not included in any JSON file (it is easy to tell from the Parquet metadata whether a column is entirely null).

The second step becomes much easier when it runs on the output of the first step instead of trying to start directly from JSON files.

Change list

This heavily uses pyarrow.unify_schemas to be able to work with partial schemas (for the core spec and for each extension).

This continues the discussion started in https://github.com/stac-utils/stac-geoparquet/issues/48.