opengeospatial / geoparquet

Specification for storing geospatial vector data (point, line, polygon) in Parquet
https://geoparquet.org
Apache License 2.0

Metadata encoding options for GeoArrow-encoded columns in GeoParquet metadata #185

Closed paleolimbot closed 8 months ago

paleolimbot commented 1 year ago

Since #1 was opened, the https://github.com/geoarrow/geoarrow repo has seen quite a lot of exciting activity...we're getting close to releasing our initial version! We have had a lot of great anecdotal conversations about how or if geometry encoded as GeoArrow should be included in this specification and I wanted to open this issue to formalize some of the points that have been made.

Anecdotally, there has been general agreement that including a columnar-friendly memory layout (i.e., one that does not require a parser of any kind to access coordinate values) as an option under the "encoding" metadata key would be good for GeoParquet because:

I think there are two orthogonal things to consider if GeoArrow will be included in a future (e.g., 1.1.0) GeoParquet specification. First, there is the question of how to structure the "encoding". Currently GDAL's experimental support uses the extension type name (as summarised here: https://github.com/geoarrow/geoarrow/blob/main/extension-types.md ) as the encoding key. This is sufficient for a reader to reconstruct a GeoArrow type when reading a Parquet file:

{"encoding": "geoarrow.point", ...}

We could also just use "geoarrow" and declare the extension name somewhere else:

{"encoding": "geoarrow", "extension_name": "point"}

...or infer the extension name from the geometry type:

{"encoding": "geoarrow", ..., "geometry_types": ["Point"]}

(I don't like that last option because there are GeoArrow extension types for WKT and WKB. Even if they aren't necessarily allowed/encouraged for use in this spec, I don't think we can guarantee that there is one canonical extension name per combination of geometry types and functionally the extension name is what is required for a reader implementation)
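To make the three shapes concrete, here is a minimal Python sketch of how a reader might resolve the GeoArrow type name from each one. The function name resolve_geoarrow_type is hypothetical, and lowercasing the single geometry_types entry is an assumption for illustration (it ignores dimension suffixes such as " Z"); none of this is part of the GeoParquet or GeoArrow specifications.

```python
def resolve_geoarrow_type(col):
    """Return a GeoArrow extension name such as 'geoarrow.point'.

    `col` is the per-column metadata dict. Hypothetical helper covering
    the three candidate metadata shapes discussed above.
    """
    encoding = col["encoding"]

    # Option 1: the encoding value is already the full extension name.
    if encoding.startswith("geoarrow."):
        return encoding

    if encoding != "geoarrow":
        raise ValueError(f"not a GeoArrow encoding: {encoding!r}")

    # Option 2: the extension name is declared in a separate key.
    if "extension_name" in col:
        return "geoarrow." + col["extension_name"]

    # Option 3: infer from geometry_types (assumed mapping: lowercase the
    # name). This only works when exactly one geometry type is listed.
    types = col.get("geometry_types", [])
    if len(types) != 1:
        raise ValueError("cannot infer a single GeoArrow type from geometry_types")
    return "geoarrow." + types[0].lower()
```

Only the first two shapes give the reader the extension name directly; the third depends on a naming convention that the reader has to know in advance.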

The second consideration is which GeoArrow memory layouts to allow. The GeoArrow specification, like the Arrow specification, expanded to fit ways that we know people are already storing geospatial data in Arrow (i.e., it is currently more descriptive than prescriptive). The GeoParquet format and the discussions that went into creating it seem to favour a more prescriptive approach (i.e., restricting the allowed encodings/values to simplify implementations). For example, GeoParquet could provide language like:

The only GeoArrow extensions that may be encoded in GeoParquet are geoarrow.point, geoarrow.multipoint, geoarrow.multilinestring, and geoarrow.multipolygon, and these extensions must be written using the struct/separated coordinate encoding.

The other end of the spectrum would be to just punt to the GeoArrow spec and allow any of the values we've defined.

The extension name refers to the extension names defined in the GeoArrow specification at XXXX.
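As a rough sketch of what each end of the spectrum would mean for a validator: the allowed set below is copied from the example wording above and is not a decided list, and the function name is_allowed and the layout labels "separated" and "interleaved" are hypothetical.

```python
# Extension names from the example wording above; the final list is undecided.
ALLOWED_PRESCRIPTIVE = frozenset({
    "geoarrow.point",
    "geoarrow.multipoint",
    "geoarrow.multilinestring",
    "geoarrow.multipolygon",
})


def is_allowed(extension_name, coord_layout, prescriptive=True):
    """Check a column against one of the two candidate policies.

    Prescriptive: only the listed extensions, and only with the
    struct/separated coordinate layout.
    Permissive: defer to the GeoArrow spec and accept any geoarrow.* name.
    """
    if prescriptive:
        return (extension_name in ALLOWED_PRESCRIPTIVE
                and coord_layout == "separated")
    return extension_name.startswith("geoarrow.")
```

Under the prescriptive policy, is_allowed("geoarrow.polygon", "separated") is False; under the permissive one, the same column is accepted.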

Looking forward to discussion on this! (cc @kylebarron @jorisvandenbossche )

jorisvandenbossche commented 1 year ago

I don't like that last option because there are GeoArrow extension types for WKT and WKB. Even if they aren't necessarily allowed/encouraged for use in this spec, I don't think we can guarantee that there is one canonical extension name per combination of geometry types and functionally the extension name is what is required for a reader implementation

I do think that we should probably require "encoding": "WKB" for those cases and disallow "encoding": "geoarrow.wkb", because otherwise there would be two ways to specify the same thing. And while this requires some name mapping in geoarrow-aware writers, it ensures that all existing readers will still work fine for files using WKB.

(which I think also makes this option of using "encoding": "geoarrow" combined with geometry_types a possibility, although still not necessarily a preferred option)
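The name mapping on the writer side could be as small as the following sketch. The function name geoparquet_encoding is hypothetical, and it assumes the variant where the other GeoArrow types use their extension name directly as the "encoding" value. geoarrow.wkt is rejected rather than mapped because GeoParquet 1.0 defines only "WKB" as an encoding.

```python
def geoparquet_encoding(geoarrow_name):
    """Map a GeoArrow extension name to the value written under "encoding".

    Hypothetical writer-side helper: WKB keeps the existing GeoParquet
    spelling so that current readers continue to work.
    """
    if geoarrow_name == "geoarrow.wkb":
        return "WKB"
    if geoarrow_name == "geoarrow.wkt":
        # No WKT encoding exists in GeoParquet 1.0, so refuse to write it.
        raise ValueError("WKT is not a GeoParquet encoding")
    return geoarrow_name
```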

The second consideration is which GeoArrow memory layouts to allow.

I think it would be best to explicitly list the options that are allowed. We can always expand that later if geoarrow grows more options. (For the example you gave, is there a reason you only listed "geoarrow.multipolygon" and not "geoarrow.polygon"?)

For the interleaved vs separated layout: I think it is clear that the separated layout has the most benefit in combination with Parquet, because of the statistics you get for free (and maybe better compression / faster reading). But I am not fully sure we should only allow that layout. It's certainly possible to have a case where you don't care about this, and you just need the fastest possible option to store and re-read a bunch of data. And if your target system needs interleaved data (like shapely/geopandas), storing as interleaved might be the fastest option (although I should verify this in practice!)
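A plain-Python illustration of the statistics point, with no Parquet library involved: Parquet keeps min/max statistics per leaf column, so the sketch below assumes that separated x and y each get their own leaf column while interleaved values all land in a single one. The function names are hypothetical.

```python
def separated_stats(coords):
    """Per-column (min, max) when x and y are stored as separate columns."""
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return {"x": (min(xs), max(xs)), "y": (min(ys), max(ys))}


def interleaved_stats(coords):
    """(min, max) of one column holding x0, y0, x1, y1, ..."""
    flat = [value for xy in coords for value in xy]
    return (min(flat), max(flat))


coords = [(0.0, 1.0), (2.0, 5.0), (-3.0, 4.0)]

# Separated: the two ranges together form the bounding box.
print(separated_stats(coords))

# Interleaved: a single range that mixes x and y values.
print(interleaved_stats(coords))
```

With the separated layout the column statistics double as a bounding box for each row group; with the interleaved layout the single range cannot be split back into per-axis bounds.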

For the actual specification update, we should probably detail, for each geoarrow type, which Parquet type it maps to.
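As a starting point for that mapping, here is a sketch of the nesting of each GeoArrow type as I understand the GeoArrow memory layouts: a point is either a struct of x/y doubles (separated) or a fixed-size list of doubles (interleaved), and the other types wrap points in one to three levels of lists. The type strings are informal notation for 2D data only, and the function name nested_layout is hypothetical. How each of these translates to a Parquet schema is exactly what the spec text would need to pin down.

```python
# Number of list levels wrapped around the point type (2D case only).
LIST_DEPTH = {
    "geoarrow.point": 0,
    "geoarrow.linestring": 1,
    "geoarrow.polygon": 2,
    "geoarrow.multipoint": 1,
    "geoarrow.multilinestring": 2,
    "geoarrow.multipolygon": 3,
}


def nested_layout(name, separated=True):
    """Return an informal type string for a GeoArrow extension name."""
    if separated:
        layout = "struct<x: double, y: double>"
    else:
        layout = "fixed_size_list<double>[2]"
    # Wrap the point type in one list per nesting level.
    for _ in range(LIST_DEPTH[name]):
        layout = f"list<{layout}>"
    return layout
```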

jorisvandenbossche commented 1 year ago

Some advantages/disadvantages I can think of for the different ways to specify this:

{"encoding": "geoarrow.point", ...}

Pro is that the encoding value fully describes the geoarrow type. But a disadvantage is that this adds a whole series of possible values for the "encoding" key. This makes handling of this key a bit more complex (although in Python terms it would be col["encoding"].startswith("geoarrow") instead of col["encoding"] == "geoarrow")

{"encoding": "geoarrow", "extension_name": "geoarrow.point"}

Pro is that this adds only a single new "encoding" value. But then you also still need to check the value of the other key to get the actual type. If we go with this, I would rather use a different key than "extension_name". The "extension" in this is a rather Arrow-specific term, and while the encoding itself is also called "geoarrow", this can still be implemented by Parquet implementations or systems that don't have anything to do with Arrow. We could also use a more generic "geoarrow_type"?

{"encoding": "geoarrow", ..., "geometry_types": ["Point"]}

Similar advantage of only adding a single "encoding" value, and the additional advantage of not having to add a custom key that is only needed for geoarrow-encoded data like above. But a clear disadvantage is that you need to transform and combine the two keys manually to get the actual geoarrow type name.