opengeospatial / geoparquet

Specification for storing geospatial vector data (point, line, polygon) in Parquet
https://geoparquet.org
Apache License 2.0

Clarify usage with nested and repeated columns #47

Closed: mentin closed this issue 1 year ago

mentin commented 2 years ago

The Parquet format supports nested and repeated fields. I assume the geometry columns are not limited to the top-level columns, and can be both nested and repeated.

1. Names

The format spec talks about column names; however, with a nested structure a name might not uniquely identify a column.

I suggest using a column path (like "a.b.c") in the docs to avoid the ambiguity. It would coincide with the column name in the typical case of a top-level geometry.
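To illustrate the ambiguity, here is a minimal sketch in plain Python. The schema, its field names, and the `leaf_paths` helper are hypothetical and only model a nested Parquet schema as a dict; they are not part of the spec or of any Parquet library.

```python
from typing import Iterator

# Hypothetical nested schema: dict values are groups, strings are leaf types.
schema = {
    "id": "INT64",
    "geometry": "BYTE_ARRAY",
    "building": {
        "name": "BYTE_ARRAY",
        "geometry": "BYTE_ARRAY",
    },
}


def leaf_paths(node: dict, prefix: tuple = ()) -> Iterator[tuple]:
    """Yield the full path of every leaf column, depth first."""
    for name, child in node.items():
        path = prefix + (name,)
        if isinstance(child, dict):
            yield from leaf_paths(child, path)
        else:
            yield path


paths = list(leaf_paths(schema))
names = [p[-1] for p in paths]          # bare names: "geometry" occurs twice
dotted = [".".join(p) for p in paths]   # paths: every entry is unique
```

Here the bare name "geometry" matches two leaf columns, while "geometry" and "building.geometry" as paths each identify exactly one. For a top-level column the path and the name are the same string.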

2. Primary column

Can the primary column be a nested column, or a repeated column, i.e. contain a list of geography values?

There is nothing that prevents this in the standard, but I guess primary_column was designed to map to the built-in geometry column in formats like GeoJSON or shapefiles, and these assume non-repeated top-level columns. We can either

cholmes commented 2 years ago

Thanks for the great feedback!

For 1. I think the column path makes good sense.

For 2. I lean towards restricting the primary geometry column to be top-level, so that conversion to GeoJSON / shapefile is clear and straightforward to implement. And I suppose making primary_column optional makes sense, but I feel like it'd be good to have something nudging people towards defining it if possible. But I certainly see the usefulness of allowing big Parquet datasets that just have a nested geospatial value to be compliant without making them say 'this is a geo file'.

felixpalmer commented 2 years ago

I agree on point 1.

My feeling on 2 is that the primary_column should be restricted to be a top-level column, for a couple of reasons:

cholmes commented 1 year ago

Call 11/7

For the first version (1.0.0) we want to limit geometry columns to the top level only. Very few geospatial packages would be able to understand nested geometry columns. But if someone has a use case for nested geometry columns we can potentially add support in the future.

And the repetition must be optional or required (not repeated).

We need to update the spec's description of geometry columns to state explicitly that we don't support group (nested) fields and that the repetition must be required or optional.
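The rule from the call can be sketched as a small check. This is an illustration only: the `Field` class, the `check_geometry_column` helper, and the example schema are hypothetical stand-ins for a Parquet schema, not the spec's wording or any library's API.

```python
from dataclasses import dataclass, field


@dataclass
class Field:
    name: str
    repetition: str  # "required", "optional", or "repeated"
    children: list = field(default_factory=list)  # non-empty means a group


def check_geometry_column(root_fields: list, column: str) -> list:
    """Return a list of problems; an empty list means the column passes."""
    matches = [f for f in root_fields if f.name == column]
    if not matches:
        # Anything that is not a direct child of the root is rejected.
        return [f"{column!r} is not a top-level column"]
    found = matches[0]
    problems = []
    if found.children:
        problems.append(f"{column!r} is a group, not a plain column")
    if found.repetition == "repeated":
        problems.append(f"{column!r} is repeated; use required or optional")
    return problems


# Hypothetical schema used only to exercise the check.
example = [
    Field("geometry", "optional"),
    Field("points", "repeated"),
    Field("parts", "repeated", [Field("geometry", "required")]),
]

ok = check_geometry_column(example, "geometry")
repeated = check_geometry_column(example, "points")
nested = check_geometry_column(example, "parts.geometry")
```

In this sketch only "geometry" passes. "points" fails because it is repeated, and "parts.geometry" fails because it is not a direct child of the root.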

mentin commented 1 year ago

I think it is the right decision for v1.

But I also wonder if there are many geospatial packages that support multiple geometry columns? I would think most that don't support nesting / repetition would also ignore all the columns besides "primary_column", and then nesting / repetition of additional geometry columns should not matter :).

We do have several customers who use repeated geometry columns. Typically, the primary geometry column is a top-level required column, and it is broken into parts, which are stored as nested and/or repeated columns. What I remember:

In these cases the primary geometry column is non-nested and non-repeated, but there are other columns that are nested inside a repeated struct.

tschaub commented 1 year ago

Yeah, I can imagine this will be something that is revisited. From a writer's perspective, given that Parquet is capable of representing repeated and group fields, it is somewhat odd that a "geo" extension would restrict that. I guess we are anticipating the needs of readers in adding this restriction - but it may turn out to be unnecessarily restrictive.

jorisvandenbossche commented 1 year ago

> But I also wonder if there are many geospatial packages that support multiple geometry columns?

GeoPandas supports this, and it seems R sf does as well (https://cran.r-project.org/web/packages/sf/vignettes/sf6.html#how-does-sf-deal-with-secondary-geometry-columns). PostGIS supports this as well (https://gis.stackexchange.com/questions/176263/can-a-postgis-table-or-view-have-two-geometry-columns). I know that GDAL also supports this in its OGR data model and C API, but whether it's actually supported depends on the bindings to GDAL (I know that the Python bindings right now will only return a single (first) geometry column).

I can certainly see the use case of repeated (list/array type) geometry columns. I also assume that databases (like BigQuery) that have both a proper array type and geometry/geography type will typically not limit combining those two in a repeated geometry type?