Store optional bounding box information in the column metadata

jorisvandenbossche commented 2 years ago

There was a bit of discussion around this in https://github.com/opengeospatial/cdw-geo/pull/4.

The proposal is to add an optional column metadata field (alongside the currently required "crs" and "encoding" fields) that describes the bounding box of the full file (so the overall bounding box or envelope of all geometries in the file).

In the geo-arrow-spec version of this metadata specification, we are already using it (https://github.com/geopandas/geo-arrow-spec/blob/main/metadata.md#bounding-boxes), and there it takes the form of a list that specifies the minimum and maximum values of each dimension. So for 2D data it would look like "bbox": [<xmin>, <ymin>, <xmax>, <ymax>].
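To make the layout concrete, here is a minimal sketch of how a writer might compute such a list and place it next to the other fields. The sample coordinates and the `total_bounds` helper are hypothetical, and the "crs" and "encoding" values are placeholders; only the ordering of the "bbox" list follows the description above.

```python
import json


def total_bounds(points):
    """Return [xmin, ymin, xmax, ymax] for an iterable of (x, y) pairs."""
    xs, ys = zip(*points)
    return [min(xs), min(ys), max(xs), max(ys)]


# Hypothetical sample coordinates standing in for every vertex in the file.
points = [(4.35, 50.85), (2.35, 48.86), (13.40, 52.52)]

# "crs" and "encoding" are placeholder values for illustration only.
column_metadata = {
    "crs": "<crs definition>",
    "encoding": "WKB",
    "bbox": total_bounds(points),
}

# The metadata is JSON, so the bbox serializes as a plain array of numbers.
print(json.dumps(column_metadata["bbox"]))
```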

This format aligns with, for example, the GeoJSON spec (https://datatracker.ietf.org/doc/html/rfc7946#section-5).

This optional information can be useful when processing the data. For example, dask-geopandas already uses it to filter partitions (sub-datasets) of a dataset. With Parquet, people often use "partitioned datasets", where the dataset consists of (potentially nested directories of) many smaller Parquet files. In that situation, you could spatially sort the data when dividing it into partitions, so that each individual file contains the data of a certain region. If each Parquet file then stores the bounding box of its geometries, a reader can skip files that do not overlap a spatial query and read only the ones it needs (a kind of "predicate pushdown", as can be done for Parquet based on column statistics).
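A rough sketch of that filtering step, assuming the per-file bounding boxes have already been read from the metadata into a dictionary. The file names, the boxes, and both helper functions are hypothetical; this is not how dask-geopandas implements it, only an illustration of the idea.

```python
def bbox_intersects(a, b):
    """True if two [xmin, ymin, xmax, ymax] boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


# Hypothetical partition files with the bbox stored in each file's metadata.
partition_bboxes = {
    "part-0.parquet": [0.0, 0.0, 10.0, 10.0],
    "part-1.parquet": [10.5, 0.0, 20.0, 10.0],
    "part-2.parquet": [0.0, 10.5, 10.0, 20.0],
}


def files_to_read(query_bbox, bboxes):
    """Keep only the files whose bbox overlaps the query window."""
    return [path for path, bbox in bboxes.items()
            if bbox_intersects(bbox, query_bbox)]


# A query window straddling the first two partitions; part-2 is skipped
# without opening it.
selected = files_to_read([8.0, 2.0, 12.0, 4.0], partition_bboxes)
print(selected)
```

Note that the bbox test only rules files out. Files that pass still need an exact geometry check after reading.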

cholmes commented 2 years ago

+1

I'm curious about the argument for making it optional rather than required, or at least recommended. Enabling spatial sorting of partitioned datasets seems like a pretty big win. I suppose we could have an 'extension' for the partitioned data use case where the bounds are required.

jorisvandenbossche commented 2 years ago

I suppose the main reason to have it optional is that it might require an additional computation to obtain those bbox values when writing the data (similarly in Parquet, column statistics (min/max) are optional). But I don't feel strongly about having it optional. And having it "recommended" is certainly good.

cholmes commented 2 years ago

> similarly in Parquet, column statistics (min/max) are optional

Cool, that seems like a good precedent to follow. Let's go with 'recommended' then, and explain why it's good to have.

alasarr commented 2 years ago

+1

paleolimbot commented 2 years ago

+1 for "optional" (as is the case for most other spatial formats, whose writing can be done faster without computing anything).

It's worth considering the case of lon/lat here, where a rectangular bounding box is at worst "invalid" and at best "odd" once one gets close to the north pole, the south pole, or the international date line. S2's latlngrect and PROJ's "area" both can return a rectangle with something like "left_lon" and "right_lon" (rather than min/max) to address that. For geodetic coords, an S2 "covering" is a better choice anyway.
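As a small illustration of why left/right edges behave differently from min/max near the date line, consider the sketch below. The `lon_in_range` helper and the convention that west > east means "crosses the antimeridian" are assumptions made for this example; this is not the S2 or PROJ API.

```python
def lon_in_range(lon, west, east):
    """Longitude containment for a box given by its west and east edges.

    Assumed convention: when west > east, the box crosses the antimeridian.
    """
    if west <= east:
        return west <= lon <= east
    return lon >= west or lon <= east


# A box from 170 degrees east to 170 degrees west, crossing the date line.
west, east = 170.0, -170.0
inside_wrapped = lon_in_range(179.0, west, east)   # near the date line
outside_wrapped = lon_in_range(0.0, west, east)    # Greenwich

# Storing the same two longitudes as min/max instead describes a box that
# covers almost the whole globe, including Greenwich.
xmin, xmax = min(west, east), max(west, east)
greenwich_in_minmax = xmin <= 0.0 <= xmax
```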

If a proper spatial index is an option (#13), that might be a better choice.

From a read perspective, if each rowgroup could get its own "bounding box" that would be even better than per-file.

Another thing to consider is that readers have to be very careful to invalidate the bounding box once a subset is computed (in the R bindings this is currently something that happens with a blind call to read_parquet()).

cholmes commented 2 years ago

Will be in the metadata, so will be JSON. Just an array of 4 numbers.

jorisvandenbossche commented 2 years ago

I opened an initial PR for this at https://github.com/opengeospatial/geoparquet/pull/21

> It's worth considering the case of lon/lat here, where a rectangular bounding box is at worst "invalid" and at best "odd" ..

Yes, that is a good question, and I am not fully sure what to do about it. I also noted that on the PR (https://github.com/opengeospatial/geoparquet/pull/21#issuecomment-1056737550). The GeoJSON spec mentions that the edges are basically planar straight lines.

> From a read perspective, if each rowgroup could get its own "bounding box" that would be even better than per-file.

Unfortunately, in the Arrow implementation of Parquet, we currently don't have access to the rowgroup's column chunk metadata (see somewhat related issue about this at https://issues.apache.org/jira/browse/ARROW-15548)

> Another thing to consider is that readers have to be very careful to invalidate the bounding box once a subset is computed

Yes, that's indeed a responsibility of the reader (although as long as you only take subsets, the bbox will not really be "invalid", just larger than strictly necessary).
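A quick sketch of that last point, using hypothetical points and helper functions: the bbox stored for the full data still contains the bbox of any subset, so it stays a correct (if loose) bound until the reader recomputes it.

```python
def total_bounds(points):
    """Return [xmin, ymin, xmax, ymax] for an iterable of (x, y) pairs."""
    xs, ys = zip(*points)
    return [min(xs), min(ys), max(xs), max(ys)]


def bbox_contains(outer, inner):
    """True if the outer box fully contains the inner box."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


# Hypothetical data: the bbox stored at write time covers all three points.
all_points = [(0.0, 0.0), (2.0, 3.0), (10.0, 8.0)]
stored_bbox = total_bounds(all_points)

# After taking a subset, the stored bbox is stale but still a valid bound.
subset = all_points[:2]
subset_bbox = total_bounds(subset)
still_valid = bbox_contains(stored_bbox, subset_bbox)
```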

opengeospatial/geoparquet, issue #8