opengeospatial / geoparquet

Specification for storing geospatial vector data (point, line, polygon) in Parquet
https://geoparquet.org
Apache License 2.0
776 stars 54 forks

Thoughts about a first-class GEOMETRY data type in Parquet? #222

Open paleolimbot opened 2 months ago

paleolimbot commented 2 months ago

It looks like at least one contributor to the Parquet standard is interested in adding a first-class logical GEOMETRY type ( https://github.com/apache/parquet-format/pull/240 )! There's quite a bit of conversation on the PR (@jiayuasu has also been reviewing), but it seems like the motivation is to make it easier to implement geospatial support in iceberg ( https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI ) whilst retaining the ability to represent GeoParquet. One excerpt from the early comments:

Instead, it would be good if the proposal here can meet the requirement of both Iceberg and GeoParquet, or share the common stuff to make the conversion between Iceberg Parquet and GeoParquet lightweight. We do need advice from the GeoParquet community to make it possible.

The advantage of first-class data type support in Parquet is mostly that we can embed geo-specific column statistics at the page/rowgroup level: currently we use a combination of the optional bounding box column (and its embedded statistics) and file-level metadata (e.g., bbox, geometry types) to encode data-dependent information readers can use to skip reading. The "geometry is just another column with just another data type" model is also possibly easier to reconcile with non-spatial Parquet readers/writers (that aren't designed to look for data-dependent statistics and/or data types in file-level metadata).

The disadvantage is, of course, that we've spent several years negotiating the language here and have thrown our collective capital (fiscal and social) into promoting this way of representing geospatial data in Parquet, including support in major data warehouses, data engines, data producers, mainstream geospatial tools, and the OGC.

I am slightly concerned that the proposal in the Parquet repo will end up dumbed down to the extent that it would not be usable for us; however, I think that if we have somewhat unified feedback on the PR we can probably avoid that. There are some Parquet-specific encoding details that the Parquet community is better suited to handling, but at a high level we might be able to achieve something like:

I'd also personally like to see S2 and/or H3 coverings as an optional column statistic because they more elegantly solve the antimeridian/polar bbox problem and because adding them later might be hard (but totally optional).

We discussed this at the community meeting yesterday at length, but I think an issue thread is probably a better place to collect our thoughts on this!

cholmes commented 2 months ago

Sorry I missed the last meeting! Got called in for jury duty. But this all sounds like amazing progress - to me the biggest goal of GeoParquet was to bring great geo support to Parquet, and if Parquet were to have top-level support for geometries then we could just 'dissolve' GeoParquet.

I am slightly concerned that the proposal in the Parquet repo will end up dumbed down to the extent that it would not be usable for us; however, I think that if we have somewhat unified feedback on the PR we can probably avoid that.

+1 - that seems to me the ideal thing: to bring all the debate and choices we made into that PR so the core of Parquet works well for the most advanced geospatial use cases while being easy to use for anyone just dumping in some geo data. And +1 to all your suggestions, they sound good, though there's one I'm unsure on:

Column data encoded as WKB (preferably exactly as we've defined it here, or more strongly towards ISO maybe)

I am curious about this one. I would have guessed that the ideal parquet native geometry type would not be WKB, but something more columnar? Is the argument that using WKB in parquet will make it so readers are easier to write, and that it can use bbox in a more 'native' way for column statistics? To me the WKB was always a 'bridge' for readers, but the ideal always seemed like it'd be something that felt more like how Parquet does things, not just sticking a binary blob in there.

rouault commented 2 months ago

Is the argument that using WKB in parquet will make it so readers are easier to write, and that it can use bbox in a more 'native' way for column statistics?

I fail to see how a non-geoaware Parquet writer could generate column statistics from a WKB-encoded geometry column without having access to a WKB decoder (at least a partial one able to recognize where coordinate tuples are located to compute the bounding box), given there's no easy-to-access geometry envelope in regular WKB (contrary to PostGIS extended WKB or the GeoPackage blob, which incorporate one in the header preceding the WKB).

On the other hand WKB can express complex geometries (curves, collections, polyhedral surface) that may be difficult to express using "native" array constructs (everything is possible, but that might involve multiple cascaded arrays pointing at each other)
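To make the "partial decoder" idea above concrete, here is a hedged sketch of the minimal WKB awareness a writer would need. The function name is made up for illustration, and it only handles the Point case; a real implementation would also walk linestrings, polygons, and collections to fold every coordinate tuple into the box.

```python
# Hypothetical sketch: a *partial* WKB decoder that knows just enough to
# locate coordinates and turn them into a bounding box. Point-only here.
import struct

def wkb_point_bbox(wkb: bytes):
    # Byte 0: byte-order flag (1 = little-endian, 0 = big-endian)
    byte_order = "<" if wkb[0] == 1 else ">"
    # Bytes 1-4: uint32 geometry type (1 = Point in ISO WKB)
    (geom_type,) = struct.unpack(byte_order + "I", wkb[1:5])
    if geom_type != 1:
        raise NotImplementedError("only Point is handled in this sketch")
    # Bytes 5-20: two doubles (x, y)
    x, y = struct.unpack(byte_order + "2d", wkb[5:21])
    return (x, y, x, y)  # (xmin, ymin, xmax, ymax)
```

The point is that this requires far less than a full geometry library, but it is still strictly more than a geo-unaware writer does today.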

cholmes commented 2 months ago

I fail to see how a non-geoaware Parquet writer could generate column statistics from a WKB encoded geometry column without having access to a WKB decoder (at least a partial one able to recognize where coordinate tuples are located to compute the bounding box)

I was thinking something more like the envelope would just be stored with it. Like instead of 'bbox' as a separate column it's just somehow embedded in the 'geometry' data structure. I don't know enough about Parquet internals to know if that's possible, and I'm just shooting in the dark - having a native columnar geometry structure in Parquet would make more sense to me.

paleolimbot commented 2 months ago

I would have guessed that the ideal parquet native geometry type would not be WKB, but something more columnar?

A byte_array/blob column (e.g., simple storage but requires parsing) and a nested list are both complicated in different ways (e.g., not all Parquet readers support nested things). It seems like WKB is something that is easy to agree on as an initial encoding (there is not as much precedent for the columnar encodings in practice).

I fail to see how a non-geoaware Parquet writer could generate column statistics from a WKB encoded geometry column without having access to a WKB decoder (at least a partial one able to recognize where coordinate tuples are located to compute the bounding box)

I think it might be possible to get a WKB reader into parquet-cpp that generates the bounding box statistics on write (I'm happy to take a stab at it); however, I am not sure we have access to a JSON parser at that point in the code and thus its ability to be CRS-aware (e.g., deal with the antimeridian) might be limited. It may also be that this (plus other proposed changes in response to some other formats announced recently) is the push that parquet-cpp needs to inject some extensibility into the read/write process or reclaim its independence from Arrow C++.

paleolimbot commented 2 months ago

On the other hand WKB can express complex geometries

I am not sure that there is any intention to ever support anything other than geometry types POINT--GEOMETRYCOLLECTION.

rouault commented 2 months ago

I think it might be possible to get a WKB reader into parquet-cpp that generates the bounding box statistics on write

https://github.com/OSGeo/gdal/blob/12dab86ca2d8b1a65c4c085e137c62294682ac1d/ogr/ogr_wkb.cpp#L590 could serve as a starting point as well

jiayuasu commented 2 months ago

I think if Parquet adds the bbox info as its native statistics and we use WKB as the encoding, this could be a time saver for GeoParquet and may increase the adoption of GeoParquet.

  1. We no longer need to develop a native encoding to handle all sorts of complex geometries, which would become a headache in the long term. Implementers of GeoParquet writers and readers can easily parse WKB and generate bbox statistics.
  2. The Parquet community will likely accept this bbox proposal because it has the potential to solve problems greater than geo (e.g., 3D objects, ...)
  3. WKB is widely adopted. There are tons of WKB readers/writers out there in many different languages.

I also support adding S2/H3 to Parquet but not sure if this will actually make it.

paleolimbot commented 2 months ago

If the WKB encoding is embedded at the same level as (e.g.) the compression, there is a chance that the reader (or a custom geo-specific one) can decode this directly into the columnar representation. Many types in parquet are encoded differently than their Arrow representation (e.g, using run-length encoding).

I also support adding S2/H3 to Parquet but not sure if this will actually make it.

I had forgotten about column metadata, which is arbitrary key/value storage at the row group -> column level. I think that might be a better place for S2/H3 coverings. We'd have to expose it from Parquet C++ (I don't know if it's exposed from Java or Rust).

wgtmac commented 2 months ago

Sorry for chiming in late. From all the comments here and from https://github.com/apache/parquet-format/pull/240, I think I need to clarify what parquet can and cannot do with a new geometry type.

  1. The new Geometry logical type is just an annotation. It can simply use BYTE_ARRAY physical type for geometry column and store values in WKB. It can also annotate complex types similar to GeoArrow, for example, 2D-POINT might use struct<x:double,y:double> and 2D-LINESTRING might use list<struct<x:double,y:double>>. (Sorry, parquet does not yet have fixed_size_list type and it is under discussion now: https://github.com/apache/parquet-format/pull/241).
  2. A specific logical type in parquet can carry some attributes. For example, the decimal logical type requires precision and scale, and the timestamp logical type requires timezone info and precision. Likewise, the geometry logical type can also store some essential information. However, in order not to duplicate column metadata from GeoParquet, I think the only attribute that parquet requires is the dimension of the geometry. This is required to create a bounding box on it. CMIW, attributes like crs, edges, orientation are informative from the perspective of parquet, and they can be fetched from the column metadata of GeoParquet. So I just want to add a dimension enum (with values xy, xyz, xym and xyzm) to the attributes of the geometry type.
  3. The benefit of a parquet native geometry type is that we can provide column statistics at the page level and row group level. (Sorry, no file-level stats yet). For a geometry stored in the BYTE_ARRAY physical type, the parquet implementation does have to be aware of the WKB encoding and calculate the bbox of each page and row group. For other GeoArrow-like native encoding types, parquet can collect min/max values of each coordinate axis automatically.

I'd also personally like to see S2 and/or H3 coverings as an optional column statistic

Yes, that's a good suggestion. The current limitation of parquet is that each column can have only one type of column statistics. This means we can choose only one (via writer options) among bbox, S2, or H3, not all of them. I agree that geometry_types from GeoParquet can also be a kind of column statistics.

I had forgotten about column metadata, which is arbitrary key/value storage at the row group -> column level.

In terms of column metadata, GeoParquet currently stores all of it in the key-value metadata in the parquet footer. I'm not sure if it sounds good to add a key-value metadata field to the geometry logical type to offload some of it there. This might look like the following:

enum Dimension {
  XY = 0;
  XYZ = 1;
  XYM = 2;
  XYZM = 3;
}

struct GeometryType {
  // If missing, it is XY.
  1: optional Dimension dimension;
  // Offload GeoParquet column metadata to here?
  2: optional string metadata;
}

cholmes commented 2 months ago

I'm not sure if it sounds good to add a key-value metadata to the geometry logical type to offload some to there.

I'd lean towards putting as much of the geoparquet column-level metadata as we can into the geometry logical type. I think it would help the compatibility between iceberg geospatial support and good geospatial parquet, as it sounds like the core iceberg committers are not into the idea of trying to write 'special' metadata for geo just for parquet. They prefer to just store it at the table format level, but the downside is that it would require readers to go 'through' iceberg for geo, without being able to read the geo information from the lower level. I fully sympathize with their reasoning, but I think we as a larger community can do better.

One of the main reasons is that they also need to write to ORC and Avro, which also don't support geospatial. But ideally we get the same metadata values into them - awesome work @wgtmac on https://github.com/apache/orc-format/pull/18! I do think a secondary design goal of geoparquet was to align on good field names and values to easily add 'geo' to other formats.

@wgtmac - it'd be great to chat in realtime about your work and figure out how our GeoParquet community can help and support you. We have bi-weekly meetings every other Monday - the next one is June 3rd at 10am Pacific time - or I think we'd be happy to find another time.

paleolimbot commented 2 months ago

So I just want to add a dimension enum (with values xy, xyz, xym and xyzm) to the attributes of the geometry type.

I am actually not sure this is possible (at the file level), nor is it relevant for WKB (which can store XY geometries alongside XYZ geometries). The dimensions (and/or geometry types, and/or the unique combinations of both) are more like column statistics (i.e., which are present). It is, of course, useful to know at the type level whether the geometry type and/or dimensions are constant. The bbox could always be 8 numbers (min, max / x, y, z, m) and set to nan or inf/-inf if they are empty/don't apply.
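The fixed 8-number bbox idea above can be sketched as follows. This is a hedged illustration under the assumptions in the comment (slots for min/max of x, y, z, m, with inf/-inf sentinels for dimensions that don't apply); the function names are invented for the example:

```python
# Sketch: a fixed-shape bounding box of 8 values
# [xmin, ymin, zmin, mmin, xmax, ymax, zmax, mmax].
import math

def empty_bbox():
    # Mins start at +inf and maxes at -inf so any real coordinate updates
    # them; dimensions absent from the data simply keep these sentinels.
    return [math.inf] * 4 + [-math.inf] * 4

def update_bbox(bbox, x, y, z=None, m=None):
    for i, c in enumerate((x, y, z, m)):
        if c is not None:
            bbox[i] = min(bbox[i], c)          # slots 0-3: mins
            bbox[4 + i] = max(bbox[4 + i], c)  # slots 4-7: maxes
    return bbox
```

A nice property of this shape is that merging page-level boxes into a row-group-level box is just an elementwise min/max, regardless of which dimensions the data actually uses.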

I'd lean towards putting as much of the geoparquet column level metadata as we can into geometry logical type

I agree here...column metadata (as I understand it) is duplicated for each row group and not all readers implement it (again, I could be mistaken here). To the extent that we want to keep file level metadata around, I think we just want it to help readers that don't support the GEOMETRY logical type or care deeply about the primary geometry column name.

I'd also personally like to see S2 and/or H3 coverings as an optional column statistic

I think this is a battle for another day 🙂

cholmes commented 2 months ago

To the extent that we want to keep file level metadata around, I think we just want it to help readers that don't support the GEOMETRY logical type or care deeply about the primary geometry column name.

Yeah, that primary column attribute is the only thing that's maybe useful. And afaik it's really just a concern for geospatial readers that need to know which geometry to render if there are multiple columns. So it doesn't seem worth trying to get that metadata into Parquet, unless it fits nicely somewhere. It still has some value - it would be pretty funny if we ended up keeping the geoparquet spec as literally one file-level field just to communicate the primary column name, but it doesn't seem crazy.

wgtmac commented 2 months ago

I'd lean towards putting as much of the geoparquet column level metadata as we can into geometry logical type.

Sounds good! But I'd like to mention that a parquet logical type is only suitable for storing static attributes like crs, edges, etc. It is not suitable for storing bbox or other statistics. It is awkward that parquet does not have any file-level statistics, which is different from ORC. For now we can only store file-level stats in the footer anyway.

We have bi-weekly meetings every other monday - the next one is June 3rd at 10am pacific time

I can try my best to join but it seems too late in UTC+8.

The dimensions (and/or geometry types, and/or the unique combinations of both) are more like column statistics (i.e., which are present). It is, of course, useful to know at the type level whether the geometry type and/or dimensions are constant. The bbox could always be 8 numbers (min, max / x, y, z, m) and set to nan or inf/-inf if they are empty/don't apply.

Thanks for the suggestion! I had thought the dimension should be constant across values in the same column.

I'd also personally like to see S2 and/or H3 coverings as an optional column statistic

On second thought, S2 and H3 coverings do not fit the model of statistics in parquet, which requires min and max pairs. We might need to extend the parquet ColumnChunk thrift object with a geo-specific stats field to support them.

Again, thanks for all the feedback! I will modify the PR to adopt all the good suggestions.

jorisvandenbossche commented 2 months ago

On a historical note, since the question came up in the community meeting last week about why we didn't take the path of a first-class geometry type in the past: apart from some (potentially misguided) reasons about thinking the Parquet community might not be interested, the main reason was that it was the easier path for adoption and usage. And personally, I think the success of GeoParquet up to now is in large part because we did it outside of the core Parquet file format, with it being just a Parquet file any implementation can read, plus some standardized metadata. Because of that, we could use existing implementations to quickly experiment and get things started (e.g., geopandas could use just the python bindings of parquet-cpp, and GDAL didn't have to wait on new features in parquet-cpp).

Now that there is a community around GeoParquet, it makes sense to re-evaluate that path, and I find the idea of a native geometry type definitely interesting and exciting. But I do want to point out it is still a trade-off; there are still downsides as well. As already mentioned, it means that the low-level parquet writer implementation needs to be geo-aware (some minimal parsing of WKB to get the bbox). With the existing community around GeoParquet, this might now be easier to justify including in Parquet implementations compared to a few years ago, though. But the reader implementation will also need to be taught about the bbox statistics type to be able to profit from the native geometry type (in the case of Parquet C++, this means not only libparquet itself but also the Acero query engine will need to know about that concept). Historically, the variety of Parquet implementations have shown to not always be quickly up to date with the latest format additions. I think this also means that for several years, we will have to recommend not using this new feature for wide compatibility of your geoparquet files: for the implementations that were checked (cpp, java, rs), current versions error when seeing an unknown logical type instead of falling back to the physical type, which could otherwise have made this addition backwards compatible. For several years, we will also have to duplicate the metadata in both the geometry type and the current file-level metadata "geo" key, for compatibility.

There are of course also very clear advantages, as have already been discussed above. In addition, a native type will be a good motivation to actually push some of the capabilities into the lower-level parquet implementations, avoiding some headaches with having to add metadata after the fact.


I'd lean towards putting as much of the geoparquet column level metadata as we can into geometry logical type.

Another +1 to that. I think ideally we would put everything from the geoparquet spec that applies to a single column and is not data-dependent (like bbox) into the logical type information. And I think as much of this information as possible can go into a single "metadata" field (which would make it easier to still evolve this somewhat, like adding a new key).

wgtmac commented 2 months ago

Update: I have simplified the proposal: https://github.com/apache/parquet-format/pull/240. Please let me know what you think. Thanks!

kylebarron commented 2 months ago

I think this also means that for several years, we will have to recommend not to use this new feature for wide compatibility of your geoparquet files

For what it's worth, this is my main worry, but maybe the long term gains are worth the near-term incompatibility hurdles?

paleolimbot commented 2 months ago

I think this also means that for several years, we will have to recommend not to use this new feature for wide compatibility of your geoparquet files

If we are able to get implementations out relatively quickly I think we can probably avoid this (I'm certainly game to help with the C++ implementation). Some of the relevant database vendors might be pinning old versions of Parquet readers/Arrow implementations; however, I am not sure if they were the ones hopping on the bandwagon to support file-level metadata either.

jorisvandenbossche commented 1 month ago

From https://github.com/opengeospatial/geoparquet/issues/222#issuecomment-2128396906

  1. We no longer need to develop a native encoding to handle all sorts of complex geometries, which would become a headache in the long term. Implementers of GeoParquet writers and readers can easily parse WKB and generate bbox statistics.

We could also turn this argument around: let's focus on finding consensus around a native encoding for mixed geometries, and then this type can simply use the existing numeric min/max statistics, without the need to add a new GeometryStatistics to Parquet (which the low-level Parquet readers/writers would have to learn about). That would make a first-class geometry type less pressing (at least for the statistics support; there are of course also other reasons for one).

paleolimbot commented 1 month ago

I agree that we should find a good way to represent mixed geometries in a geoarrow-ish way (and use it in Parquet if it makes sense to!). I also think that the effort required to do that, and to spread an implementation of it into the wider geospatial world so that encoding/decoding is supported everywhere, is much greater than the effort required to implement a WKB parser in a few Parquet readers. (It is common for other types like integers or floating point numbers to have alternative encodings in the Parquet file and be parsed into other structures on the fly, too.)

wgtmac commented 1 month ago

@jorisvandenbossche @paleolimbot I agree that finding a good way to represent mixed geometries natively is a good idea. However, I don't think this means we can get away without a WKB parser in the parquet implementations. WKB-encoded geometry has been widely used in many projects and would be straightforward for them to adopt. We also want to support Apache Iceberg, which encodes geometry using exactly WKB. Any parquet implementation that does not understand WKB could simply disable GeometryStatistics and use the binary data transparently.