opengeospatial / geoparquet

Specification for storing geospatial vector data (point, line, polygon) in Parquet
https://geoparquet.org
Apache License 2.0
828 stars 56 forks source link

Mixed concerns: Encoding + Geometry Type #207

Closed m-mohr closed 5 months ago

m-mohr commented 6 months ago

I stumbled across the new geometry types mentioned in the encoding. It seems the encoding is GeoArrow, but why are the concerns mixed here?

Why are geometry types in the encoding and that happens with the geometry type then? Are they alway the same for GeoArrow?

From an external perspective I'd have expected something like this: endocing = geoarrow geometry_types = [Point]

Also, the schema allows for example: encoding = point geometry_types = [Polygon]

The empty array for geometry_types also doesn't make sense for GeoArrow encoding.

In case of geoarrow the geometry_types can have a maximum of one array item, but that could be encoded...

(Without reading the spec, I also can't guess from the value that it's a GeoArrow encoding. Might make things just a little simpler to grasp by default.)

Wouldn't this also be more future-proof with regards to https://github.com/apache/parquet-format/pull/44 ?

jorisvandenbossche commented 6 months ago

There was quite some discussion about this on the PR https://github.com/opengeospatial/geoparquet/pull/189 (see for example the thread above and below this comment: https://github.com/opengeospatial/geoparquet/pull/189#discussion_r1473059702). The PR initially started with an "encoding": "geoarrow", and the original issue (https://github.com/opengeospatial/geoparquet/issues/185) also mentioned the option of combining this with the existing geometry_types key.

But in the end we moved away of this tight coupling with geoarrow for the naming in the spec here, although we are still using a subset of the geoarrow specification. (from the top of my head, some reasons: this allows deviating in the future (eg adding a struct-based (sparse union) geometrycollection type that might not match exactly with something from GeoArrow), we are using a subset of the GeoArrow specification anyway (not all things allowed in GeoArrow are allowed here), and the name is also not necessarily relevant for all Parquet implementations (they cam have nothing to do with Arrow)

The format spec itself is still very clear about this being based on GeoArrow: https://github.com/opengeospatial/geoparquet/blob/main/format-specs/geoparquet.md#native-encodings-based-on-geoarrow (and yes, that means you might need to read the spec when encountering a file with such a column, but IMO that would still be the case anyway even if the encoding said "geoarrow").

It's true that the geometry_types key is not very useful in this specific case, but it should of course exactly match the type in the data, so something like encoding = point, geometry_types = [Polygon] is invalid (that already follows from the current spec saying this field should match with the data, but we could also be more explicit and mention in the native encodings section that geometry_type then is always a length-1 list)

jorisvandenbossche commented 6 months ago

The format spec itself is still very clear about this being based on GeoArrow: https://github.com/opengeospatial/geoparquet/blob/main/format-specs/geoparquet.md#native-encodings-based-on-geoarrow

Although I see we only link to the GeoArrow document with the names, not the the document with the memory layouts (https://geoarrow.org/format.html#native-encoding). That's something we should improve.

kylebarron commented 6 months ago

It's true that the geometry_types key is not very useful in this specific case

It could be useful to know in some circumstances that the encoding is multi polygon but the column includes both polygons and multi polygons

paleolimbot commented 6 months ago

I think that it is impossible not to mix some concerns with the single geometry-type encodings...the solution we settled on does have some overlap between the encoding name and the geometry type, but avoids mixing some other concerns and concepts to more accurately convey the relationships among the single-geometry layouts, Parquet and Arrow, for example.

It could be useful to know in some circumstances that the encoding is multi polygon but the column includes both polygons and multi polygons

I was under the impression that any features in a "multipolygon" encoding were all multipolygons (even if they only contained one element). This would be consistent with something like GDAL's reporting a layer's geometry type. We should probably make that explicit in the spec language if it is currently ambiguous.

It's true that the geometry_types key is not very useful in this specific case,

Technically it will tell you if you have Z coordinates or not (because the encoding key does not contain information about dimensions but the geometry_types key does). Leaving it empty would be a strange thing to do and I don't think it will be a problem for writers to get this part right; however, it is also not difficult (and likely safer, since it needs to check the memory layout anyway) for readers to fill in this piece of information themselves.

jiayuasu commented 6 months ago

In the future, GeoParquet should really support mixed geometry types in the same column. Sedona community will propose some solutions soon.

For example, in the lates release of Overture Maps data (https://docs.overturemaps.org/schema/), base/water has LineString type water (rivers) and Polygon type water (lakes). This is not possible in the new GeoParquet encoding.

cholmes commented 5 months ago

But in the end we moved away of this tight coupling with geoarrow for the naming in the spec here, although we are still using a subset of the geoarrow specification. (from the top of my head, some reasons: this allows deviating in the future (eg adding a struct-based (sparse union) geometrycollection type that might not match exactly with something from GeoArrow), we are using a subset of the GeoArrow specification anyway (not all things allowed in GeoArrow are allowed here), and the name is also not necessarily relevant for all Parquet implementations (they cam have nothing to do with Arrow)

I think it may be too late, as we've already got some implementations released with this, but I was also wondering why we didn't put some 'prefix' on the encodings. Like if not geoarrow.point then at least like columnar.point or something (there's likely some better name). Or a new field for 'columnar_geometry_type' and have the encoding just be 'columnar' and 'wkb'.

Someone recently pointed out to me that there's potentially more efficient geometry encoding techniques, like 'zigzag encoding coordinate deltas in varints', and I had been thinking that geoparquet spec is 'open' to having a newer encoding, but the current arrow ones seem to take up the 'default namespace'. Like we could add some cool new encoding, and they'll likely have the same geometry types. I don't think it's a huge deal to have like new_encoding.point, but the latest names seem to imply that they're the 'right' way to do things, and then they also sorta 'hide' WKB, since there's a big list of arrow ones right next to the single 'wkb' name.

m-mohr commented 5 months ago

That's the risk of implementing an unreleased spec, I'd say... It should still be possible to make a change.

jorisvandenbossche commented 5 months ago

Someone recently pointed out to me that there's potentially more efficient geometry encoding techniques, like 'zigzag encoding coordinate deltas in varints',

While we didn't add prefixes now, that doesn't mean we can't add suffixes later if we want to add other encodings, like "point_delta" (or at that point add a prefix like "delta.point")

but the latest names seem to imply that they're the 'right' way to do things, and then they also sorta 'hide' WKB, since there's a big list of arrow ones right next to the single 'wkb' name.

I wouldn't say the "right" way, but I think it is the simplest way. I personally think it is fine to use the implicit "default" namespace for that. About hiding the single "wkb", that's seems is something we can solve with better rephrasing in the documentation (also renaming them with a prefix still requires to list all options)


While I am personally happy with what we have right now, I agree we should still allow ourselves to change things (although if we think we want to do that, I wouldn't wait too long with that, so GDAL/geopandas can be updated to follow this as fast as possible). Both in geopandas and GDAL I think this is opt-in, and not used by default (well, for this topic; the also unreleased bbox covering column is written by default in GDAL)

(that's generally the trade-off we have with the desire to have some implementations to test things before calling the 1.1 spec final, and then actually thinking to change things before doing that ..)

jorisvandenbossche commented 5 months ago

have the encoding just be 'columnar'

FWIW I also want to clarify that I find using "columnar" in this context is ambiguous / confusing. When using "columnar" in the context of Parquet as a "columnar file format" or Arrow's "Columnar (in-memory) Format" specification, everything we discuss here is columnar. Also with the WKB encoding, all those WKB values in the geometry column are stored in a "columnar" fashion (all the binary blobs of the WKB values are stored together in one buffer).

jwass commented 5 months ago

Someone recently pointed out to me that there's potentially more efficient geometry encoding techniques, like 'zigzag encoding coordinate deltas in varints

Somewhat tangential - I've been pretty eager to test out new encodings like the varint/delta/zigzag Chris mentioned (used in OSM PBF) and also Parquet's native bit-packed integer encodings as well. Although these would also require x,y columns to be integers not doubles - which I thought might be a non-starter

paleolimbot commented 5 months ago

zigzag encoding coordinate deltas in varints

I am assuming that's from/inspired by twkb? https://github.com/TWKB/Specification/blob/master/twkb.md#zigzag-encode . It would be derailing this thread to talk about that, but I think that would undo the main benefit of this encoding which is that no Geo-specific parsing is required.

That's the risk of implementing an unreleased spec

This particular change was an open PR for months that had some pretty significant discussion around how to name the encodings. It was not easy to come up with a consensus...I don't think the result is perfect, but I also don't think it has to perfectly separate concerns (just be specific enough that implementations are able to work with this, which it seems that they are). If/when a better encoding comes along, the name can be updated.

jwass commented 5 months ago

I am assuming that's from/inspired by twkb? https://github.com/TWKB/Specification/blob/master/twkb.md#zigzag-encode . It would be derailing this thread to talk about that, but I think that would undo the main benefit of this encoding which is that no Geo-specific parsing is required.

I agree and don't want to derail the conversation here. However if we could enable x and y to be fixed precision decimal we could utilize parquet native integer encodings (e.g. delta binary packed) and not need geo-specific parsing to read the coordinates properly... but let's leave that for another discussion :)

cholmes commented 5 months ago

This particular change was an open PR for months that had some pretty significant discussion around how to name the encodings. It was not easy to come up with a consensus...I don't think the result is perfect, but I also don't think it has to perfectly separate concerns (just be specific enough that implementations are able to work with this, which it seems that they are).

Yeah, I felt bad that I didn't read it closely enough / didn't think about there being new potential encodings.

If/when a better encoding comes along, the name can be updated.

Ah, that's a good point - I was sorta thinking we'd be 'stuck' with these names. But I guess the client logic would just have to be that at version 1.4 or whatever it would need to check the version number.

While we didn't add prefixes now, that doesn't mean we can't add suffixes later if we want to add other encodings, like "point_delta" (or at that point add a prefix like "delta.point")

Cool, I like this idea.

About hiding the single "wkb", that's seems is something we can solve with better rephrasing in the documentation (also renaming them with a prefix still requires to list all options)

Yeah, that sounds good. I'll look into tweaks that may make it clearer. It's good in the full description (one of the geometry types, or WKB), but in the single line it just looks like one of many options.

I personally think it is fine to use the implicit "default" namespace for that.

Sounds great. Thanks to everyone for sounding in, and I agree. I definitely didn't want to try to be pushing for a change at this late time. I mostly just wanted to be sure we had a path to other encodings, and weren't backing ourselves into a corner with declaring this one. But it sounds like we have lots of options. And it's good to have the discussion, to point at in the future if/when people want to propose other encodings.

cholmes commented 5 months ago

have the encoding just be 'columnar'

FWIW I also want to clarify that I find using "columnar" in this context is ambiguous / confusing. When using "columnar" in the context of Parquet as a "columnar file format" or Arrow's "Columnar (in-memory) Format" specification, everything we discuss here is columnar.

Yeah, I was pretty sure that was a poor name - I just didn't want to spend a lot of time crafting an 'ideal' name when all I was trying to do was to make the point. If we had overwhelming enthusiasm to change (which to be clear I personally don't) then we could figure out the 'right' name.

cholmes commented 5 months ago

Closing this issue, as it looks like it was well discussed. There were some interesting things in this discussion, like https://github.com/opengeospatial/geoparquet/issues/207#issuecomment-2081787217 but we should just have cleaner issues / PR's.