opengeospatial / geoparquet

Specification for storing geospatial vector data (point, line, polygon) in Parquet
https://geoparquet.org
Apache License 2.0

Add GeoArrow encoding as an option to the specification #189

Closed paleolimbot closed 8 months ago

paleolimbot commented 1 year ago

Closes #185. See draft implementation at https://github.com/geoarrow/geoarrow-python/pull/41 (includes example of reading an arbitrary file using GDAL and writing to geoarrow-encoded GeoParquet if anybody would like to try with arbitrary data).

As discussed in early versions of this PR and #185, this adds the option for "encoding": "(point|linestring|polygon|multipoint|multilinestring|multipolygon)". This emphasizes that these encodings are intended as efficient representations of single-type geometry datasets.

The notable advantages are (1) column statistics and (2) serialize/deserialize speed (no WKB encoding/decoding needed). The types used (list, struct, double) are also types that most systems understand natively, and many systems should be able to work with the files without needing any geometry support whatsoever (e.g., you could use DuckDB to group by + summarize to compute bounding boxes).
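A stdlib-only sketch (not part of the PR) of the serialize/deserialize point: WKB wraps every 2D point in a byte-order flag and a geometry-type tag that readers must parse, whereas the geoarrow encoding stores the raw doubles directly as Parquet DOUBLE columns.

```python
import struct

def point_to_wkb(x, y):
    # little-endian flag (1) + geometry type 1 (Point) + two doubles
    return struct.pack("<BIdd", 1, 1, x, y)

def point_from_wkb(buf):
    # every read has to parse and validate the framing before
    # it can get at the coordinates
    byte_order, geom_type, x, y = struct.unpack("<BIdd", buf)
    assert byte_order == 1 and geom_type == 1
    return x, y

wkb = point_to_wkb(30.0, 10.0)
print(len(wkb))             # 21 bytes per point vs. 16 for two raw doubles
print(point_from_wkb(wkb))  # (30.0, 10.0)
```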

There have been a number of comments about the possibility for better compression using byte split encoding. I haven't tried this yet but can run some experiments.

I also added a note to "compatible parquet"...admittedly, fewer systems can write struct(x, y) than can write x, y, but it's in theory possible to do so. Unfortunately, the memory layouts for geoarrow.linestring and geoarrow.multipoint overlap, as do those for geoarrow.polygon and geoarrow.multilinestring, so without metadata we either have to disallow them or just guess the higher-dimension type. I put the "guess the higher-dimension type" language in the spec...perhaps writers should prefer multilinestring/multipolygon over linestring/polygon to improve compatibility for readers without the ability to inspect metadata.
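A stdlib-only sketch of the layout overlap: the same list-of-list-of-struct(x, y) values form a valid geoarrow.polygon (one ring) *and* a valid geoarrow.multilinestring (one linestring); only metadata can distinguish the two interpretations.

```python
# One nested coordinate list, physically identical under both readings
data = [[(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 0.0)]]

def inner_wkt(parts):
    # format each nested coordinate list as "(x y, x y, ...)"
    return ", ".join(
        "(" + ", ".join(f"{x:g} {y:g}" for x, y in part) + ")" for part in parts
    )

as_polygon = f"POLYGON ({inner_wkt(data)})"
as_multilinestring = f"MULTILINESTRING ({inner_wkt(data)})"
print(as_polygon)           # POLYGON ((0 0, 4 0, 4 4, 0 0))
print(as_multilinestring)   # MULTILINESTRING ((0 0, 4 0, 4 4, 0 0))
```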

paleolimbot commented 10 months ago

I'm not a json schema expert, but would we be able to make this conditionally required? It looks like dependentRequired meets what we need, though I don't know what version of json schema we're pinned to.

It might be nice to also add a few basic tests for the JSON schema validation, similar to how https://github.com/opengeospatial/geoparquet/pull/191 does that.

I'm not a schema expert either! I'm happy to take a stab at that, but maybe it would be best to merge the text and follow-up with JSON schema tests? (I can do it here, too, but it may take me a bit to get there).
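For reference, a stdlib-only illustration of what JSON Schema's `dependentRequired` keyword (available since draft 2019-09) does: if a trigger property is present, the listed properties become required. The field names here are hypothetical, not taken from the GeoParquet schema.

```python
def check_dependent_required(instance, dependent_required):
    # dependent_required maps a trigger property to the properties
    # that must also be present whenever the trigger appears
    errors = []
    for trigger, needed in dependent_required.items():
        if trigger in instance:
            for prop in needed:
                if prop not in instance:
                    errors.append(f"{prop!r} is required when {trigger!r} is present")
    return errors

deps = {"covering": ["bbox"]}  # hypothetical dependency, for illustration only
print(check_dependent_required({"covering": {}}, deps))
print(check_dependent_required({"covering": {}, "bbox": {}}, deps))  # []
```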

paleolimbot commented 9 months ago

So my recommendation would be to take out most references to geoarrow from this PR

Done! The values for "encoding" are now "point", "linestring", "polygon", "multipoint", "multilinestring" and "multipolygon". This more accurately reflects the purpose of the encodings and reduces confusion regarding Arrow/Parquet. We could alternatively limit this list to "point" and see how it goes, since representing points is the main place where the current performance of GeoParquet is limiting uptake.

Does this mean that geometry column with mixed types of geometries cannot be encoded as GeoArrow?

The encoding of mixed geometries is independent of this PR...if the GeoParquet community finds a useful way to represent mixed geometries and can demonstrate a performance or usability benefit, another encoding can be added! This frees GeoArrow to represent things in the ways that make sense there (e.g., using unions) and GeoParquet to represent things in the ways that make sense in the serialized/on-disk/column-statistics-are-important world.

I'll update the draft implementation shortly to reflect the changes in this PR!

paleolimbot commented 9 months ago

I also updated https://github.com/geoarrow/geoarrow-python/pull/41 (full implementation of the language here) to reflect the new language.

jorisvandenbossche commented 8 months ago

@paleolimbot I pushed a small commit more clearly separating the explanation of WKB and geoarrow-like encodings into subsections in the "encoding" section (just moving some things around, no actual text change).

That way I also moved the mention that WKB should use BYTE_ARRAY to that subsection as well. And will push another commit showing an example parquet type for the point geometry.

rouault commented 8 months ago

Are there (small) Parquet datasets somewhere using this new GeoArrow encoding? I now realize that the current support for GeoArrow in GDAL uses the FixedSizeList[2] encoding and not the struct one... So I need to make changes and check interoperability.

paleolimbot commented 8 months ago

I believe https://github.com/geoarrow/geoarrow-python/pull/41 is up to date (I will check tomorrow and generate some test files)

paleolimbot commented 8 months ago

Ok! I checked and it does seem to be generating files as described here. Here is a short snippet that should be able to generate/read them:

# git clone https://github.com/paleolimbot/geoarrow-python.git
# cd geoarrow-python
# git switch geoparquet-geoarrow
# pip install geoarrow-pyarrow/
import pyarrow as pa
import geoarrow.pyarrow as ga
from geoarrow.pyarrow import io

def convert_to_geoparquet(src, dst, geometry_encoding=io.geoparquet_encoding_geoarrow()):
    tab = io.read_pyogrio_table(src)
    io.write_geoparquet_table(tab, dst, geometry_encoding=geometry_encoding)

def make_test_geoparquet(src, dst, geometry_encoding=io.geoparquet_encoding_geoarrow()):
    array = ga.array(src)
    tab = pa.table([array], names=["geometry"])
    io.write_geoparquet_table(tab, dst, geometry_encoding=geometry_encoding)

# Example convert
# More files: https://geoarrow.org/data
# There is a slight bug in these...the last element was supposed to be NULL for the examples
# but because sf in R doesn't support them, they didn't end up being written that way. They
# also have explicit planar CRSs because of sf.
convert_to_geoparquet(
    "https://raw.githubusercontent.com/geoarrow/geoarrow-data/v0.1.0/example/example-linestring.gpkg",
    "example-linestring.parquet"
)

# Example read
tab = io.read_geoparquet_table("example-linestring.parquet")
tab["geom"].type.crs
tab["geom"][0].to_shapely()
ga.to_geopandas(tab["geom"])  # Doesn't work with nulls yet

# Example create
make_test_geoparquet(["POINT (0 1)", "POINT (30 10)", None], "test.parquet")

# Example read
tab = io.read_geoparquet_table("test.parquet")

...and here are a few pre-baked examples:

examples.zip

rouault commented 8 months ago

@paleolimbot I believe there's a slight non-conformity with the schema of your samples attached to https://github.com/opengeospatial/geoparquet/pull/189#issuecomment-2020497791. The x/y fields of the struct are currently marked as optional, whereas the spec requires them to be required. Cf.

$ ~/arrow/cpp/build/release/parquet-dump-schema examples/point-geoarrow.parquet
required group field_id=-1 schema {
  optional int32 field_id=-1 row_number;
  optional group field_id=-1 geom {
    optional double field_id=-1 x;
    optional double field_id=-1 y;
  }
}
jorisvandenbossche commented 8 months ago

While we should provide some example data that fully follows the recommendations, those files are still compliant: the spec recommends marking those fields as required, but does not require it (optional is fine as well):

There MUST NOT be any null values in the child fields and the x/y/z coordinate fields. Only the outer optional "geometry" group is allowed to have nulls (i.e representing a missing geometry). This MAY be indicated in the Parquet schema by using required group elements, as in the example above, but this is not required and optional fields are permitted (as long as the data itself does not contain any nulls).
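For contrast, a schema following that recommendation (hypothetical output in the same parquet-dump-schema format as above) would mark only the outer geometry group as optional and the coordinate fields as required:

```
required group field_id=-1 schema {
  optional int32 field_id=-1 row_number;
  optional group field_id=-1 geom {
    required double field_id=-1 x;
    required double field_id=-1 y;
  }
}
```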

rouault commented 8 months ago

those files are compliant in the sense that we recommend marking those fields as required, but that's not required (optional is fine as well):

ah ok, missed that. Sorry for the noise.

paleolimbot commented 8 months ago

Noise welcome! There are almost certainly other nonconformities too (but hopefully these are better than nothing to get you started).

jorisvandenbossche commented 8 months ago

We should probably also provide a version of the included example.parquet in geoarrow encoding, as well as for the nz-building-outlines.parquet.

For geoarrow-encoding, given that this is geometry type specific, it might make sense to include example files in this repo for all types. Or update the geoarrow-data repo to also include Parquet versions of the files (currently I think it hosts only .arrow files?)

paleolimbot commented 8 months ago

Or update the geoarrow-data repo to also include Parquet versions of the files (currently I think it hosts only .arrow files?)

Definitely! (And also fix the existing data).