I'm not a JSON schema expert, but would we be able to make this conditionally required? It looks like `dependentRequired` meets what we need, though I don't know what version of JSON schema we're pinned to.
It might be nice to also add a few basic tests for the JSON schema validation, similar to how https://github.com/opengeospatial/geoparquet/pull/191 is doing that.
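For reference, a minimal sketch of how `dependentRequired` behaves, assuming the `jsonschema` package and a schema draft that supports it (2019-09 or later); the property names below are purely illustrative, not the actual GeoParquet schema:

```python
from jsonschema import Draft202012Validator

# If "encoding" is present, "geometry_types" must be present too.
schema = {
    "type": "object",
    "dependentRequired": {"encoding": ["geometry_types"]},
}
validator = Draft202012Validator(schema)

# One error: 'geometry_types' is a dependency of 'encoding'.
print([e.message for e in validator.iter_errors({"encoding": "point"})])

# No errors once the dependent property is supplied.
print([e.message for e in validator.iter_errors(
    {"encoding": "point", "geometry_types": ["Point"]}
)])
```

Note that `dependentRequired` keys on a property's presence, not its value; a value-dependent constraint would need `if`/`then` instead.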
I'm not a schema expert either! I'm happy to take a stab at that, but maybe it would be best to merge the text and follow up with JSON schema tests? (I can do it here, too, but it may take me a bit to get there.)
> So my recommendation would be to take out most references to geoarrow from this PR
Done! The values for `"encoding"` are now `"point"`, `"linestring"`, `"polygon"`, `"multipoint"`, `"multilinestring"` and `"multipolygon"`. This more accurately reflects the purpose of the encodings and reduces confusion regarding Arrow/Parquet. We could alternatively limit this list to `"point"` and see how it goes, since representing points is the main place where the current performance of GeoParquet is limiting uptake.
Does this mean that a geometry column with mixed geometry types cannot be encoded as GeoArrow?
The encoding of mixed geometries is independent of this PR...if the GeoParquet community finds a useful way to represent mixed geometries and can demonstrate a performance or usability benefit, another encoding can be added! This frees GeoArrow to represent things in the way that makes sense there (e.g., using unions) and GeoParquet to represent things in the way that makes sense in the serialized / on-disk / column-statistics-are-important world.
I'll update the draft implementation shortly to reflect the changes in this PR!
I also updated https://github.com/geoarrow/geoarrow-python/pull/41 (full implementation of the language here) to reflect the new language.
@paleolimbot I pushed a small commit more clearly separating the explanations of the WKB and geoarrow-like encodings into subsections of the "encoding" section (just moving some things around, no actual text change).
In doing so, I also moved the mention that WKB should use BYTE_ARRAY into that subsection. I will push another commit showing an example Parquet type for the point geometry.
Are there (small) Parquet datasets somewhere using this new GeoArrow encoding? I now realize that the current support for GeoArrow in GDAL uses the `FixedSizeList[2]` encoding and not the struct one... so I need to make some changes and check interoperability.
I believe https://github.com/geoarrow/geoarrow-python/pull/41 is up to date (I will check tomorrow and generate some test files)
Ok! I checked and it does seem to be generating files as described here. Here is a short snippet that should be able to generate/read them:
```python
# git clone https://github.com/paleolimbot/geoarrow-python.git
# git switch geoparquet-geoarrow
# pip install geoarrow-pyarrow/
import pyarrow as pa
import geoarrow.pyarrow as ga
from geoarrow.pyarrow import io


def convert_to_geoparquet(src, dst, geometry_encoding=io.geoparquet_encoding_geoarrow()):
    tab = io.read_pyogrio_table(src)
    io.write_geoparquet_table(tab, dst, geometry_encoding=geometry_encoding)


def make_test_geoparquet(src, dst, geometry_encoding=io.geoparquet_encoding_geoarrow()):
    array = ga.array(src)
    tab = pa.table([array], names=["geometry"])
    io.write_geoparquet_table(tab, dst, geometry_encoding=geometry_encoding)


# Example convert
# More files: https://geoarrow.org/data
# There is a slight bug in these...the last element was supposed to be NULL for the examples
# but because sf in R doesn't support them, they didn't end up being written that way. They
# also have explicit planar CRSs because of sf.
convert_to_geoparquet(
    "https://raw.githubusercontent.com/geoarrow/geoarrow-data/v0.1.0/example/example-linestring.gpkg",
    "example-linestring.parquet"
)

# Example read
tab = io.read_geoparquet_table("example-linestring.parquet")
tab["geom"].type.crs
tab["geom"][0].to_shapely()
ga.to_geopandas(tab["geom"])  # Doesn't work with nulls yet

# Example create
make_test_geoparquet(["POINT (0 1)", "POINT (30 10)", None], "test.parquet")

# Example read
tab = io.read_geoparquet_table("test.parquet")
```
...and here are a few pre-baked examples:
@paleolimbot I believe there's a slight non-conformity with the schema of your samples attached to https://github.com/opengeospatial/geoparquet/pull/189#issuecomment-2020497791. The x/y fields of the struct are currently marked as optional, whereas the spec requires them to be required. Cf.

```
$ ~/arrow/cpp/build/release/parquet-dump-schema examples/point-geoarrow.parquet
required group field_id=-1 schema {
  optional int32 field_id=-1 row_number;
  optional group field_id=-1 geom {
    optional double field_id=-1 x;
    optional double field_id=-1 y;
  }
}
```
While we should provide some example data that fully follows the recommendations, those files are compliant in the sense that we recommend marking those fields as `required`, but that's not required (`optional` is fine as well):

> There MUST NOT be any null values in the child fields and the x/y/z coordinate fields. Only the outer optional "geometry" group is allowed to have nulls (i.e. representing a missing geometry). This MAY be indicated in the Parquet schema by using `required` group elements, as in the example above, but this is not required and `optional` fields are permitted (as long as the data itself does not contain any nulls).
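For anyone generating conforming files with pyarrow, a minimal sketch of the recommended nullability (field names here just mirror the dump above): only the outer geometry column stays nullable, while the coordinate fields are written as `required`.

```python
import pyarrow as pa

# Coordinate fields marked non-nullable (written as `required` in Parquet);
# the outer "geom" column stays nullable to represent missing geometries.
coord = pa.struct([
    pa.field("x", pa.float64(), nullable=False),
    pa.field("y", pa.float64(), nullable=False),
])
schema = pa.schema([
    pa.field("row_number", pa.int32()),
    pa.field("geom", coord, nullable=True),
])
```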
> those files are compliant in the sense that we recommend marking those fields as `required`, but that's not required (`optional` is fine as well)
ah ok, missed that. Sorry for the noise.
Noise welcome! There are almost certainly other nonconformities too (but hopefully these are better than nothing to get you started).
We should probably also provide a version of the included `example.parquet` in geoarrow encoding, as well as for `nz-building-outlines.parquet`.
For geoarrow-encoding, given that this is geometry type specific, it might make sense to include example files in this repo for all types. Or update the geoarrow-data repo to also include Parquet versions of the files (currently I think it hosts only .arrow files?)
> Or update the geoarrow-data repo to also include Parquet versions of the files (currently I think it hosts only .arrow files?)
Definitely! (And also fix the existing data).
Closes #185. See draft implementation at https://github.com/geoarrow/geoarrow-python/pull/41 (includes example of reading an arbitrary file using GDAL and writing to geoarrow-encoded GeoParquet if anybody would like to try with arbitrary data).
As discussed in early versions of this PR and #185, this adds the option for `"encoding": "(point|linestring|polygon|multipoint|multilinestring|multipolygon)"`. This emphasizes that these encodings are intended for efficient encoding of single-type geometry datasets. The notable advantages are (1) column statistics and (2) serialize/deserialize speed (no WKB encoding/decoding needed). The types used are also types that most systems understand natively (list, struct, double), and many systems should be able to work with the files without needing any geometry support whatsoever (e.g., you could use duckdb to group by + summarize to compute bounding boxes, as sketched below).
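To illustrate that last point, a hypothetical sketch assuming a recent `duckdb` Python package and the point-encoded `test.parquet` written by the snippet earlier in this thread (struct-typed `geometry` column): computing a bounding box needs no geometry support at all.

```python
import duckdb

# Plain aggregation over the struct fields; no spatial extension required.
bbox = duckdb.sql("""
    SELECT
        min(geometry.x) AS xmin, min(geometry.y) AS ymin,
        max(geometry.x) AS xmax, max(geometry.y) AS ymax
    FROM 'test.parquet'
""").fetchone()
print(bbox)  # (0.0, 1.0, 30.0, 10.0) for the two example points
```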
There have been a number of comments about the possibility of better compression using byte stream split encoding. I haven't tried this yet but can run some experiments.
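If anyone wants to try, here is a sketch of one way to run that experiment with pyarrow (file and column names are illustrative; `column_encoding` requires dictionary encoding to be disabled for the affected columns):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# A toy point column using the struct<x, y> layout described in this PR.
tab = pa.table({
    "geometry": pa.array(
        [{"x": 0.0, "y": 1.0}, {"x": 30.0, "y": 10.0}],
        type=pa.struct([("x", pa.float64()), ("y", pa.float64())]),
    )
})

pq.write_table(
    tab,
    "points-byte-stream-split.parquet",
    use_dictionary=False,
    column_encoding={
        "geometry.x": "BYTE_STREAM_SPLIT",
        "geometry.y": "BYTE_STREAM_SPLIT",
    },
)
```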
I also added a note to "compatible parquet"...admittedly fewer systems can write `struct(x, y)` than can write `x`, `y`, but it's in theory possible to do so. Unfortunately the memory layouts for `geoarrow.linestring` and `geoarrow.multipoint` / `geoarrow.polygon` and `geoarrow.multilinestring` overlap, so without metadata we either have to disallow them or just guess the higher dimension type. I put the "guess the higher dimension type" language in the spec...perhaps writers should prefer multilinestring/multipolygon over linestring/polygon to improve compatibility for readers without the ability to inspect metadata.
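A pure-pyarrow sketch of why those layouts collide (ignoring the field names the spec assigns to the nested lists):

```python
import pyarrow as pa

coord = pa.struct([("x", pa.float64()), ("y", pa.float64())])

# geoarrow.linestring and geoarrow.multipoint storage: a list of coordinates.
linestring = pa.list_(coord)
multipoint = pa.list_(coord)
print(linestring == multipoint)  # True: indistinguishable without metadata

# geoarrow.polygon (list of rings) and geoarrow.multilinestring (list of
# linestrings): both a list of lists of coordinates.
polygon = pa.list_(pa.list_(coord))
multilinestring = pa.list_(pa.list_(coord))
print(polygon == multilinestring)  # True
```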