opengeospatial / geoparquet

Specification for storing geospatial vector data (point, line, polygon) in Parquet
https://geoparquet.org
Apache License 2.0

Recommendation on the Arrow-specific type for the WKB geometry column? #187

Closed rouault closed 11 months ago

rouault commented 1 year ago

The GeoParquet spec rightly specifies the type of geometry columns in terms of the Parquet type: "Geometry columns MUST be stored using the BYTE_ARRAY parquet type". Implementations using the Arrow library (typically from C++, but possibly from Python or other languages) might use the Arrow API for Parquet reading and writing, and thus not be directly exposed to Parquet types.

I recently realized that the GDAL implementation would only work for Arrow::Type::BINARY, but not for Arrow::Type::LARGE_BINARY. This has been addressed on the reading side of the driver for GDAL 3.8.0 in https://github.com/OSGeo/gdal/pull/8618.

I'm not entirely clear how the Arrow library maps a Parquet file whose row group has a WKB column with more than 2 GB of content. I assume that would be Arrow::Type::LARGE_BINARY?

So the question is whether the spec should give some hints to implementations using the Arrow library on:

- which Arrow type(s) they should use on the writing side
- which Arrow types they should expect on the reading side
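For concreteness, a minimal pyarrow sketch of the reading-side situation (the file name "example.parquet" and the column name "geometry" are placeholders): when going through the Arrow API, what you see is whatever Arrow type the layer above Parquet chose, not the BYTE_ARRAY physical type itself.

```python
import pyarrow.parquet as pq

# Read a GeoParquet file through the Arrow API rather than the raw Parquet API.
table = pq.read_table("example.parquet")
geom = table.column("geometry")

# Depending on how the file was written, this may report binary or large_binary,
# and the column usually comes back as a ChunkedArray with one or more chunks.
print(geom.type)
print(geom.num_chunks)
```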

paleolimbot commented 1 year ago

> which Arrow type(s) they should use on the writing side

I would say that it is perhaps good practice to write row groups such that the WKB column has chunks that fit into a (non-large) binary array, although I think that is difficult to guarantee (@jorisvandenbossche would know better than I would). pyarrow prefers to chunk arrays that would contain more than 2 GB of content rather than return large binary arrays, but the R bindings don't.

> which Arrow types they should expect on the reading side

I think it's possible to end up with binary, large binary, or fixed-size binary, all of which use a byte array in Parquet land. (I don't know whether it is desirable to allow all three of those, but they are all ways to represent binary.) GeoArrow doesn't mention the fixed-size option in its spec, and I think we're planning to keep it that way (or perhaps I'm forgetting a previous thread).
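One hedged sketch of how a reader might cope with all three possibilities, assuming pyarrow and a column named "geometry" (placeholder names); normalizing by casting to plain binary is just one option, and it only works when each chunk stays under 2 GB, which is the case for the chunked output of the Parquet reader:

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pq.read_table("example.parquet")
geom = table.column("geometry")

# Accept binary, large_binary, or fixed_size_binary and normalize to plain binary,
# so the downstream WKB parser only has to deal with a single Arrow type.
if pa.types.is_large_binary(geom.type) or pa.types.is_fixed_size_binary(geom.type):
    geom = geom.cast(pa.binary())
elif not pa.types.is_binary(geom.type):
    raise TypeError(f"unexpected Arrow type for WKB column: {geom.type}")

for chunk in geom.chunks:
    for scalar in chunk:
        if scalar.is_valid:
            wkb_bytes = scalar.as_py()  # raw WKB bytes, ready for whatever WKB decoder you use
```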

hobu commented 1 year ago

As another data point, I also only implemented arrow::Type::BINARY for PDAL, but I was unsure whether I should support all three possibilities.

paleolimbot commented 1 year ago

In the absence of more informed guidance on how binary-like things get encoded in Parquet files (I'm somewhat new to this), I would guess that it is probably a good idea to support all three. It's a bit of a learning curve, but in theory arrow::type_traits ( https://github.com/apache/arrow/blob/main/cpp/src/arrow/type_traits.h#L334-L372 ) is there to help support all of them with mostly the same code.

jorisvandenbossche commented 12 months ago

On the reading side (when reading with the Arrow C++ library or its bindings), it depends on whether or not the file was written by Arrow, i.e. whether it includes a serialized Arrow schema in the Parquet FileMetadata (which you can also turn off with store_schema=False when writing):

- If no Arrow schema is stored, the BYTE_ARRAY Parquet type always maps back to the (non-large) binary Arrow type, not to large_binary.
- The fixed_size_binary Arrow type gets written as FIXED_LEN_BYTE_ARRAY on the Parquet side, so I think that should never come back when reading BYTE_ARRAY. Thus, I think we can clearly exclude the fixed_size_binary type from handling in applications like GDAL (given we clearly say that GeoParquet should use BYTE_ARRAY).
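A quick way to check the FIXED_LEN_BYTE_ARRAY vs BYTE_ARRAY mapping with pyarrow (file names and the 5-byte dummy values are placeholders; WKB is variable length, so fixed_size_binary is not a natural fit for a geometry column anyway):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Dummy fixed-size and variable-size binary columns standing in for real WKB.
fixed = pa.array([b"AAAAA", b"BBBBB"], type=pa.binary(5))   # fixed_size_binary(5)
var = pa.array([b"AAAAA", b"BBBBB"], type=pa.binary())      # variable-size binary

pq.write_table(pa.table({"geometry": fixed}), "fixed.parquet")
pq.write_table(pa.table({"geometry": var}), "var.parquet")

# Expected: FIXED_LEN_BYTE_ARRAY for the first file, BYTE_ARRAY for the second.
print(pq.ParquetFile("fixed.parquet").schema.column(0).physical_type)
print(pq.ParquetFile("var.parquet").schema.column(0).physical_type)
```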

> I'm not entirely clear how the Arrow library maps a large Parquet file with row group with a WKB column with more than 2 GB of content. I assume that would be Arrow::Type::LARGE_BINARY?

In the first bullet point above, I said that this would always map to binary, not large_binary. That's based on reading the code (a code comment says so as well, though who knows whether it is outdated). I also tried to reproduce those cases, but I didn't get it to return large binary: when reading a Parquet file with a big row group containing many large string values that would require large_binary, Arrow C++ seems to always read it in chunks, even when I set the batch size to a value greater than the number of rows (at that point there seems to be some internal maximum batch size or maximum data size to process at once). So in practice, yes, it seems the Arrow C++ Parquet reader will always give you chunked data.
When trying to create a Parquet file with one huge string that would require large_binary, I get the error "Parquet cannot store strings with size 2GB or more".


Now, if the Parquet file was written with the Arrow schema stored (e.g. using pyarrow, or using GDAL, which I see has that enabled), you will get back whatever Arrow types the data was written with, and so in that case you can get other types than just binary. I think it is good for GDAL to handle large_binary (as you already do), and we can probably also recommend that in general for reading.
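A small pyarrow illustration of that difference (file names are placeholders; this assumes a pyarrow version recent enough to expose store_schema, which is on by default): the same large_binary column reads back as large_binary when the Arrow schema is stored, and as plain binary when it is not.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Dummy bytes standing in for real WKB values.
table = pa.table({"geometry": pa.array([b"\x01", None], type=pa.large_binary())})

pq.write_table(table, "with_schema.parquet")                        # Arrow schema stored (default)
pq.write_table(table, "without_schema.parquet", store_schema=False)

# Expected: large_binary for the first file, binary for the second.
print(pq.read_table("with_schema.parquet").column("geometry").type)
print(pq.read_table("without_schema.parquet").column("geometry").type)
```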

For writing, I agree with what Dewey wrote above: generally you should use a row group size such that (non-large) binary will suffice, which it will in most cases (you would already need huge WKB values to reach the 2 GB limit with only a few thousand rows per row group).
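For example, with pyarrow the row group size can be capped directly at write time (the dummy table and the 64k value below are just an illustration, not a recommendation from the spec):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Dummy table standing in for real geometries; capping each row group at 64k rows
# keeps a row group's WKB column far below the 2 GB offset limit of the
# (non-large) binary type for typical geometry sizes.
table = pa.table({"geometry": pa.array([b"\x01"] * 200_000, type=pa.binary())})
pq.write_table(table, "geometries.parquet", row_group_size=64 * 1024)
```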

rouault commented 11 months ago

I've tried to sum up the outcome of that discussion in https://github.com/opengeospatial/geoparquet/pull/190