opengeospatial / geoparquet

Specification for storing geospatial vector data (point, line, polygon) in Parquet
https://geoparquet.org
Apache License 2.0

Encoding of the key-value metadata when stored in the Parquet FileMetadata #7

Closed jorisvandenbossche closed 2 years ago

jorisvandenbossche commented 2 years ago

This is not yet described explicitly in the spec (https://github.com/opengeospatial/cdw-geo/pull/6 is clarifying this), but in practice we store the geospatial metadata as a JSON-encoded string (json.dumps(..) in Python terms; see the example file and implementation at https://github.com/opengeospatial/cdw-geo/tree/main/examples/geoparquet).

This means that the actual value we store in the Parquet FileMetaData's key_value_metadata under the "geo" key is a string value like '{"version": "0.1.0", "primary_column": "geometry", "columns": {"geometry": {"crs": ... }}}'
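To make the round trip concrete, here is a minimal sketch using only the Python standard library. It is not the reference implementation: the plain dict named key_value_metadata is a hypothetical stand-in for the Parquet footer's key-value pairs, and the "crs" value is a placeholder because it is elided in the example above.

```python
import json

# Hypothetical metadata mirroring the structure quoted above. The "crs"
# value is a placeholder, since the original example elides it.
geo_metadata = {
    "version": "0.1.0",
    "primary_column": "geometry",
    "columns": {"geometry": {"crs": "<placeholder>"}},
}

# Key-value metadata entries are plain strings, so the nested dict is
# serialized to one JSON string. A dict of bytes stands in for the footer.
key_value_metadata = {b"geo": json.dumps(geo_metadata).encode("utf-8")}

# A reader reverses the process: look up the "geo" key, then parse the JSON.
decoded = json.loads(key_value_metadata[b"geo"].decode("utf-8"))
print(decoded["primary_column"])
```

The point of the sketch is that any reader has to run a JSON parser over the "geo" value before it can reach fields such as primary_column.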

I am opening this issue to confirm explicitly whether we are fine with this, or whether we want to consider a different format while further refining the spec.

jorisvandenbossche commented 2 years ago

FYI, we have a similar discussion on the Arrow side (for defining Arrow extension types, where we also have to store metadata) in geo-arrow-spec, see https://github.com/geopandas/geo-arrow-spec/issues/17.

There, we are considering not using a JSON-encoded string, because JSON is a fairly "heavy" format to parse robustly if you want to do so in, for example, plain C (without vendoring a JSON library). For the Arrow extension types that is a typical use case, because we want to use this format to pass data around through the Arrow C Data Interface (https://arrow.apache.org/docs/format/CDataInterface.html). On the other hand, I assume it is less likely that someone will reimplement a Parquet reader in pure C, so in the context of reading a Parquet file a JSON-encoded string might be less of a problem.

paleolimbot commented 2 years ago

I think JSON is great for file-level metadata...anybody crazy enough to implement Parquet in C will have JSON parsing as the least of their worries.