opengeospatial / geoparquet

Specification for storing geospatial vector data (point, line, polygon) in Parquet
https://geoparquet.org
Apache License 2.0

Make JSON Schema definition a core part of the specification #64

Closed by kylebarron 2 years ago

kylebarron commented 2 years ago

In https://github.com/opengeospatial/geoparquet/issues/7 we decided to store metadata as a JSON-encoded blob. In https://github.com/opengeospatial/geoparquet/pull/58, which adds a schema validator, a JSON schema definition is included, but not prominently featured as a core part of the GeoParquet specification.

Learning from previous specs like STAC (which includes JSON Schema definitions for each part of the spec, and allows for a wide range of tools leveraging the JSON schema), I propose that we make this JSON Schema definition a core part of the specification.
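To make the proposal concrete, a minimal JSON Schema fragment for the `geo` file metadata might look something like the sketch below. The exact property names and constraints here are illustrative, loosely based on the draft metadata fields under discussion, not the actual schema from #58:

```json
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "GeoParquet 'geo' file metadata (illustrative sketch)",
  "type": "object",
  "required": ["version", "primary_column", "columns"],
  "properties": {
    "version": { "type": "string" },
    "primary_column": { "type": "string" },
    "columns": {
      "type": "object",
      "minProperties": 1,
      "additionalProperties": {
        "type": "object",
        "required": ["encoding"],
        "properties": {
          "encoding": { "const": "WKB" },
          "crs": { "type": "string" },
          "bbox": {
            "type": "array",
            "items": { "type": "number" },
            "minItems": 4
          }
        }
      }
    }
  }
}
```

With a definition like this published as part of the spec, any off-the-shelf JSON Schema validator can check a file's metadata without custom tooling, which is the STAC pattern being referenced.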

Benefits include:

cholmes commented 2 years ago

Thanks for making the issue @kylebarron, and your nice articulation about the benefits.

I'm +1 on it. I like treating the machine-readable, validating format as almost even more 'definitive' than the words, since it can make exactly what's necessary very clear. And it enables validators that give implementors quick feedback on what they got right or wrong, instead of leaving them to read through the text and interpret it correctly.

It hadn't occurred to me that we could have a schema like this, since Parquet is not a pure JSON format like STAC. But it makes tons of sense for the metadata, and for building tools that check the metadata.

I guess my only question is what, if anything, we should do to validate the data itself. It seems like a validator should not just check the metadata, but should also check that you actually have proper WKB (or a future geometry format) in your column. And perhaps also do checks like whether the geometries correspond to the geometry_type, plus some basic checks on the coordinates (all within -90/90 latitude and -180/180 longitude when no crs is specified), etc.
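A cheap first step toward this kind of data-level check could look like the sketch below: a hypothetical helper (not part of the spec or of the validator in #58) that reads only the WKB header to recover the geometry type, which could then be compared against the declared geometry_type. It assumes ISO-style WKB, where Z/M variants add offsets of 1000/2000/3000 to the base type code:

```python
import struct

# ISO WKB base geometry type codes (Z/M variants add 1000/2000/3000)
WKB_TYPES = {
    1: "Point", 2: "LineString", 3: "Polygon",
    4: "MultiPoint", 5: "MultiLineString", 6: "MultiPolygon",
    7: "GeometryCollection",
}

def wkb_geometry_type(wkb: bytes) -> str:
    """Read the geometry type from a WKB header without parsing coordinates."""
    if len(wkb) < 5:
        raise ValueError("buffer too short to be WKB")
    byte_order = wkb[0]            # 1 = little-endian, 0 = big-endian
    fmt = "<I" if byte_order == 1 else ">I"
    (type_code,) = struct.unpack(fmt, wkb[1:5])
    base = type_code % 1000        # strip ISO Z/M dimension offsets
    if base not in WKB_TYPES:
        raise ValueError(f"unknown WKB geometry type code {type_code}")
    return WKB_TYPES[base]

# Example: a little-endian WKB Point(1.0, 2.0)
point_wkb = struct.pack("<BIdd", 1, 1, 1.0, 2.0)
assert wkb_geometry_type(point_wkb) == "Point"
```

A real validator would likely delegate full geometry parsing to a library like Shapely or GEOS; a header-only check like this just catches gross mismatches quickly.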

kylebarron commented 2 years ago

I guess my only question is what if anything we should do to validate the data itself.

I think this could easily be its own issue for discussion. Validating the metadata is much simpler than validating the data itself. For one, the metadata is a relatively small amount of data, which makes it quick to validate even against remote files, while the Parquet file could hold many GBs of data.
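For context on why the metadata is cheap to reach: a Parquet file ends with its Thrift-encoded footer metadata, a 4-byte little-endian footer length, and the magic bytes `PAR1`, so a validator can fetch just the tail of a remote file (e.g. with an HTTP range request) and never touch the data pages. A rough stdlib-only sketch of locating the footer, purely illustrative since real readers would use a Parquet library such as pyarrow:

```python
import struct

def footer_length(tail: bytes) -> int:
    """Given the last bytes of a Parquet file, return the footer length.

    Parquet files end with: <footer bytes><4-byte LE length>b"PAR1".
    """
    if tail[-4:] != b"PAR1":
        raise ValueError("not a Parquet file (missing PAR1 magic)")
    (length,) = struct.unpack("<I", tail[-8:-4])
    return length

# Example: a fake file tail with a 123-byte footer
fake_tail = b"\x00" * 123 + struct.pack("<I", 123) + b"PAR1"
assert footer_length(fake_tail) == 123
```

The `geo` JSON blob lives in the key-value metadata inside that footer, so metadata validation scales with the footer size, not with the (possibly many-GB) data.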

cholmes commented 2 years ago

I think this could easily be its own issue for discussion.

Created one, and then converted to a discussion, since it seems like the result wouldn't be a spec change.

https://github.com/opengeospatial/geoparquet/discussions/67

cholmes commented 2 years ago

Closed with #93.