opengeospatial / geoparquet

Specification for storing geospatial vector data (point, line, polygon) in Parquet
https://geoparquet.org
Apache License 2.0

GeoParquet

About

This repository defines a specification for storing geospatial vector data (points, lines, polygons) in Apache Parquet, a popular columnar storage format for tabular data; see this vendor explanation for more on what that means. Our goal is to standardize how geospatial data is represented in Parquet, to further geospatial interoperability among tools that use Parquet today and, hopefully, to push forward what's possible with 'cloud-native geospatial' workflows. More than 20 tools and libraries in 6 languages now support GeoParquet; you can learn more at geoparquet.org.

Early contributors include developers from GeoPandas, GeoTrellis, OpenLayers, Vis.gl, Voltron Data, Microsoft, CARTO, Azavea, Planet & Unfolded. Anyone is welcome to join the project by building implementations, trying it out, giving feedback through issues, and contributing to the spec via pull requests. Initial work started in the geo-arrow-spec GeoPandas repository; Arrow work will continue there in a compatible way, while this specification focuses solely on Parquet. We are in the process of becoming an official OGC Standards Working Group and are on the path to becoming a full OGC standard.

The latest stable specification and JSON schema are published at geoparquet.org/releases/.

The community has agreed on this release, but it is still pending OGC approval. We are working to get it officially approved by OGC as soon as possible. The OGC candidate Standard is at https://docs.ogc.org/DRAFTS/24-013.html. The candidate Standard remains in draft form until it is approved as a Standard by the OGC Membership. Released versions of GeoParquet will not be changed, so if changes are needed for OGC approval, they will be released under a new version number.

The 'dev' versions of the spec are available in this repo:

Validating GeoParquet

There are two tools that validate the metadata and the actual data. It is recommended to use one of them to ensure any GeoParquet you produce or are given is completely valid according to the specification:
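As a rough illustration of what a metadata check involves (this is not one of the tools referred to above, and it is no substitute for them), the sketch below inspects the JSON text of a file's "geo" metadata value for the keys the 1.0 specification marks as required. The function name `check_geo_metadata` is hypothetical, and the key lists reflect one reading of the spec, so verify them against the published JSON schema. Extracting the "geo" value from the Parquet footer is assumed to happen elsewhere, with whichever Parquet library you use.

```python
import json

# Assumption: required keys as read from the GeoParquet 1.0 metadata section.
# Check these against the published JSON schema before relying on them.
REQUIRED_FILE_KEYS = ("version", "primary_column", "columns")
REQUIRED_COLUMN_KEYS = ("encoding", "geometry_types")


def check_geo_metadata(geo_json):
    """Return a list of problems found in the JSON text of a "geo" metadata value."""
    try:
        meta = json.loads(geo_json)
    except json.JSONDecodeError as exc:
        return [f"metadata is not valid JSON: {exc}"]
    if not isinstance(meta, dict):
        return ["metadata must be a JSON object"]

    problems = []
    for key in REQUIRED_FILE_KEYS:
        if key not in meta:
            problems.append(f"missing top-level key: {key}")

    columns = meta.get("columns", {})
    if not isinstance(columns, dict):
        problems.append("'columns' must be a JSON object")
        columns = {}

    for name, column in columns.items():
        if not isinstance(column, dict):
            problems.append(f"column {name!r} metadata must be a JSON object")
            continue
        for key in REQUIRED_COLUMN_KEYS:
            if key not in column:
                problems.append(f"column {name!r} is missing key: {key}")

    # The primary column should be one of the described geometry columns.
    primary = meta.get("primary_column")
    if isinstance(primary, str) and primary not in columns:
        problems.append(f"primary_column {primary!r} is not listed in columns")
    return problems
```

An empty list means only that these structural checks passed. A full validator also checks value types, allowed encodings, and the geometry data itself.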

Goals

There are a few core goals driving the initial development.

Our broader goal is to innovate with 'cloud-native vector', providing a stable base for trying out new ideas for cloud-native & streaming workflows.

Features

A quick overview of what GeoParquet supports (or at least plans to support).

GeoParquet is less well suited to some uses. The biggest limitation is write-heavy workloads: a row-based format will work much better when backing a system that constantly updates existing data and adds new data.

Versioning

As of version 1.0, the specification follows Semantic Versioning, so any breaking change from that point on will require the spec to go to 2.0.0.
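For a reader or writer, the practical consequence is that, from 1.0 onward, the major version number signals compatibility. A minimal sketch, assuming version strings of the form `MAJOR.MINOR.PATCH` with an optional pre-release or build suffix; the helper names `parse_version` and `crosses_major_version` are hypothetical and not part of the spec:

```python
def parse_version(version):
    """Split 'MAJOR.MINOR.PATCH' into integers, ignoring any '-pre' or '+build' suffix."""
    core = version.split("+")[0].split("-")[0]
    major, minor, patch = core.split(".")
    return int(major), int(minor), int(patch)


def crosses_major_version(old, new):
    """True if moving from `old` to `new` raises the major version.

    Under Semantic Versioning, from 1.0.0 onward this is the only kind of
    release that is allowed to contain breaking changes.
    """
    return parse_version(new)[0] > parse_version(old)[0]


if __name__ == "__main__":
    for old, new in [("1.0.0", "1.1.0"), ("1.1.0", "2.0.0")]:
        print(old, "->", new, "may break:", crosses_major_version(old, new))
```

A tool could use a check like this to decide whether a file declaring a newer spec version can still be read as is, or whether the reader needs updating first.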

Current Implementations & Examples

Examples of GeoParquet files following the current spec can be found in the examples/ folder. For information on all the tools and libraries implementing GeoParquet, as well as sample data, see the implementations section of the website.