ipeaGIT / geobr

Easy access to official spatial data sets of Brazil in R and Python
https://ipeagit.github.io/geobr/

Plans to migrate from GeoPackage to GeoParquet #290

Open rafapereirabr opened 2 years ago

rafapereirabr commented 2 years ago

Context

All data sets used in geobr are currently stored as GeoPackage (.gpkg) files. The choice of GeoPackage was an easy one: it is a robust, compact, open-standard format for geospatial data. A key aspect here is that .gpkg files are platform-independent, so we can make sure that geobr data is consistent for both R and Python users.

Nonetheless, we are seeing major advances in the development of GeoParquet, a new data format for storing geospatial vector data (points, lines, polygons). GeoParquet is built on top of Apache Parquet, a popular columnar storage format for tabular data. It is much (much!) more efficient than GeoPackage in terms of file size as well as in terms of the speed of reading and saving files. I believe it's safe to say that GeoParquet has a bright future in the geospatial industry because of its flexibility and efficiency.

What to expect:

I would like to migrate all data sets available in geobr from GeoPackage to the GeoParquet .parquet format in geobr v2.0. This should be done in 2023. I need some time to fix some issues in geobr, and it would be good to wait a little longer to see GeoParquet become a stable specification, with more robust and stable packages to manipulate GeoParquet in R and Python.

How will this affect geobr users?

How will this affect geobr developers?

There are already libraries that can read GeoParquet files in both R and Python (see below). geobr v2.0 will only need a couple of additional package dependencies to be able to read geospatial data in .parquet format. In practice, this should have minimal effect on code development.
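To give a rough idea of how small the change could be on the Python side, here is a minimal sketch that picks a geopandas reader based on the file extension. It is not geobr's actual code: `READERS`, `pick_reader` and `read_geodata` are hypothetical names made up for illustration. It assumes geopandas is installed, and that pyarrow is available for `geopandas.read_parquet`.

```python
from pathlib import Path

# Hypothetical mapping from file extension to the geopandas function that reads it.
READERS = {".gpkg": "read_file", ".parquet": "read_parquet"}


def pick_reader(path):
    """Return the name of the geopandas reader for a file, based on its extension."""
    suffix = Path(str(path)).suffix.lower()
    if suffix not in READERS:
        raise ValueError(f"unsupported format: {suffix!r}")
    return READERS[suffix]


def read_geodata(path):
    """Read a .gpkg or .parquet file into a GeoDataFrame."""
    # Imported lazily so the extension check above works without geopandas;
    # read_parquet additionally requires pyarrow (assumed to be installed).
    import geopandas as gpd

    return getattr(gpd, pick_reader(path))(path)
```

With something like this, the same call would work during a transition period in which some data sets are still .gpkg and others are already .parquet.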

JoaoCarabetta commented 2 years ago

The python team supports this decision emphatically.

I just recommend planning the transition carefully, given that the GeoParquet spec is not stable yet. Their current documentation expects stability at version v1.0.0, but they are still at version v0.3.0. (see text below)

Roadmap

Our aim is to get to a 1.0.0 within 'months', not years. The rough plan is:

  • 0.1 - Get the basics established, provide a target for implementations to start building against.
  • 0.2 / 0.3 - Feedback from implementations, 3D coordinates support, geometry types, crs optional.
  • 0.4 - Feedback from implementations, add spatial index.
  • 0.x - Several iterations based on feedback from implementations.
  • 1.0.0-RC.1 - Aim for this when there are at least 6 implementations that all work interoperably and all feel good about the spec.
  • 1.0.0 - Once there are 12(?) implementations in diverse languages we will lock in for 1.0

Our detailed roadmap is in the Milestones and we'll aim to keep it up to date.

rafapereirabr commented 10 months ago

For the record, GeoParquet v1.0.0 (stable) has now been released.

In order to implement GeoParquet in geobr, we still need to investigate the best approaches / packages to read GeoParquet into R and Python. Because this is all very recent, it might take a few months before we have stable R and Python packages to do this.
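Since spec stability was the main concern raised above, a sketch of how one might check which spec version a file was written against may be useful. GeoParquet stores a JSON string under the `geo` key of the Parquet file metadata, and that JSON includes a `version` field. The function name `is_stable_geoparquet` is hypothetical, and how the JSON string is extracted from a file (for example with pyarrow) is left out here.

```python
import json


def is_stable_geoparquet(geo_json):
    """Return True if the 'geo' metadata declares a stable (>= 1.0.0) spec version.

    `geo_json` is the JSON string stored under the "geo" key of the file metadata.
    Pre-release versions such as "1.0.0-rc.1" are treated as not stable.
    """
    version = json.loads(geo_json)["version"]
    # Split "1.0.0-rc.1" into the core version and an optional pre-release tag.
    core, _, prerelease = version.partition("-")
    major = int(core.split(".")[0])
    return major >= 1 and not prerelease
```

A check like this could be used when preparing geobr files, to make sure nothing written against a 0.x draft of the spec ends up in a release.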