paradedb / paradedb

Postgres for Search and Analytics
https://paradedb.com
GNU Affero General Public License v3.0

Support for Geospatial data using GeoParquet (or similar) #865

Open humaidkidwai opened 4 months ago

humaidkidwai commented 4 months ago

What: I would be interested to see support for geospatial data in ParadeDB. Postgres has the PostGIS extension, which handles a diverse set of geospatial use cases; ParadeDB could add support for something similar, but on column stores. GeoParquet is an OGC-incubated file format that extends Parquet to support standard vector data encoded as WKT/WKB.

Why: Support for GeoParquet would be very helpful, as an increasing number of organizations (source.coop, Microsoft) are transforming their data to a cloud-native file format for interoperability and geospatial analysis at scale. DuckDB supports it already and many are to follow suit.

How: GeoParquet is not an entirely new file format, just a specification on top of the existing Parquet format (it adds geo metadata to every Parquet file), so it can be smoothly integrated with ParadeDB's native support for Parquet.
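To illustrate the "How" point, here is a minimal sketch of what the extra geo metadata amounts to, using only the Python standard library. It assumes the GeoParquet 1.0 layout, where a JSON document is stored under the `geo` key of the Parquet file metadata and geometries are stored as WKB bytes. The `GEO_METADATA` value and the helper function names are hypothetical, made up for this example; they are not part of ParadeDB or of any library API.

```python
import json
import struct
from typing import Tuple

# Hypothetical example of the JSON stored under the "geo" key of a Parquet
# file's key-value metadata, following the GeoParquet 1.0 layout.
GEO_METADATA = json.dumps({
    "version": "1.0.0",
    "primary_column": "geometry",
    "columns": {
        "geometry": {"encoding": "WKB", "geometry_types": ["Point"]},
    },
})


def primary_geometry_column(geo_json: str) -> Tuple[str, str]:
    """Return (column name, encoding) of the primary geometry column."""
    meta = json.loads(geo_json)
    name = meta["primary_column"]
    return name, meta["columns"][name]["encoding"]


def encode_wkb_point(x: float, y: float) -> bytes:
    """Encode a 2D point as little-endian WKB (byte order 1, geometry type 1)."""
    return struct.pack("<BIdd", 1, 1, x, y)


def decode_wkb_point(wkb: bytes) -> Tuple[float, float]:
    """Decode a 2D WKB point, honouring the byte-order flag in the first byte."""
    order = "<" if wkb[0] == 1 else ">"
    geom_type, x, y = struct.unpack(order + "Idd", wkb[1:21])
    if geom_type != 1:
        raise ValueError(f"expected WKB Point (type 1), got type {geom_type}")
    return x, y
```

A reader that already handles Parquet only needs to look up the `geo` key and decode the named column, which is why the format can sit on top of existing Parquet support.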

philippemnoel commented 4 months ago

Hi @humaidkidwai! Thanks for filing. We're converting our base Dockerfile from the standard Postgres to Bitnami's Postgres, which includes PostGIS and is more production-ready. We'll also be including pg_cron.

Do you need PostGIS, or do you specifically need GeoParquet/geo data over columnar tables? That is something we can perhaps do eventually, but it would be a much lower priority.

humaidkidwai commented 4 months ago

That's good to know @philippemnoel, PostGIS support will be superb! I would be more interested in GeoParquet, which is essentially Parquet with a standardized way of storing geometries. Will keep following.

philippemnoel commented 4 months ago

This you mean? https://github.com/opengeospatial/geoparquet

We're open to it, but I see a pretty serious blocker:

We use Parquet via delta-rs. So there would need to be a Rust-based implementation, mature enough as a crate, that delta-rs could adopt. Until that day, we won't be able to integrate it within ParadeDB.

Are you familiar with such an initiative?

humaidkidwai commented 4 months ago

There is some work on that front by Kyle Barron: he is working on geoarrow-rs, a Rust-based implementation of GeoArrow, and recently merged some changes to support reading and writing GeoParquet. However, I wouldn't go as far as to say that it is mature enough yet.

philippemnoel commented 4 months ago

Alright, well excited to follow the development!

kylebarron commented 4 months ago

GeoParquet might not work out of the box with delta lake (there are ongoing spec discussions for iceberg compatibility) and I wouldn't be surprised if delta-rs would want to implement geo support in an extension anyways. And then geo support in datafusion doesn't exist yet (my work is a precursor to it, but I'm not focusing on datafusion integration yet).

So it's probably a while before it's directly integratable in paradedb

philippemnoel commented 4 months ago

> GeoParquet might not work out of the box with delta lake (there are ongoing spec discussions for iceberg compatibility) and I wouldn't be surprised if delta-rs would want to implement geo support in an extension anyways. And then geo support in datafusion doesn't exist yet (my work is a precursor to it, but I'm not focusing on datafusion integration yet).
>
> So it's probably a while before it's directly integratable in paradedb

Yeah that makes sense -- that's the feeling I was getting as well. Thank you for chiming in and good luck with your work :) We're excited to follow along and integrate it in ParadeDB once we can!

philippemnoel commented 2 months ago

We've added support for PostGIS, by the way. As far as GeoParquet, that is in @kylebarron's hands :)

philippemnoel commented 1 week ago

@kylebarron -- we're considering moving to DuckDB as the engine powering our analytics offering. It seems to support GeoParquet in some form, as per: https://github.com/cholmes/duckdb-geoparquet-tutorials

Are you familiar at all? Would love your input on this.

humaidkidwai commented 1 week ago

DuckDB just had support for GeoParquet merged

kylebarron commented 1 week ago

> we're considering moving to DuckDB as the engine powering our analytics offering

That's a big change! Instead of datafusion? That would be interesting to read why

Yeah I think the next release of DuckDB-spatial has GeoParquet support planned.

philippemnoel commented 1 week ago

> > we're considering moving to DuckDB as the engine powering our analytics offering
>
> That's a big change! Instead of datafusion? That would be interesting to read why
>
> Yeah I think the next release of DuckDB-spatial has GeoParquet support planned.

We'll write about why, I'm excited to share it with you. We'll still be using DataFusion, but in a different part of the stack. More here soon :)

archiewood commented 4 days ago

following this eagerly

are there any (early stage) docs about how to use the geoparquet functionality?