tidyverse / duckplyr

A drop-in replacement for dplyr, powered by DuckDB for performance.
https://duckplyr.tidyverse.org/

Support for geo data #91

Open Robinlovelace opened 7 months ago

Robinlovelace commented 7 months ago

This looks great! One feature request I have in mind is support (either via new functions/functionality or via documentation if it works out of the box) for spatial data. See this by @cboettig for inspiration: https://github.com/cboettig/duckdbfs#spatial-data

Another potential source of inspiration is sf's support for tidy operations; it's great how summarise() and other functions 'just work' with tidy verbs: https://r-spatial.github.io/sf/reference/tidyverse.html

krlmlr commented 2 months ago

Thanks for raising this, Robin. Integration with the duckdb spatial extension would be a really cool feature, but also a lot of work.

Do we need to figure out how to translate sf data frames into something that the duckdb spatial extension understands, and vice versa?

Adding support for functions is then "only" a matter of diligence: https://github.com/duckdblabs/duckplyr/pull/179/files#diff-a202cfba76540d6822868ac7755edd4945b6344057d78e0092f4836e33c0d4eaR11 .

Robinlovelace commented 2 months ago

> Do we need to figure out how to translate sf data frames into something that the duckdb spatial extension understands, and vice versa?

I imagine so, and given that everything other than the geometry column is already sorted, it's just the geometry that needs converting (safe to assume just 1 geometry column in 99% of use cases I think).

Robinlovelace commented 2 months ago

Seems like DuckDB -> sf has been implemented here: https://github.com/cboettig/duckdbfs/blob/main/R/to_sf.R

Not sure how hard the other way would be, let alone how to make it fast.

cboettig commented 2 months ago

The duckdb -> sf conversion there is mostly solid, but could be a bit better. Currently there are a couple of different ways in which geospatial data is stored in duckdb:

Re sf -> duckdb, I don't think this is much of an issue, though there are various ways to do it depending on precisely what you mean by "to duckdb". Specifically, I think the best thing to do is simply to have sf write out a geoparquet file to disk. (This assumes sf is built with a recent GDAL that has Arrow support, of course!) Since this use case presumably means the data is small enough to fit in RAM, writing out as, say, geodatabase is probably just as good (maybe better given the issue noted above), and duckdb can then read that in. It is possible to write to duckdb's native database format with DBI instead (i.e. with the WKB binary column), and then you'd need the extra coercion once in duckdb to turn it into duckdb's internal spatial type, but I don't see the use for that. (For most users I think it's actually better to pretend that duckdb's native database doesn't exist and work directly against flat files.)

Sorry, long story short, I think duckdbfs should handle both cases (simply noting that sf should serialize to disk in any standard spatial format), modulo this edge case about geoparquet.
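As a rough illustration of the "serialize to disk, then let duckdb read it" route described above, here is a minimal sketch. It assumes the sf, DBI and duckdb packages are installed and that the DuckDB spatial extension can be installed (network access on first use). GeoPackage is picked here only as one example of a standard spatial format, and the demo dataset and object names are purely illustrative.

```r
library(sf)
library(DBI)

# Small in-memory sf object, using the demo shapefile that ships with sf.
nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

# Step 1: sf serializes to a standard spatial format on disk
# (GeoPackage is an assumption for this sketch; other formats should work too).
path <- tempfile(fileext = ".gpkg")
st_write(nc, path, quiet = TRUE)

# Step 2: duckdb reads the file through its spatial extension.
con <- dbConnect(duckdb::duckdb())
dbExecute(con, "INSTALL spatial")  # downloads the extension the first time
dbExecute(con, "LOAD spatial")

# ST_Read() is provided by the spatial extension and reads via GDAL.
n <- dbGetQuery(
  con,
  sprintf("SELECT count(*) AS n FROM ST_Read('%s')", path)
)$n

dbDisconnect(con, shutdown = TRUE)
```

The row count is only a sanity check that every feature made it across; a real workflow would select the geometry and attribute columns it needs.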

mdsumner commented 1 month ago

I would use {wk} for the sf<->wkb<->blob conversion, it supports a wide range of other conversions already (not terra::vect sadly).
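A minimal sketch of the wkb <-> blob leg of that conversion, using only {wk}. The coordinates are made up, and the plain list of raw vectors stands in for what a BLOB column would hold; nothing here is duckplyr API.

```r
library(wk)

# Two illustrative geometries (made-up coordinates), starting from WKT.
geom_wkt <- wkt(c("POINT (1 2)", "LINESTRING (0 0, 1 1)"))

# Geometry -> WKB: a wk_wkb vector is a list of raw vectors underneath.
geom_wkb <- as_wkb(geom_wkt)

# WKB -> plain list of raw vectors, i.e. the shape of a BLOB column.
raw_list <- lapply(unclass(geom_wkb), identity)

# BLOB -> WKB again, the same wk::wkb() call used elsewhere in this thread.
geom_back <- wkb(raw_list)

print(as_wkt(geom_back))
```

Because the geometry stays as raw bytes until wkb() is called, a reader could return the BLOB untouched and leave the spatial interpretation to the caller.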

Should BLOB type be already supported?

I see

## wget https://data.source.coop/fused/overture/2024-02-15-alpha-0/theme=admins/type=administrativeBoundary/0.parquet
duckplyr::duckplyr_df_from_parquet("0.parquet")
Error: rel_to_altrep: Unknown column type for altrep: BLOB

This would otherwise look like this

arrow::open_dataset("0.parquet") |> dplyr::select(geometry) |> dplyr::collect() |>  dplyr::mutate(geometry = wk::wkb(geometry))
# A tibble: 2,587 × 1
   geometry
   <wk_wkb>
 1 <LINESTRING (-175.3083 -21.12098, -175.3094 -21.12427, -175.3098 -21.12571, …
 2 <LINESTRING (-175.2667 -21.14462, -175.2673 -21.14619, -175.2681 -21.14822, …
 3 <LINESTRING (-175.2686 -21.12686, -175.2684 -21.12997, -175.2692 -21.13471, …

I don't think any of the spatial stuff belongs here, unless an import of wk is welcome ... I suggest returning the binary as-is, or as {blob}. sf itself has st_as_sf() and handles the more generic basis provided by wk. {RODBC} fwiw did support this kind of geometry read way back in the 2000s, and worked well with various backends, but that long predated data frame and blob vector support.

For general read via GDAL, I would look at the vector support in {gdalraster} and (we can do it!) work on a lazy vctrs form for the OGR pointer type, alternatively GDAL can provide geos pointers directly to {geos}. sf doesn't have any capacity for these lazy or alternative/intermediate forms for the geometry from general sources so I don't think it's a good thing to focus on always (it's well supported by conversions already).

cboettig commented 4 weeks ago

It seems the goal for spatial support in duckplyr should be to expose the spatial abilities of duckdb directly to the R user.

The ibis project in Python seems like a natural analogue here -- as you probably know already, ibis is essentially a dbplyr for Python. When using the duckdb backend engine, it supports many, though not yet all, of the spatial abilities in duckdb[spatial], returning a geopandas data.frame if the user calls to_pandas() (which is essentially the ibis analogue of collect()), e.g. https://ibis-project.org/posts/ibis-duckdb-geospatial-dev-guru/

Robinlovelace commented 4 weeks ago

Cool stuff, keeping a beady eye on this conversation, thanks for keeping it rolling forward.