Robinlovelace opened 7 months ago
Thanks for raising this, Robin. Integration with the duckdb spatial extension would be a really cool feature, but also a lot of work.
Do we need to figure out how to translate sf data frames into something that the duckdb spatial extension understands, and vice versa?
Adding support for functions is then "only" a matter of diligence: https://github.com/duckdblabs/duckplyr/pull/179/files#diff-a202cfba76540d6822868ac7755edd4945b6344057d78e0092f4836e33c0d4eaR11 .
> Do we need to figure out how to translate sf data frames into something that the duckdb spatial extension understands, and vice versa?
I imagine so, and given that everything other than the geometry column is already sorted, it's just the geometry that needs converting (safe to assume just 1 geometry column in 99% of use cases I think).
Seems like DuckDB -> sf has been implemented here: https://github.com/cboettig/duckdbfs/blob/main/R/to_sf.R
Not sure how hard the other way would be, let alone how to make it fast.
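The duckdb -> sf direction can be sketched roughly along the lines of the duckdbfs approach: ask DuckDB to serialise its internal GEOMETRY column to WKB and let sf parse the resulting raw vectors. This is only a sketch — the connection setup, the file name `some_file.gpkg`, and the column name `geom` are illustrative:

```r
library(duckdb)
library(sf)

con <- dbConnect(duckdb())
dbExecute(con, "INSTALL spatial; LOAD spatial;")

# Serialise duckdb's internal geometry to WKB on the way out...
res <- dbGetQuery(con, "
  SELECT * REPLACE (ST_AsWKB(geom) AS geom)
  FROM st_read('some_file.gpkg')
")

# ...then let sf parse the list of raw WKB vectors.
res$geom <- st_as_sfc(structure(res$geom, class = "WKB"))
sf_res <- st_as_sf(res)
```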
The duckdb -> sf conversion there is mostly solid, but could be a bit better. Currently there are a couple of different ways in which geospatial data is stored in duckdb:

1. If duckdb reads in a vector format file (shapefile, geodatabase, anything BUT geoparquet), it parses with gdal and converts to duckdb's internal geometry type. This is the use case that the above handles. (Though I think the column name for the geometry is inherited from the file, e.g. it might not be called geometry, so really we need to handle that.)
2. If duckdb reads in geoparquet, it does not use the gdal parser (because duckdb's native parquet parser is so much faster!). However, this also means (at least currently) that the column is read in as a binary blob and not the native geometry type, so we need an extra coercion. I've been meaning to add this, though it might eventually be solved upstream, see https://github.com/duckdb/duckdb_spatial/issues/299#event-12557731125
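For the geoparquet case, the extra coercion can be done in SQL with the spatial extension's `ST_GeomFromWKB`. A minimal sketch, assuming the spatial extension loads and using a hypothetical local file name:

```r
library(duckdb)

con <- dbConnect(duckdb())
dbExecute(con, "INSTALL spatial; LOAD spatial;")

# The native parquet reader yields the geometry as a WKB blob column;
# upgrade it to duckdb's internal GEOMETRY type in the query itself.
dbGetQuery(con, "
  SELECT * REPLACE (ST_GeomFromWKB(geometry) AS geometry)
  FROM read_parquet('some_geo_file.parquet')
  LIMIT 5
")
```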
Re sf -> duckdb, I don't think this is much of an issue, though there are various ways to do it depending on precisely what you mean by "to duckdb". Specifically, I think the best thing to do is simply to have sf write out a geoparquet file to disk (this assumes sf is built with a recent gdal that has arrow support, of course!) and then have duckdb read that in. Since presumably this use case means the data is small enough to fit in RAM, writing out as, say, geodatabase is probably just as good (maybe better, given the issue noted above).

It is possible to write to duckdb's native database format with DBI instead (i.e. with the WKB-binary column), and then you'd need the extra coercion once in duckdb to make it into duckdb's internal spatial type, but I don't see the use for that. (For most users I think it's actually better to pretend that duckdb's native database format doesn't exist and work directly against flat files.)
Sorry, long story short, I think duckdbfs should handle both cases (simply noting that sf should serialize to disk in any standard spatial format), modulo this edge case about geoparquet.
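The write-to-disk route described above might look something like this sketch, using sf's bundled nc dataset and a temporary geopackage (the paths and names are illustrative, not part of any existing API):

```r
library(sf)
library(duckdb)

nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

# sf serialises to a standard spatial format on disk...
path <- file.path(tempdir(), "nc.gpkg")
st_write(nc, path, quiet = TRUE)

# ...and duckdb reads it back via the spatial extension's gdal-based st_read.
con <- dbConnect(duckdb())
dbExecute(con, "INSTALL spatial; LOAD spatial;")
head(dbGetQuery(con, sprintf("SELECT * FROM st_read('%s')", path)))
```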
I would use {wk} for the sf<->wkb<->blob conversion; it already supports a wide range of other conversions (sadly not terra::vect).
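A minimal sketch of the sf -> wkb -> blob direction with {wk} (the object names here are just illustrative, and this assumes wk's sf conversion methods are available):

```r
library(wk)
library(sf)

# An sfc geometry column...
pts <- st_sfc(st_point(c(0, 0)), st_point(c(1, 1)))

# ...becomes a wk_wkb vector (a list of raw WKB payloads)...
wkb_col <- as_wkb(pts)

# ...which, stripped of its class, is a plain list of raw vectors,
# i.e. something a database driver can store as a BLOB column.
blob_col <- unclass(wkb_col)
```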
Should the BLOB type already be supported? I see

```r
## wget https://data.source.coop/fused/overture/2024-02-15-alpha-0/theme=admins/type=administrativeBoundary/0.parquet
duckplyr::duckplyr_df_from_parquet("0.parquet")
#> Error: rel_to_altrep: Unknown column type for altrep: BLOB
```
This would otherwise look like this:

```r
arrow::open_dataset("0.parquet") |>
  dplyr::select(geometry) |>
  dplyr::collect() |>
  dplyr::mutate(geometry = wk::wkb(geometry))
#> # A tibble: 2,587 × 1
#>   geometry
#>   <wk_wkb>
#> 1 <LINESTRING (-175.3083 -21.12098, -175.3094 -21.12427, -175.3098 -21.12571, …
#> 2 <LINESTRING (-175.2667 -21.14462, -175.2673 -21.14619, -175.2681 -21.14822, …
#> 3 <LINESTRING (-175.2686 -21.12686, -175.2684 -21.12997, -175.2692 -21.13471, …
```
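From there, a sketch of taking that wk_wkb column the rest of the way to a full sf object — assuming the wk/sf conversion methods are registered, which is how I understand the wk/sf interplay to work:

```r
geo <- arrow::open_dataset("0.parquet") |>
  dplyr::select(geometry) |>
  dplyr::collect() |>
  dplyr::mutate(geometry = wk::wkb(geometry))

# Upgrade the wk_wkb column to an sfc column, then promote the
# data frame to sf (CRS would still need to be set separately).
sf_geo <- geo |>
  dplyr::mutate(geometry = sf::st_as_sfc(geometry)) |>
  sf::st_as_sf()
```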
I don't think any of the spatial stuff belongs here, unless an import of wk is welcome ... I suggest returning the binary as-is, or as {blob}. sf itself has st_as_sf() and handles the more generic basis provided by wk. {RODBC}, fwiw, did support this kind of geometry read way back in the 2000s and worked well with various backends, but that long predated data frame and blob vector support.

For general reads via GDAL, I would look at the vector support in {gdalraster} and (we can do it!) work on a lazy vctrs form for the OGR pointer type; alternatively, GDAL can provide geos pointers directly to {geos}. sf doesn't have any capacity for these lazy or alternative/intermediate forms for geometry from general sources, so I don't think it's a good thing to always focus on (it's well supported by conversions already).
It seems the goal for spatial support in duckplyr should be to expose the spatial abilities of duckdb directly to the R user.
The ibis project in Python seems like a natural analogue here -- as you probably know already, ibis is essentially a dbplyr for Python. When using the duckdb backend engine, it supports many (though not yet all) of the spatial abilities in duckdb[spatial], returning a geopandas data.frame if the user calls to_pandas() (which is essentially ibis's analogue of collect()). e.g. https://ibis-project.org/posts/ibis-duckdb-geospatial-dev-guru/
Cool stuff, keeping a beady eye on this conversation, thanks for keeping it rolling forward.
This looks great! One feature request I have in mind is support for spatial data (either via new functions/functionality, or via documentation if it works out of the box). See this by @cboettig for inspiration: https://github.com/cboettig/duckdbfs#spatial-data
Another potential source of inspiration is sf's support for tidy operations; it's great how summarise() and other functions 'just work' with tidy verbs: https://r-spatial.github.io/sf/reference/tidyverse.html