rspatial / terra

R package for spatial data handling https://rspatial.github.io/terra/reference/terra-package.html
GNU General Public License v3.0
541 stars 89 forks source link

New Feature: Parquet vector support #1347

Open peetmate opened 11 months ago

peetmate commented 11 months ago

Terra support for reading in .parquet vector files would be most welcome. Currently I am using sfarrow package to read into R then converting to spatvect.

kadyb commented 11 months ago

@paleolimbot, do you have any tips on how to load a .parquet file with geometry? I once tried to load using {arrow}, but I couldn't find a way to convert from arrow binary to WKT.

kadyb commented 11 months ago

The code below works without {sfarrow}, but requires {jsonlite} to parse JSON with metadata and {wk} to convert WKB to WKT.

library("wk")
library("terra")
library("arrow")
library("jsonlite")

# sample data
url = "https://github.com/opengeospatial/geoparquet/raw/main/examples/example.parquet"
tmp = tempfile(fileext = ".parquet")
download.file(url, tmp, mode = "wb")

x = arrow::read_parquet(tmp, as_data_frame = FALSE)
metadata = jsonlite::fromJSON(x$metadata$geo)
crs = paste(metadata$columns$geometry$crs$id, collapse = ":")
x = as.data.frame(x)
geom = vector("list", length = length(x$geometry))
for (i in seq_len(length(x$geometry))) geom[[i]] = x$geometry[[i]]
wkt = as.character(wk::as_wkt(wk::wkb(geom)))
v = vect(wkt, crs = crs)
idx = which(colnames(x) == "geometry")
v = cbind(v, x[, -idx])
v
# class       : SpatVector 
# geometry    : polygons 
# dimensions  : 5, 5  (geometries, attributes)
# extent      : -180, 180, -18.28799, 83.23324  (xmin, xmax, ymin, ymax)
# coord. ref. : lon/lat WGS 84 
# names       :   pop_est continent      name iso_a3 gdp_md_est
# type        :     <num>     <chr>     <chr>  <chr>      <int>
# values      :   8.9e+05   Oceania      Fiji    FJI       5496
#               5.801e+07    Africa  Tanzania    TZA      63177
paleolimbot commented 11 months ago

Sorry for the delay here!

I imagine this will only have reasonable levels of performance if a SpatVect can be created from something with a WKB column (basically what you have above except without the WKT conversion).

The rewrite of the geoarrow package (at https://github.com/geoarrow/geoarrow-c ) will have a GeoParquet reader as well and should make it more straightforward to create geometry directly from ArrowArrays (i.e., skip the intermediary list() of raw(), which is sometimes a bottleneck for performance). That probably won't be released to CRAN until early next year, though.

mdsumner commented 11 months ago

doesn't it already work? I use vect() routinely to read from non-geom Parquet urls and pretty sure also with geometry.

Seems fine, maybe it's about the GDAL version you have available. Parquet is since version 3.5, and so you might be slammed - even ubuntu 22.04 is still GDAL 3.4.

library(terra)
 vect("/vsicurl/https://github.com/OSGeo/gdal/raw/master/autotest/ogr/data/parquet/poly_wkb_large_binary.parquet")
 class       : SpatVector 
 geometry    : polygons 
 dimensions  : 10, 3  (geometries, attributes)
 extent      : 478315.5, 481645.3, 4762880, 4765610  (xmin, xmax, ymin, ymax)
 source      : poly_wkb_large_binary.parquet
 coord. ref. : OSGB36 / British National Grid (EPSG:27700) 
 names       :      AREA EAS_ID  PRFEDEA
 type        :     <num>  <int>    <chr>
 values      : 2.152e+05    168 35043411
               2.473e+05    179 35043423
               2.618e+05    171 35043414
vect("/vsicurl/https://github.com/opengeospatial/geoparquet/raw/main/examples/example.parquet")

class       : SpatVector 
 geometry    : polygons 
 dimensions  : 5, 5  (geometries, attributes)
 extent      : -180, 180, -18.28799, 83.23324  (xmin, xmax, ymin, ymax)
 source      : example.parquet
 coord. ref. : lon/lat WGS 84 (CRS84) (OGC:CRS84) 
 names       :   pop_est continent      name iso_a3 gdp_md_est
 type        :     <num>     <chr>     <chr>  <chr>      <int>
 values      :   8.9e+05   Oceania      Fiji    FJI       5496
               5.801e+07    Africa  Tanzania    TZA      63177
               6.033e+05    Africa W. Sahara    ESH        907

check your version and drivers with

terra::gdal()

terra::gdal(drivers = TRUE)

https://gdal.org/drivers/vector/parquet.html

On ubuntu use the unstable ppa to get 3.6 (though still that might not have the reqs)

https://launchpad.net/~ubuntugis/+archive/ubuntu/ubuntugis-unstable

kadyb commented 11 months ago

On Windows, R 4.3.2 includes GDAL 3.7.2, but it doesn't seem to support parquet (driver not compiled).

Edit: You are right, it's 3.7.2 now, not 3.6.2.

mdsumner commented 11 months ago

let's get onto that then, Roger has retired 🙏

(my windows has 3.7.2 I think, but I don't pay any attention to that os anymore except to check cran versions )

kadyb commented 11 months ago

@kalibera, could you please consider adding this driver to GDAL in R? But maybe this won't be necessary if @paleolimbot completes the {geoarrow} package?

rhijmans commented 11 months ago

I asked Tomas a week ago and he wrote:

Apache Arrow does not have a build configuration in MXE, so someone would have to create that. I can try, but if you could get someone else to do that, it might happen sooner. The same applies to Apache Parquet if that is needed separately.

paleolimbot commented 11 months ago

A build configuration for MXE is on our radar for the Arrow R package, too, but because the Arrow R package requires so many of Arrow's optional features (far beyond just Parquet), it's likely to be quite a bit of work. I think using the Parquet reader in the Arrow R package is probably the only reasonable approach for the foreseeable future.

From terra's side, the way to maximize performance in both cases would be to implement conversion from a WKB represented by nanoarrow_array into whatever representation is used for geometry by SpatVect.