Open peetmate opened 11 months ago
@paleolimbot, do you have any tips on how to load a .parquet file with geometry? I once tried to load using {arrow}
, but I couldn't find a way to convert from arrow binary to WKT.
The code below works without {sfarrow}
, but requires {jsonlite}
to parse JSON with metadata and {wk}
to convert WKB to WKT.
library("wk")
library("terra")
library("arrow")
library("jsonlite")
# sample data
url = "https://github.com/opengeospatial/geoparquet/raw/main/examples/example.parquet"
tmp = tempfile(fileext = ".parquet")
download.file(url, tmp, mode = "wb")
x = arrow::read_parquet(tmp, as_data_frame = FALSE)
metadata = jsonlite::fromJSON(x$metadata$geo)
crs = paste(metadata$columns$geometry$crs$id, collapse = ":")
x = as.data.frame(x)
geom = vector("list", length = length(x$geometry))
for (i in seq_len(length(x$geometry))) geom[[i]] = x$geometry[[i]]
wkt = as.character(wk::as_wkt(wk::wkb(geom)))
v = vect(wkt, crs = crs)
idx = which(colnames(x) == "geometry")
v = cbind(v, x[, -idx])
v
# class : SpatVector
# geometry : polygons
# dimensions : 5, 5 (geometries, attributes)
# extent : -180, 180, -18.28799, 83.23324 (xmin, xmax, ymin, ymax)
# coord. ref. : lon/lat WGS 84
# names : pop_est continent name iso_a3 gdp_md_est
# type : <num> <chr> <chr> <chr> <int>
# values : 8.9e+05 Oceania Fiji FJI 5496
# 5.801e+07 Africa Tanzania TZA 63177
Sorry for the delay here!
I imagine this will only have reasonable levels of performance if a SpatVect
can be created from something with a WKB column (basically what you have above except without the WKT conversion).
The rewrite of the geoarrow package (at https://github.com/geoarrow/geoarrow-c ) will have a GeoParquet reader as well and should make it more straightforward to create geometry directly from ArrowArray
s (i.e., skip the intermediary list()
of raw()
, which is sometimes a bottleneck for performance). That probably won't be released to CRAN until early next year, though.
doesn't it already work? I use vect() routinely to read from non-geom Parquet urls and pretty sure also with geometry.
Seems fine, maybe it's about the GDAL version you have available. Parquet is since version 3.5, and so you might be slammed - even ubuntu 22.04 is still GDAL 3.4.
library(terra)
vect("/vsicurl/https://github.com/OSGeo/gdal/raw/master/autotest/ogr/data/parquet/poly_wkb_large_binary.parquet")
class : SpatVector
geometry : polygons
dimensions : 10, 3 (geometries, attributes)
extent : 478315.5, 481645.3, 4762880, 4765610 (xmin, xmax, ymin, ymax)
source : poly_wkb_large_binary.parquet
coord. ref. : OSGB36 / British National Grid (EPSG:27700)
names : AREA EAS_ID PRFEDEA
type : <num> <int> <chr>
values : 2.152e+05 168 35043411
2.473e+05 179 35043423
2.618e+05 171 35043414
vect("/vsicurl/https://github.com/opengeospatial/geoparquet/raw/main/examples/example.parquet")
class : SpatVector
geometry : polygons
dimensions : 5, 5 (geometries, attributes)
extent : -180, 180, -18.28799, 83.23324 (xmin, xmax, ymin, ymax)
source : example.parquet
coord. ref. : lon/lat WGS 84 (CRS84) (OGC:CRS84)
names : pop_est continent name iso_a3 gdp_md_est
type : <num> <chr> <chr> <chr> <int>
values : 8.9e+05 Oceania Fiji FJI 5496
5.801e+07 Africa Tanzania TZA 63177
6.033e+05 Africa W. Sahara ESH 907
check your version and drivers with
terra::gdal()
terra::gdal(drivers = TRUE)
https://gdal.org/drivers/vector/parquet.html
On ubuntu use the unstable ppa to get 3.6 (though still that might not have the reqs)
https://launchpad.net/~ubuntugis/+archive/ubuntu/ubuntugis-unstable
On Windows, R 4.3.2 includes GDAL 3.7.2, but it doesn't seem to support parquet (driver not compiled).
Edit: You are right, it's 3.7.2 now, not 3.6.2.
let's get onto that then, Roger has retired 🙏
(my windows has 3.7.2 I think, but I don't pay any attention to that os anymore except to check cran versions )
@kalibera, could you please consider adding this driver to GDAL in R? But maybe this won't be necessary if @paleolimbot completes the {geoarrow}
package?
I asked Tomas a week ago and he wrote:
Apache Arrow does not have a build configuration in MXE, so someone would have to create that. I can try, but if you could get someone else to do that, it might happen sooner. The same applies to Apache Parquet if that is needed separately.
A build configuration for MXE is on our radar for the Arrow R package, too, but because the Arrow R package requires so many of Arrow's optional features (far beyond just Parquet), it's likely to be quite a bit of work. I think using the Parquet reader in the Arrow R package is probably the only reasonable approach for the foreseeable future.
From terra's side, the way to maximize performance in both cases would be to implement conversion from a WKB represented by nanoarrow_array
into whatever representation is used for geometry by SpatVect
.
Terra support for reading in .parquet vector files would be most welcome. Currently I am using sfarrow package to read into R then converting to spatvect.