rsbivand opened this issue 2 years ago
As always, thank you for running these checks!
The short-term fix, for any check where the "dataset" capability is required, is to test isTRUE(arrow::arrow_info()$capabilities["dataset"]) before trying to do anything dataset-related.
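A minimal sketch of that guard (the message text and branch structure are illustrative, not from any particular package):

```r
# Sketch of the short-term fix described above: check the capability flag
# before running any dataset-related code, and skip gracefully otherwise.
has_dataset <- isTRUE(arrow::arrow_info()$capabilities["dataset"])

if (!has_dataset) {
  message("libarrow was built without dataset support; skipping dataset checks")
} else {
  # dataset-dependent code goes here, e.g.:
  # ds <- arrow::open_dataset("some_dir_with_lots_of_parquet_files")
}
```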
In those results, the arrow package appears to be using a system Arrow. I believe it's only supposed to do that if forced by the user, since a system build might not contain the capabilities that most arrow package users expect (e.g., dataset and Parquet read/write). Homebrew also provides an Arrow build for Mac that presents some problems when used with the arrow R package.
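As a hedged sketch, one way to avoid picking up a system Arrow at install time is to disable pkg-config discovery; NOT_CRAN and ARROW_USE_PKG_CONFIG are the environment variables discussed later in this thread:

```r
# Sketch: tell the arrow R package's configure script to ignore any system
# libarrow found via pkg-config and use its own bundled/downloaded build.
# (Whether a prebuilt binary exists depends on the platform; see below
# regarding Fedora.)
Sys.setenv(NOT_CRAN = "true", ARROW_USE_PKG_CONFIG = "false")
install.packages("arrow")
```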
I have a build of Arrow and GDAL on Ubuntu Focal that I've been using to cross-check files (see 'details' for the cmake invocation I use to build it). The big difference between the files that geoarrow writes and the files that GDAL writes is that geoarrow also writes extension type information, which causes problems when geoarrow, arrow, and sf are all loaded AND linking against the same libarrow.so (error below in 'details' in case you run into it). I'm hoping to PR a fix for that into GDAL, because the extension type information is super useful when reading with non-geo-aware tools like arrow::open_dataset("some_dir_with_lots_of_parquet_files").
I've found it quite difficult to explain to folks how to get an Arrow-enabled GDAL build... it might be possible to build Arrow support into GDAL on Mac and Windows (there is an MSYS2 package, and we could PR a recipe for building GDAL with Arrow into the mac builds repo). As you've seen here, the interaction between the libarrow.so used by the arrow R package and the libarrow.so used by GDAL has some oddities that need to be ironed out.
I'm also working on a post about why the Parquet format is really cool for vector data, although its advantages are hard to realize through the GDAL driver (which squeezes all file formats through the tiny rowwise hole that is OGR). That's why I'm putting so much time into geoarrow: it lets us read files really fast through the arrow package, leveraging all the engineering around ALTREP and query-engine capability that's been put in there. I'm playing with the default conversion from
I have a vague feeling that:

```
$ pkg-config --libs arrow-dataset
-larrow_dataset -lparquet -larrow
```

is telling me something: the linked libraries in arrow are `PKG_LIBS = -larrow`.
I have:

```
> arrow::arrow_info()$capabilities
  dataset substrait   parquet      json        s3  utf8proc       re2    snappy
    FALSE     FALSE     FALSE     FALSE     FALSE      TRUE      TRUE      TRUE
     gzip    brotli      zstd       lz4 lz4_frame       lzo       bz2
     TRUE      TRUE      TRUE      TRUE      TRUE     FALSE      TRUE
```
Unfortunately, https://arrow.apache.org/docs/r/articles/install.html doesn't seem to address the Fedora 36 native RPM setup.
Ah, the configure script uses the pkg-config name arrow instead of arrow-dataset. I'll open an issue in Arrow about this, since it's a pretty easy way to get Arrow + R + GDAL up and running: https://github.com/apache/arrow/blob/master/r/configure#L29. We also apparently don't provide a libarrow binary for Fedora, so the NOT_CRAN=true ARROW_USE_PKG_CONFIG=false combination doesn't work either.
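A quick way to compare what the two pkg-config names pull in, assuming pkg-config and the Fedora arrow RPMs are installed (an illustrative sketch, not the configure script's actual logic):

```r
# Sketch: query pkg-config from R for the two names discussed above.
# "arrow-dataset" pulls in -larrow_dataset -lparquet -larrow, whereas the
# name the configure script currently uses ("arrow") yields only -larrow,
# which would explain a build without dataset/parquet capabilities.
system2("pkg-config", c("--libs", "arrow"),         stdout = TRUE)
system2("pkg-config", c("--libs", "arrow-dataset"), stdout = TRUE)
```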
OK, thanks - please let me know if I can test something using the Fedora libarrow etc. binaries now available for the released Fedora 36.
narrow-check.zip shows what happens when R CMD check is run on Fedora 36 with gcc 12.1.1 and the Fedora libarrow etc. binaries. The same for geoarrow: geoarrow-check.zip
The GDAL 3.5.0 Arrow and Parquet drivers read the GDAL autotest/ogr/data examples OK. My reason for raising a diffuse issue is https://github.com/wcjochem/sfarrow/issues/14, a new failing revdep check on F36/gcc12 against Fedora's arrow RPMs (previously there were no external arrow libraries at all). Since you are obviously looking at checking the GDAL drivers, I think it might be worth seeing how to cross-check sf with GDAL 3.5.0 and the two experimental packages, geoarrow and sfarrow (and others if they appear - maybe Python via reticulate). What do you think (@edzer for reference, this format feels promising)?