paleolimbot / narrow

An R interface to the 'Apache Arrow' C API
https://paleolimbot.github.io/narrow/

Fedora 36 gcc 12 problems #7

Open rsbivand opened 2 years ago

rsbivand commented 2 years ago

narrow-check.zip shows what happens when CMD check is run on Fedora 36 with gcc 12.1.1, using the Fedora libarrow etc. binaries. The same goes for geoarrow: geoarrow-check.zip

The GDAL 3.5.0 Arrow and Parquet drivers read the GDAL autotest/ogr/data examples OK. My reason for raising a diffuse issue is https://github.com/wcjochem/sfarrow/issues/14, a new failing revdep check on F36/gcc12 with Fedora's arrow RPMs (previously there were no external arrow libraries at all). Since you are obviously looking at checking the GDAL drivers, I think it might be worth seeing how to cross-check sf with GDAL 3.5.0 and the two experimental drivers, geoarrow and sfarrow (and others if they appear, maybe Python and reticulate). What do you think (@edzer for reference, this format feels promising)?

paleolimbot commented 2 years ago

As always, thank you for running these checks!

The short-term fix: any check where "dataset" is required needs to test isTRUE(arrow::arrow_info()$capabilities["dataset"]) before trying to do anything dataset-related.
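A minimal sketch of that guard, run from a shell before starting a check. It assumes Rscript is on the PATH. The helper name has_arrow_dataset and its exit codes are hypothetical; only the R expression comes from the sentence above.

```shell
# Hypothetical helper: succeed (exit 0) only when the installed arrow R
# package reports the "dataset" capability.
has_arrow_dataset() {
  # Exit code 2 (an arbitrary choice for this sketch) means Rscript was not found.
  command -v Rscript >/dev/null 2>&1 || return 2
  # The R expression is the check quoted above. R exits 0 when it is TRUE and
  # 1 otherwise, including when the arrow package is not installed at all.
  Rscript -e 'quit(status = if (isTRUE(arrow::arrow_info()$capabilities["dataset"])) 0L else 1L)' \
    >/dev/null 2>&1
}

if has_arrow_dataset; then
  echo "dataset capability available"
else
  echo "dataset capability missing: skip dataset-related checks"
fi
```

Inside the package's own R tests, the same expression could gate the dataset-dependent tests directly rather than going through a shell wrapper.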

In those results, the arrow package seems to be using system Arrow. I believe it's supposed to only do that if forced by the user, since a system build might not contain the capabilities that most arrow package users expect (e.g., read/write dataset and parquet). Homebrew also provides a version of Arrow for Mac that presents some problems when used with the arrow R package.

I have a build of Arrow and GDAL on Ubuntu Focal that I've been using to cross-check files (see 'details' for the cmake invocation I use to build it). The big difference between the files that geoarrow writes and the files that GDAL writes is that geoarrow also writes extension type information, which causes problems when all of geoarrow, arrow, and sf are loaded AND linking against the same libarrow.so (error below in 'details' in case you run into it). I'm hoping to PR a fix for that into GDAL because the extension type information is super useful when using non-geo-aware tools like arrow::open_dataset("some_dir_with_lots_of_parquet_files").

I've found it quite difficult to explain to folks how to get an Arrow-enabled GDAL build...it might be possible to build it into GDAL on Mac and Windows (there is an MSYS2 package and we could PR a recipe for building GDAL with Arrow into the mac builds repo). As you've seen here, the interaction between the libarrow.so used by the arrow R package and the libarrow.so used by GDAL has some oddities that need to be ironed out.

I'm also working on a post about why the Parquet format is really cool for vector data, although its advantages are hard to realize through the GDAL driver (which squeezes all file formats through the tiny rowwise hole that is OGR). That's why I'm putting so much time into geoarrow (because then we can read files really fast through the Arrow package, leveraging all the engineering around ALTREP and query engine capability that's been put in there). I'm playing with the default conversion from to ...right now it's just a zero-copy shell around the Arrow Array but could in theory be a conversion to sfc (see my internal monologue at https://github.com/paleolimbot/geoarrow/issues/21).

```bash
export ARROW_HOME=/home/dewey/.r-arrow-dev-build/dist
export ARROW_SRC=/home/dewey/Desktop/r/arrow
export ARROW_VIRTUALENV=/home/dewey/Desktop/r/pyarrow-dev

rm -rf $ARROW_HOME
mkdir -p $ARROW_HOME
rm -rf $ARROW_HOME/../build
mkdir $ARROW_HOME/../build
cd $ARROW_HOME/../build

cmake \
  -DCMAKE_INSTALL_PREFIX=$ARROW_HOME \
  -DCMAKE_BUILD_TYPE=debug \
  -DPython3_EXECUTABLE=$ARROW_VIRTUALENV/bin/python \
  -DARROW_BOOST_USE_SHARED=OFF \
  -DCMAKE_INSTALL_LIBDIR=lib \
  -DARROW_COMPUTE=ON \
  -DARROW_CSV=ON \
  -DARROW_DATASET=ON \
  -DARROW_EXTRA_ERROR_CONTEXT=ON \
  -DARROW_FILESYSTEM=ON \
  -DARROW_INSTALL_NAME_RPATH=OFF \
  -DARROW_JEMALLOC=ON \
  -DARROW_JSON=ON \
  -DARROW_PARQUET=ON \
  -DARROW_WITH_SNAPPY=ON \
  -DARROW_WITH_ZLIB=ON \
  -DARROW_MIMALLOC=ON \
  -DARROW_S3=ON \
  -DARROW_WITH_BROTLI=ON \
  -DARROW_WITH_BZ2=ON \
  -DARROW_WITH_LZ4=ON \
  -DARROW_WITH_SNAPPY=ON \
  -DARROW_WITH_ZSTD=ON \
  -DARROW_ENGINE=ON \
  -DARROW_PYTHON=ON \
  "$ARROW_SRC/cpp" &&
  make -j8 install &&
  echo "Arrow built and installed to ARROW_HOME='$ARROW_HOME'"
```

Warning messages:
1: In CPL_read_ogr(dsn, layer, query, as.character(options), quiet, :
  GDAL Message 1: Geometry column geometry has a type != fixed_size_list[2]>: extension. Handling it as a regular field
2: In CPL_read_ogr(dsn, layer, query, as.character(options), quiet, :
  GDAL Message 1: Field geometry of unhandled type geoarrow.linestring

rsbivand commented 2 years ago

I have a vague feeling that:

$ pkg-config --libs arrow-dataset
-larrow_dataset -lparquet -larrow 

is telling me something - the linked libraries in arrow are PKG_LIBS= -larrow.

I have:

> arrow::arrow_info()$capabilities
  dataset substrait   parquet      json        s3  utf8proc       re2    snappy 
    FALSE     FALSE     FALSE     FALSE     FALSE      TRUE      TRUE      TRUE 
     gzip    brotli      zstd       lz4 lz4_frame       lzo       bz2 
     TRUE      TRUE      TRUE      TRUE      TRUE     FALSE      TRUE 

Unfortunately, https://arrow.apache.org/docs/r/articles/install.html doesn't seem to address the f36 native RPM setting.

paleolimbot commented 2 years ago

Ah, the configure script uses the pkg-config name arrow instead of arrow-dataset. I'll open an issue in Arrow about this since it's a pretty easy way to get Arrow + R + GDAL up and running: https://github.com/apache/arrow/blob/master/r/configure#L29. We also apparently don't provide a libarrow binary for Fedora, so the NOT_CRAN=true ARROW_USE_PKG_CONFIG=false combination doesn't work either.
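To see the difference on a given system, the two pkg-config module names can be compared side by side. This is a small sketch, assuming pkg-config is installed; report_arrow_modules is a hypothetical helper name, not part of Arrow or its configure script.

```shell
# Hypothetical helper: print the linker flags pkg-config reports for the two
# module names discussed in this thread.
report_arrow_modules() {
  for module in arrow arrow-dataset; do
    if pkg-config --exists "$module" 2>/dev/null; then
      # --libs prints the -l/-L flags a build would link against.
      printf '%s: %s\n' "$module" "$(pkg-config --libs "$module")"
    else
      # Also reached when pkg-config itself is missing.
      printf '%s: not found by pkg-config\n' "$module"
    fi
  done
}

report_arrow_modules
```

On a setup like the Fedora 36 one reported earlier in the thread, the arrow-dataset line would be expected to list -larrow_dataset and -lparquet in addition to -larrow, which is what a build linking only against the arrow module misses.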

rsbivand commented 2 years ago

OK, thanks - please let me know if I can test something using the Fedora libarrow etc. binaries now available for now-released Fedora 36.