pola-rs / r-polars

Bring polars to R
https://pola-rs.github.io/r-polars/

`arrow` like "R source package with Rust library binary" installation #408

Closed eitsupi closed 7 months ago

eitsupi commented 8 months ago

The arrow package will attempt to download a pre-built binary from the Internet if NOT_CRAN=true. https://arrow.apache.org/docs/11.0/r/articles/install.html#r-source-package-with-libarrow-binary

This method also has the advantage that many features disabled by default (i.e., on CRAN) during source installation are enabled in the pre-built binaries.

I thought the same thing could be done here and have recently added that functionality to prqlr (eitsupi/prqlr#195). The arm64 architecture can also be supported thanks to cargo's excellent cross-compilation, and the glibc version should not matter if we select the musl target. (e.g. #86)

Unlike the binary installation via GitHub releases that this repository currently offers, this approach works no matter where the source package is installed from. For example:

Sys.setenv(NOT_CRAN="true")
remotes::install_github("eitsupi/prqlr")

Sys.setenv(NOT_CRAN="true")
install.packages("prqlr", repos = "https://eitsupi.r-universe.dev/") # Of course, Windows and macOS do binary installations

For example, the installation on arm64 Linux from R-universe is as follows:

$ docker run --rm -it rocker/r-ver bash
root@3e5a9638fea4:/# R

R version 4.3.1 (2023-06-16) -- "Beagle Scouts"
Copyright (C) 2023 The R Foundation for Statistical Computing
Platform: aarch64-unknown-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> Sys.setenv(NOT_CRAN="true")
> install.packages("prqlr", repos = "https://eitsupi.r-universe.dev/")
Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)
trying URL 'https://eitsupi.r-universe.dev/src/contrib/prqlr_0.5.3.9000.tar.gz'
Content type 'application/x-gzip' length 191638 bytes (187 KB)
==================================================
downloaded 187 KB

* installing *source* package ‘prqlr’ ...
** using staged installation

--------------- [ TRY TO DOWNLOAD PRE-BUILT BINARY ] ---------------
Found pre-built binary at <https://github.com/eitsupi/prqlr/releases/download/lib-v0.9.0/libprqlr-0.9.0-aarch64-unknown-linux-musl.tar.gz>.
Downloading...
Checking SHA256 for </tmp/RtmpQpUMZg/filed6477f9eb5.tar.gz>...
SHA256 matches for </tmp/RtmpQpUMZg/filed6477f9eb5.tar.gz>.
Extracted pre-built binary to </tmp/RtmpgWnDdc/R.INSTALL1b54f3616/prqlr/tools> directory.
--------------------------------------------------------------------

---------------------- [LIBRARY BINARY FOUND] ----------------------
The library was found at <tools/libprqlr.a>. No need to build it.
--------------------------------------------------------------------

** libs
using C compiler: ‘gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0’
rm -Rf "prqlr.so" "/tmp/RtmpgWnDdc/R.INSTALL1b54f3616/prqlr/src/rust/target//release/libprqlr.a" "entrypoint.o"
gcc -I"/usr/local/lib/R/include" -DNDEBUG   -I/usr/local/include    -fPIC  -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g  -c entrypoint.c -o entrypoint.o
if [ -f "/tmp/RtmpgWnDdc/R.INSTALL1b54f3616/prqlr/src/../tools/libprqlr.a" ]; then \
        mkdir -p "/tmp/RtmpgWnDdc/R.INSTALL1b54f3616/prqlr/src/rust/target//release" ; \
        mv "/tmp/RtmpgWnDdc/R.INSTALL1b54f3616/prqlr/src/../tools/libprqlr.a" "/tmp/RtmpgWnDdc/R.INSTALL1b54f3616/prqlr/src/rust/target//release/libprqlr.a" ; \
        exit 0; \
fi && \
if [ -f "/tmp/RtmpgWnDdc/R.INSTALL1b54f3616/prqlr/src/rust/vendor.tar.xz" ]; then \
        mkdir -p "/tmp/RtmpgWnDdc/R.INSTALL1b54f3616/prqlr/src/rust/vendor" && \
        /usr/bin/tar --extract --xz --file "/tmp/RtmpgWnDdc/R.INSTALL1b54f3616/prqlr/src/rust/vendor.tar.xz" -C "/tmp/RtmpgWnDdc/R.INSTALL1b54f3616/prqlr/src/rust/vendor" && \
        mkdir -p "/tmp/RtmpgWnDdc/R.INSTALL1b54f3616/prqlr/src/.cargo" && \
        cp "/tmp/RtmpgWnDdc/R.INSTALL1b54f3616/prqlr/src/rust/vendor-config.toml" "/tmp/RtmpgWnDdc/R.INSTALL1b54f3616/prqlr/src/.cargo/config.toml"; \
fi && \
if [ "true" != "true" ]; then \
        export CARGO_HOME="/tmp/RtmpgWnDdc/R.INSTALL1b54f3616/prqlr/src/.cargo"; \
        export CARGO_BUILD_JOBS=2; \
fi && \
        export PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/root/.cargo/bin" && \
        cargo build --lib --manifest-path="/tmp/RtmpgWnDdc/R.INSTALL1b54f3616/prqlr/src/rust/Cargo.toml" --target-dir "/tmp/RtmpgWnDdc/R.INSTALL1b54f3616/prqlr/src/rust/target" --target="" \
                --profile="release" --features=""
if [ "true" != "true" ]; then \
        rm -Rf "/tmp/RtmpgWnDdc/R.INSTALL1b54f3616/prqlr/src/.cargo" "/tmp/RtmpgWnDdc/R.INSTALL1b54f3616/prqlr/src/rust/vendor" "/tmp/RtmpgWnDdc/R.INSTALL1b54f3616/prqlr/src/rust/target//release/build"; \
fi
gcc -shared -L/usr/local/lib/R/lib -L/usr/local/lib -o prqlr.so entrypoint.o -L/tmp/RtmpgWnDdc/R.INSTALL1b54f3616/prqlr/src/rust/target//release -lprqlr -L/usr/local/lib/R/lib -lR
installing to /usr/local/lib/R/site-library/00LOCK-prqlr/00new/prqlr/libs
** R
** inst
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
*** copying figures
** building package indices
** installing vignettes
** testing if installed package can be loaded from temporary location
** checking absolute paths in shared objects and dynamic libraries
** testing if installed package can be loaded from final location
** testing if installed package keeps a record of temporary installation path
* DONE (prqlr)

The downloaded source packages are in
        ‘/tmp/RtmpohJEuV/downloaded_packages’

I would like to bring the same mechanism here. What do you think? The administrative disadvantage is the need for additional library versioning and release work. (We must not forget to bump the version in Cargo.toml when making changes on the Rust side.)
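As a rough illustration of the lookup step, here is a minimal shell sketch of how a configure-style script could pick a download URL. The helper names `target_triple` and `binary_url` are hypothetical; only the URL pattern is taken from the install log above, and the real prqlr configure script may work differently.

```shell
#!/bin/sh
# Hypothetical helper: map `uname -m` output on Linux to a Rust musl target triple.
target_triple() {
  case "$1" in
    x86_64 | amd64) echo "x86_64-unknown-linux-musl" ;;
    aarch64 | arm64) echo "aarch64-unknown-linux-musl" ;;
    *) return 1 ;;   # unknown architecture: no pre-built binary
  esac
}

# Hypothetical helper: build the release URL following the pattern seen in the log.
# Arguments: library name, library version, target triple.
binary_url() {
  echo "https://github.com/eitsupi/prqlr/releases/download/lib-v$2/$1-$2-$3.tar.gz"
}

# Only look for a binary when NOT_CRAN=true and the architecture is known.
if [ "${NOT_CRAN:-}" = "true" ] && triple=$(target_triple "$(uname -m)"); then
  echo "would try: $(binary_url libprqlr 0.9.0 "$triple")"
else
  echo "would build the Rust library from source"
fi
```

Keeping the library version in the URL is what ties a given R source package to a matching library build, which is why the Cargo.toml version has to be bumped on every Rust-side change.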

sorhawell commented 8 months ago

With the GitHub release binaries, a Windows/macOS/Linux x64 user can install without Rtools/make/gcc/clang, and installation is faster.

Otherwise this looks interesting. We could add such binaries to the GitHub releases and hardcode a link to the appropriate versioned binary inside each R source package to avoid version mismatches. Edit: it seems this is what was done here as well.

We risk that it fails at some point in the future if the URL changes.
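Pinning a checksum next to the hardcoded link guards against a mismatched or replaced file, as the "Checking SHA256" lines in the log above suggest. A minimal sketch of such a check, with hypothetical helper names `sha256_of` and `verify_sha256`, could look like this:

```shell
#!/bin/sh
# Print the SHA256 digest of a file, using whichever common tool is available.
sha256_of() {
  if command -v sha256sum >/dev/null 2>&1; then
    sha256sum "$1" | cut -d ' ' -f 1
  else
    shasum -a 256 "$1" | cut -d ' ' -f 1
  fi
}

# Succeed only if the file's digest equals the pinned value.
verify_sha256() {
  [ "$(sha256_of "$1")" = "$2" ]
}

# Demonstration on a throwaway file (a stand-in for the downloaded tarball).
tmp=$(mktemp)
printf 'hello\n' > "$tmp"
pinned=$(sha256_of "$tmp")   # in a real package this value would be hardcoded
if verify_sha256 "$tmp" "$pinned"; then
  echo "SHA256 matches"
else
  echo "SHA256 mismatch, falling back to the source build"
fi
rm -f "$tmp"
```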

sorhawell commented 8 months ago

So, as far as I understand, this solution is closely related to our current cross-compilation, in that the Makevars will download the Rust binary, right? It still requires some Make and a compiler and linker to wrap up the package, right?

Would it be possible to write a portable configure / configure.win which literally starts a new R install.packages call and installs the GitHub binary package if a compatible file is found for the user's machine? After that installation, the outer installation would be aborted silently. Then no Make + compiler + linker would be needed for the releases that are not cross-compiled.

sorhawell commented 8 months ago

It is possible to start a direct binary installation from configure, and the binary package actually installs. I use exit 1 to stop the rest of the normal compilation/installation, but this triggers R to clean up the installed package. If there were a way to make R gracefully skip the remaining installation, that could be very cool, I think.

> remotes::install_github("pola-rs/r-polars",ref = "configure_binary_install")
Downloading GitHub repo pola-rs/r-polars@configure_binary_install
── R CMD build ────────────────────────────────────────────────────────────────────────────────────────────────────────
✔  checking for file ‘/private/var/folders/v1/b2c26lpn2yjd997jg_gn4fgc0000gn/T/Rtmp5qh4qb/remotes97f87a372e00/pola-rs-r-polars-f8d88b8/DESCRIPTION’ ...
─  preparing ‘polars’:
✔  checking DESCRIPTION meta-information ...
─  cleaning src
─  checking for LF line-endings in source and make files and shell scripts (876ms)
─  checking for empty or unneeded directories
─  building ‘polars_0.8.1.9000.tar.gz’

* installing *source* package ‘polars’ ...
** using staged installation
[1] "trying to install directly binary package!!!!"
[1] "installing directly from download binary package"
Installing package into ‘/Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/library/00LOCK-polars/00new’
(as ‘lib’ is unspecified)
trying URL 'https://github.com/pola-rs/r-polars/releases/latest/download/polars__x86_64-apple-darwin20.tgz'
Content type 'application/octet-stream' length 18172308 bytes (17.3 MB)
==================================================
downloaded 17.3 MB

=== Hi there, throwing an error here on purpose here ===
ERROR: configuration failed for package ‘polars’
* removing ‘/Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/library/polars’
* restoring previous ‘/Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/library/polars’
Warning message:
In i.p(...) :
  installation of package ‘/var/folders/v1/b2c26lpn2yjd997jg_gn4fgc0000gn/T//Rtmp5qh4qb/file97f83cad40af/polars_0.8.1.9000.tar.gz’ had non-zero exit status

eitsupi commented 8 months ago

We risk that it fails at some point in the future if the URL changes.

I think the most worrisome thing about this is that we depend on a GitHub fork of extendr. If the repository or the branch disappears, r-polars can no longer be built.

https://github.com/pola-rs/r-polars/blob/68ca4414505c7eb2bd151e8b04b787b4421b3ba8/src/rust/Cargo.toml#L37-L40

If we can't find the binary library, it's not a big problem because you just fall back to the source build.
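The fallback can be sketched in a few lines of shell. The function name `obtain_library` is hypothetical, and the fetch command is passed in as arguments so that the sketch runs without network access:

```shell
#!/bin/sh
# Run the given fetch command and report which path the installation would take.
obtain_library() {
  if "$@" >/dev/null 2>&1; then
    echo "prebuilt"   # fetch succeeded: link against the downloaded library
  else
    echo "source"     # fetch failed: let cargo build the library as usual
  fi
}

# A real configure script might pass something like:
#   obtain_library curl -fsSL -o tools/lib.tar.gz "$url"
obtain_library true
obtain_library false
```

Because a failed download only changes which branch is taken, a missing or moved binary slows the installation down but does not break it.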

It still requires some Make and a compiler and linker to wrap up the package, right?

Of course it does. So if we want a true binary installation, we need something else.

sorhawell commented 8 months ago

We own the rpolars organization where the extendr fork lives, so I don't see why we would delete it. We could add branch protection to the critical repositories and branches. I guess it could be about as secure as r-polars.

That said I think I would also personally maintain some forks just to be sure.

Historically, I think, it was not ideal to depend directly on extendr, because its scope and aims were just not the same as ours. We might critically need an extra feature or want to cherry-pick some new changes. It is of course not ideal to stray too far away from extendr either.

sorhawell commented 8 months ago

Maybe it would be possible to set up a fairly cheap CRAN-like R package repository which redirects downloads to GitHub releases.

What is the protocol for client interactions with a package repository? I wonder if a github.io page could act as R repository and redirect to GitHub releases.

https://blog.sellorm.com/2019/03/29/lifting-the-lid-on-cran/ https://blog.sellorm.com/2019/03/30/build-your-own-cran-like-repo/
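On the client side, the protocol is mostly a fixed directory layout: R fetches a package index named PACKAGES (normally generated with `tools::write_PACKAGES()`) from a well-known sub-path of the repository URL and then downloads the listed files from the same directory. A small sketch of those paths, where `contrib_dir` is a hypothetical helper and the repository URL is a placeholder:

```shell
#!/bin/sh
# Print the directory (relative to the repository root) where R looks for the
# package index, given a package type and, for binaries, the R x.y version.
contrib_dir() {
  case "$1" in
    source) echo "src/contrib" ;;
    win.binary) echo "bin/windows/contrib/$2" ;;
    *) return 1 ;;   # other types are left out of this sketch
  esac
}

repo="https://example.github.io/repo"   # hypothetical GitHub Pages URL
echo "$repo/$(contrib_dir source)/PACKAGES"
echo "$repo/$(contrib_dir win.binary 4.3)/PACKAGES"
```

Any static file host that serves this layout can act as a repository, which is why a github.io page is a plausible candidate.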

eitsupi commented 8 months ago

I wonder if a github.io page could act as R repository

I think this repo for the pak package did that. https://github.com/r-lib/r-lib.github.io

sorhawell commented 8 months ago

Nice. I'm also impressed by your background knowledge :)

eitsupi commented 8 months ago

I haven't used it so I couldn't comment right away, but it seems that we can host binary packages on GitHub Pages using the drat package by @eddelbuettel. https://eddelbuettel.github.io/drat/vignettes/dratstepbystep/

Perhaps it's better to deploy binary packages for amd64 macOS and Windows with drat, and recommend source + binary library installation for other platforms? (pak's binary releases are done by a dedicated function with the repository name hard-coded inside pak, so it seems difficult to reuse.)

eddelbuettel commented 8 months ago

I wonder if a github.io page could act as R repository

Yes.

That is one of the key ideas behind drat.

Also note that r-universe is very similar in providing per-user repositories. Like r-universe, drat can host binaries and source, but unlike r-universe it will not build them for you.

eitsupi commented 7 months ago

I'm wondering if we could use pre-built binaries on R-universe, and it seems to be possible by detecting the environment variable MY_UNIVERSE. (https://github.com/r-universe-org/help/issues/75#issuecomment-1750197115) Perhaps we can distribute SIMD-enabled binary packages via R-universe by configuring them to use pre-built binaries when this is detected.

Of course drat is attractive, but I don't see much benefit in continuing to distribute binaries with SIMD disabled in R-universe, so perhaps it would be worth incorporating an R-universe-specific configuration?

eddelbuettel commented 7 months ago

Those would be questions for Jeroen. I am not sure how much you can influence what/how he builds. And given how much he builds reliably it is compelling.

eitsupi commented 7 months ago

Those would be questions for Jeroen.

This is simply a matter of whether scripts such as configure in the polars package detect the environment variable MY_UNIVERSE. It is the same as prqlr detecting NOT_CRAN and downloading the binary. https://github.com/eitsupi/prqlr/blob/67aed8cb89997486c991ea76e337654381cd7635/configure#L36-L66
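The check itself is tiny. A minimal sketch, assuming a hypothetical function name `use_prebuilt` and treating any non-empty MY_UNIVERSE as "running on R-universe":

```shell
#!/bin/sh
# Decide whether the configure step may download a pre-built library.
use_prebuilt() {
  [ "${NOT_CRAN:-}" = "true" ] || [ -n "${MY_UNIVERSE:-}" ]
}

if use_prebuilt; then
  echo "download allowed"
else
  echo "source build only"
fi
```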

The R-universe builder does not complain about what R packages download, so it is even possible to download the Rust nightly toolchain, for example. (The reason we don't do that now is because we didn't know how to tell if it was on the R-universe or not.)

eddelbuettel commented 7 months ago

Sure, conditioning on MY_UNIVERSE is easy enough and I do so in a package. I meant this more as 'if you need details or want to clarify, he is the one to ask', and 'how much you can influence' in the sense that we do not get to modify his YAML files. But fully agreed that configure is a valid package-side hook.

eitsupi commented 7 months ago

I thought the same thing could be done here and have recently added that functionality to prqlr (eitsupi/prqlr#195). The arm64 architecture can also be supported thanks to cargo's excellent cross-compilation, and the glibc version should not matter if we select the musl target. (e.g. #86)

When I tried this in #435, some tests failed when the Rust library was built with the musl target. https://github.com/pola-rs/r-polars/actions/runs/6598220293/job/17926536231?pr=435#step:12:144

Details

```log
Running examples in ‘polars-Ex.R’ failed
The error most likely occurred in:

> base::assign(".ptime", proc.time(), pos = "CheckExEnv")
> ### Name: Expr_apply
> ### Title: Expr_apply
> ### Aliases: Expr_apply
> ### Keywords: Expr
>
> ### ** Examples
>
> # apply over groups - normal usage
> # s is a series of all values for one column within group, here Species
> e_all = pl$all() # perform groupby agg on all columns otherwise e.g. pl$col("Sepal.Length")
> e_sum = e_all$apply(\(s) sum(s$to_r()))$suffix("_sum")
> e_head = e_all$apply(\(s) head(s$to_r(), 2))$suffix("_head")
> pl$DataFrame(iris)$group_by("Species")$agg(e_sum, e_head)
shape: (3, 9)
┌───────────┬───────────┬───────────┬───────────┬───┬───────────┬───────────┬───────────┬──────────┐
│ Species   ┆ Sepal.Len ┆ Sepal.Wid ┆ Petal.Len ┆ … ┆ Sepal.Len ┆ Sepal.Wid ┆ Petal.Len ┆ Petal.Wi │
│ ---       ┆ gth_sum   ┆ th_sum    ┆ gth_sum   ┆   ┆ gth_head  ┆ th_head   ┆ gth_head  ┆ dth_head │
│ cat       ┆ ---       ┆ ---       ┆ ---       ┆   ┆ ---       ┆ ---       ┆ ---       ┆ ---      │
│           ┆ f64       ┆ f64       ┆ f64       ┆   ┆ list[f64] ┆ list[f64] ┆ list[f64] ┆ list[f64 │
│           ┆           ┆           ┆           ┆   ┆           ┆           ┆           ┆ ]        │
╞═══════════╪═══════════╪═══════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪══════════╡
│ versicolo ┆ 296.8     ┆ 138.5     ┆ 213.0     ┆ … ┆ [7.0,     ┆ [3.2,     ┆ [4.7,     ┆ [1.4,    │
│ r         ┆           ┆           ┆           ┆   ┆ 6.4]      ┆ 3.2]      ┆ 4.5]      ┆ 1.5]     │
│ setosa    ┆ 250.3     ┆ 171.4     ┆ 73.1      ┆ … ┆ [5.1,     ┆ [3.5,     ┆ [1.4,     ┆ [0.2,    │
│           ┆           ┆           ┆           ┆   ┆ 4.9]      ┆ 3.0]      ┆ 1.4]      ┆ 0.2]     │
│ virginica ┆ 329.4     ┆ 148.7     ┆ 277.6     ┆ … ┆ [6.3,     ┆ [3.3,     ┆ [6.0,     ┆ [2.5,    │
│           ┆           ┆           ┆           ┆   ┆ 5.8]      ┆ 2.7]      ┆ 5.1]      ┆ 1.9]     │
└───────────┴───────────┴───────────┴───────────┴───┴───────────┴───────────┴───────────┴──────────┘
>
>
> # apply over single values (should be avoided as it takes ~2.5us overhead + R function exec time
> # on a 2015 MacBook Pro) x is an R scalar
>
> # perform on all Float64 columns, using pl$all requires user function can handle any input type
> e_all = pl$col(pl$dtypes$Float64)
> e_add10 = e_all$apply(\(x) {
+   x + 10
+ })$suffix("_sum")
> # quite silly index into alphabet(letters) by ceil of float value
> # must set return_type as not the same as input
> e_letter = e_all$apply(\(x) letters[ceiling(x)], return_type = pl$dtypes$Utf8)$suffix("_letter")
> pl$DataFrame(iris)$select(e_add10, e_letter)
shape: (150, 8)
┌────────────┬────────────┬────────────┬───────────┬───────────┬───────────┬───────────┬───────────┐
│ Sepal.Leng ┆ Sepal.Widt ┆ Petal.Leng ┆ Petal.Wid ┆ Sepal.Len ┆ Sepal.Wid ┆ Petal.Len ┆ Petal.Wid │
│ th_sum     ┆ h_sum      ┆ th_sum     ┆ th_sum    ┆ gth_lette ┆ th_letter ┆ gth_lette ┆ th_letter │
│ ---        ┆ ---        ┆ ---        ┆ ---       ┆ r         ┆ ---       ┆ r         ┆ ---       │
│ f64        ┆ f64        ┆ f64        ┆ f64       ┆ ---       ┆ str       ┆ ---       ┆ str       │
│            ┆            ┆            ┆           ┆ str       ┆           ┆ str       ┆           │
╞════════════╪════════════╪════════════╪═══════════╪═══════════╪═══════════╪═══════════╪═══════════╡
│ 15.1       ┆ 13.5       ┆ 11.4       ┆ 10.2      ┆ f         ┆ d         ┆ b         ┆ a         │
│ 14.9       ┆ 13.0       ┆ 11.4       ┆ 10.2      ┆ e         ┆ c         ┆ b         ┆ a         │
│ 14.7       ┆ 13.2       ┆ 11.3       ┆ 10.2      ┆ e         ┆ d         ┆ b         ┆ a         │
│ 14.6       ┆ 13.1       ┆ 11.5       ┆ 10.2      ┆ e         ┆ d         ┆ b         ┆ a         │
│ …          ┆ …          ┆ …          ┆ …         ┆ …         ┆ …         ┆ …         ┆ …         │
│ 16.3       ┆ 12.5       ┆ 15.0       ┆ 11.9      ┆ g         ┆ c         ┆ e         ┆ b         │
│ 16.5       ┆ 13.0       ┆ 15.2       ┆ 12.0      ┆ g         ┆ c         ┆ f         ┆ b         │
│ 16.2       ┆ 13.4       ┆ 15.4       ┆ 12.3      ┆ g         ┆ d         ┆ f         ┆ c         │
│ 15.9       ┆ 13.0       ┆ 15.1       ┆ 11.8      ┆ f         ┆ c         ┆ f         ┆ b         │
└────────────┴────────────┴────────────┴───────────┴───────────┴───────────┴───────────┴───────────┘
>
>
> ## timing "slow" apply in select /with_columns context, this makes apply
> n = 1000000L
> set.seed(1)
> df = pl$DataFrame(list(
+   a = 1:n,
+   b = sample(letters, n, replace = TRUE)
+ ))
>
> print("apply over 1 million values takes ~2.5 sec on 2015 MacBook Pro")
[1] "apply over 1 million values takes ~2.5 sec on 2015 MacBook Pro"
> system.time({
+   rdf = df$with_columns(
+     pl$col("a")$apply(\(x) {
+       x * 2L
+     })$alias("bob")
+   )
+ })
   user  system elapsed
  2.530   0.012   2.543
>
> print("R lapply 1 million values take ~1sec on 2015 MacBook Pro")
[1] "R lapply 1 million values take ~1sec on 2015 MacBook Pro"
> system.time({
+   lapply(df$get_column("a")$to_r(), \(x) x * 2L)
+ })
   user  system elapsed
  1.333   0.056   1.392
> print("using polars syntax takes ~1ms")
[1] "using polars syntax takes ~1ms"
> system.time({
+   (df$get_column("a") * 2L)
+ })
   user  system elapsed
  0.003   0.000   0.002
>
>
> print("using R vector syntax takes ~4ms")
[1] "using R vector syntax takes ~4ms"
> r_vec = df$get_column("a")$to_r()
> system.time({
+   r_vec * 2L
+ })
   user  system elapsed
  0.005   0.000   0.006
>
> #' #R parallel process example, use Sys.sleep() to imitate some CPU expensive computation.
>
> # use apply over each Species-group in each column equal to 12 sequential runs ~1.2 sec.
> pl$LazyFrame(iris)$group_by("Species")$agg(
+   pl$all()$apply(\(s) {
+     Sys.sleep(.1)
+     s$sum()
+   })
+ )$collect() |> system.time()
   user  system elapsed
  0.027   0.000   1.221
>
> # map in parallel 1: Overhead to start up extra R processes / sessions
> pl$set_options(rpool_cap = 0) # drop any previous processes, just to show start-up overhead here
> pl$set_options(rpool_cap = 4) # set back to 4, the default
> pl$options$rpool_cap
[1] 4
> pl$LazyFrame(iris)$group_by("Species")$agg(
+   pl$all()$apply(\(s) {
+     Sys.sleep(.1)
+     s$sum()
+   }, in_background = TRUE)
+ )$collect() |> system.time()
Error: Error: Execution halted with the following contexts
   0: In R: in $collect():
   0: During function call [system.time(pl$LazyFrame(iris)$group_by("Species")$agg(pl$all()$apply(function(s) { Sys.sleep(0.1) s$sum() }, in_background = TRUE))$collect())]
   1: When waiting for the background R process to establish a job channel
   2: Io(Custom { kind: ConnectionReset, error: ChannelClosed })
Timing stopped at: 0.007 0 0.688
Execution halted
* checking for unstated dependencies in ‘tests’ ... OK
* checking tests ...
  Running ‘testthat.R’ [29s/30s]
 [30s/30s] ERROR
Running the tests in ‘tests/testthat.R’ failed.
Last 13 lines of output:
  Error: Execution halted with the following contexts
     0: In R: in $collect():
     0: During function call [test_check("polars")]
     1: When waiting for the background R process to establish a job channel
     2: Io(Custom { kind: ConnectionReset, error: ChannelClosed })
  Backtrace:
      ▆
   1. └─pl$LazyFrame()$select(pl$lit(tmpf)$map(f_ipc_to_s, in_background = TRUE))$collect() at test-sink_stream.R:43:2
   2.   └─polars:::unwrap(collect_f(lf), "in $collect():") at polars/R/lazyframe__lazy.R:404:2

  [ FAIL 3 | WARN 0 | SKIP 1 | PASS 1524 ]
  Error: Test failures
  Execution halted
```

So I think it's reasonable to lower the Ubuntu version and use the gnu target.
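With the gnu target, the pre-built library only works on hosts whose glibc is at least as new as the one on the build image, which is why an older Ubuntu image helps. A minimal sketch of a compatibility check follows; `version_ge` is a hypothetical helper, the minimum of 2.31 is an illustrative assumption rather than a value from this thread, and `sort -V` is a GNU extension that may be missing on other systems.

```shell
#!/bin/sh
# Succeed if version $1 is at least version $2 (dotted numbers, e.g. 2.35).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

required="2.31"   # hypothetical: glibc version of the build image
# `getconf GNU_LIBC_VERSION` prints e.g. "glibc 2.35" on glibc systems and
# fails elsewhere (musl, macOS), which leaves $host empty.
host=$(getconf GNU_LIBC_VERSION 2>/dev/null | cut -d ' ' -f 2)

if [ -n "$host" ] && version_ge "$host" "$required"; then
  echo "host glibc $host is new enough for the gnu binary"
else
  echo "fall back to the source build"
fi
```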