ropensci / rix

Reproducible Data Science environments for R with Nix
https://docs.ropensci.org/rix/
GNU General Public License v3.0
151 stars 13 forks source link

bootstrap helper from renv.lock #5

Open philipp-baumann opened 1 year ago

philipp-baumann commented 1 year ago

Hey @b-rodrigues , I thought it would be nice for people to be able to transition from a {renv} env to a nix setup in individual projects. For this, a helper that simply parses JSON in renv.lock and grabs the packages for the rix::rix() r_pkgs arg would be straight-forward to implement. If useful, I am happy to draft a PR addressing it.

b-rodrigues commented 1 year ago

that would be an amazing addition please feel free to code it up 😁

philipp-baumann commented 1 year ago

maybe I will wait until you commit your recent drafts into new remote branch :-)

b-rodrigues commented 1 year ago

I think a challenge here is to first list only the required packages that are actually called by the user using library(xxx) and not all of xxx dependencies, because these would need to got into propagatedBuildInputs. Packages would need to get downloaded from the CRAN archives. We can use fetchzip() to do so: https://github.com/b-rodrigues/rix/blob/d505382ad8ef759a8257227a19043e228a6d70dd/R/find_rev.R#L142

If we have the list of packages that are actually called by the user, and pass that to fetchzip() (or rather, fetchzips() which handles multiple packages see: https://github.com/b-rodrigues/rix/blob/d505382ad8ef759a8257227a19043e228a6d70dd/R/find_rev.R#L203

then that should be simple enough I think.

philipp-baumann commented 1 year ago

I'll also have a look at how {renv} implements the snapshotting.

philipp-baumann commented 4 months ago

directionality: renv => nix

philipp-baumann commented 4 months ago

way of interfacing renv: via requireNamespace(), eventually have it in Suggests field.

RichardJActon commented 3 days ago

I implemented the 'dumbest' version of this in my fork for personal use generating default.nix files from projects with renv.lock files. For the couple of projects with big renv.lock files I've tried it out on I've had to delete any packages installed from github from the list as these, expectedly, cause build errors. Also the gt package which seems to be failing because of it's dependency of V8.

https://github.com/RichardJActon/rix/commit/706127425f10a38155cd1d9785641d2790b2f313

This may be something for it's own issue but here are the logs for the V8 build issue:

building '/nix/store/w4n5szzkapj921gfgf5yynijdvc2r15r-r-V8-5.0.0.drv'...
Running phase: unpackPhase
unpacking source archive /nix/store/92nvrcn6w8k2ylbrjzmd3lalwri0c4mv-V8_5.0.0.tar.gz
source root is V8
setting SOURCE_DATE_EPOCH to timestamp 1723808402 of file V8/MD5
Running phase: patchPhase
Running phase: updateAutotoolsGnuConfigScriptsPhase
Running phase: configurePhase
patching script interpreter paths in configure
Running phase: buildPhase
Running phase: checkPhase
Running phase: installPhase
* installing *source* package 'V8' ...
file 'configure' has the wrong MD5 checksum
** using staged installation
Found C++20 compiler: /nix/store/zznja5f8v3jafffyah1rk46vpfcn38dv-gcc-wrapper-13.3.0/bin/c++
Found INCLUDE_DIR and/or LIB_DIR!
Using CXXCPP=/nix/store/zznja5f8v3jafffyah1rk46vpfcn38dv-gcc-wrapper-13.3.0/bin/c++ -std=gnu++20 -E
Using PKG_CFLAGS=-I/nix/store/bsqha32xvv1j20bwkhr2wsnclk4qsi3a-nodejs-20.17.0-libv8/include -I/usr/include/v8
Using PKG_LIBS=-L/nix/store/bsqha32xvv1j20bwkhr2wsnclk4qsi3a-nodejs-20.17.0-libv8/lib -lv8
Running feature test for pointer compression...
/nix/store/81xsp348yfgmaan9r5055mcdjfw7a8wc-binutils-2.42/bin/ld: /build/ccDtqrCX.o: in function `main':
/build/V8/tools/test.cpp:7:(.text.startup+0x36): undefined reference to `v8::platform::NewDefaultPlatform(int, v8::platform::IdleTaskSupport, v8::platform::InProcessStackDumping, std::unique_ptr<v8::TracingController, std::default_delete<v8::TracingController> >)'
/nix/store/81xsp348yfgmaan9r5055mcdjfw7a8wc-binutils-2.42/bin/ld: /build/V8/tools/test.cpp:8:(.text.startup+0x48): undefined reference to `v8::V8::InitializePlatform(v8::Platform*)'
/nix/store/81xsp348yfgmaan9r5055mcdjfw7a8wc-binutils-2.42/bin/ld: /build/ccDtqrCX.o: in function `v8::V8::Initialize()':
/nix/store/bsqha32xvv1j20bwkhr2wsnclk4qsi3a-nodejs-20.17.0-libv8/include/v8-initialization.h:104:(.text.startup+0x4f): undefined reference to `v8::V8::Initialize(int)'
collect2: error: ld returned 1 exit status
Pointer compression not needed
/nix/store/81xsp348yfgmaan9r5055mcdjfw7a8wc-binutils-2.42/bin/ld: /build/ccHf8mQK.o: in function `main':
/build/V8/tools/version.cpp:5:(.text.startup+0x2): undefined reference to `v8::V8::GetVersion()'
collect2: error: ld returned 1 exit status
./configure: line 154: ./v8version: not found
** libs
using C++ compiler: 'g++ (GCC) 13.3.0'
using C++20
rm -f V8.so RcppExports.o bindings.o
/nix/store/zznja5f8v3jafffyah1rk46vpfcn38dv-gcc-wrapper-13.3.0/bin/c++ -std=gnu++20 -I"/nix/store/z70v7siag32vkpwapfzmj84nw6y43gh9-R-4.4.1/lib/R/include" -DNDEBUG -I/nix/store/bsqha32xvv1j20bwkhr2wsnclk4qsi3a-nodejs-20.17.0-libv8/include -I/usr/include/v8 -DV8_ENABLE_CHECKS -I'/nix/store/6a7vz17pibzimdjy4js648r5jvc7v5jl-r-Rcpp-1.0.13/library/Rcpp/include'    -fvisibility=hidden -fpic  -g -O2   -c RcppExports.cpp -o RcppExports.o
/nix/store/zznja5f8v3jafffyah1rk46vpfcn38dv-gcc-wrapper-13.3.0/bin/c++ -std=gnu++20 -I"/nix/store/z70v7siag32vkpwapfzmj84nw6y43gh9-R-4.4.1/lib/R/include" -DNDEBUG -I/nix/store/bsqha32xvv1j20bwkhr2wsnclk4qsi3a-nodejs-20.17.0-libv8/include -I/usr/include/v8 -DV8_ENABLE_CHECKS -I'/nix/store/6a7vz17pibzimdjy4js648r5jvc7v5jl-r-Rcpp-1.0.13/library/Rcpp/include'    -fvisibility=hidden -fpic  -g -O2   -c bindings.cpp -o bindings.o
/nix/store/zznja5f8v3jafffyah1rk46vpfcn38dv-gcc-wrapper-13.3.0/bin/c++ -std=gnu++20 -shared -L/nix/store/z70v7siag32vkpwapfzmj84nw6y43gh9-R-4.4.1/lib/R/lib -o V8.so RcppExports.o bindings.o -L/nix/store/bsqha32xvv1j20bwkhr2wsnclk4qsi3a-nodejs-20.17.0-libv8/lib -lv8 -L/nix/store/z70v7siag32vkpwapfzmj84nw6y43gh9-R-4.4.1/lib/R/lib -lR
installing to /nix/store/m5l59cnkq6042d6jf7bnph0jwmq5g068-r-V8-5.0.0/library/00LOCK-V8/00new/V8/libs
** R
** inst
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded from temporary location
Error: package or namespace load failed for 'V8' in dyn.load(file, DLLpath = DLLpath, ...):
 unable to load shared object '/nix/store/m5l59cnkq6042d6jf7bnph0jwmq5g068-r-V8-5.0.0/library/00LOCK-V8/00new/V8/libs/V8.so':
  /nix/store/m5l59cnkq6042d6jf7bnph0jwmq5g068-r-V8-5.0.0/library/00LOCK-V8/00new/V8/libs/V8.so: undefined symbol: _ZN2v815ArrayBufferView9CheckCastEPNS_5ValueE
Error: loading failed
Execution halted
ERROR: loading failed
* removing '/nix/store/m5l59cnkq6042d6jf7bnph0jwmq5g068-r-V8-5.0.0/library/V8'
error: builder for '/nix/store/w4n5szzkapj921gfgf5yynijdvc2r15r-r-V8-5.0.0.drv' failed with exit code 1;
       last 10 log lines:
       > ** building package indices
       > ** installing vignettes
       > ** testing if installed package can be loaded from temporary location
       > Error: package or namespace load failed for 'V8' in dyn.load(file, DLLpath = DLLpath, ...):
       >  unable to load shared object '/nix/store/m5l59cnkq6042d6jf7bnph0jwmq5g068-r-V8-5.0.0/library/00LOCK-V8/00new/V8/libs/V8.so':
       >   /nix/store/m5l59cnkq6042d6jf7bnph0jwmq5g068-r-V8-5.0.0/library/00LOCK-V8/00new/V8/libs/V8.so: undefined symbol: _ZN2v815ArrayBufferView9CheckCastEPNS_5ValueE
       > Error: loading failed
       > Execution halted
       > ERROR: loading failed
       > * removing '/nix/store/m5l59cnkq6042d6jf7bnph0jwmq5g068-r-V8-5.0.0/library/V8'
       For full logs, run 'nix log /nix/store/w4n5szzkapj921gfgf5yynijdvc2r15r-r-V8-5.0.0.drv'.
b-rodrigues commented 3 days ago

Hey that's pretty neat, I actually wanted to implement exactly this, just listing the packages from an renv.lock file without handling any versions, and call this the "fast" method (while handling github packages though). And then have an "accurate" method that would, ideally, deal with versions.

As for V8, thanks for letting us know, I confirm it's currently broken, and you don't seem to be the only one bitten by it: https://github.com/NixOS/nixpkgs/issues/339861

the fix will soon be merged to master, in the meantime you should be able to use the recommend commit from that thread as the r_ver from the call to rix().

RichardJActon commented 3 days ago

Thanks for the info on V8 @b-rodrigues. Happy clean it up a little and open a PR if you want the "fast" approach in upstream while people figure out the details of a more complete solution.

should be pretty easy to add a check for github remotes that removes them by default and throws a warning.

b-rodrigues commented 3 days ago

I’d be more than happy! But I don’t think we should remove the packages, but handle them instead. For example, here is the latest echarts4r from Github:

    "echarts4r": {
      "Package": "echarts4r",
      "Version": "0.4.5.9000",
      "Source": "GitHub",
      "RemoteType": "github",
      "RemoteHost": "api.github.com",
      "RemoteRepo": "echarts4r",
      "RemoteUsername": "JohnCoene",
      "RemoteRef": "HEAD",
      "RemoteSha": "063883bbb71584bdb651f20be6f2bb34d55c9f6a",
      "Requirements": [
        "R",
        "broom",
        "corrplot",
        "countrycode",
        "dplyr",
        "htmltools",
        "htmlwidgets",
        "jsonlite",
        "purrr",
        "rstudioapi",
        "scales",
        "shiny"
      ],
      "Hash": "7062f4dffb9a9b144c7f690cf2d214be"
    },

We need to provide "package_name", "repo_url" and "commit" to git_pkgs. Using "RemoteType": "github", "RemoteRepo": "echarts4r" and "RemoteUsername": "JohnCoene" we can build the repo_url, we can use the "RemoteSha" as the commit and "Package" as the package_name. We could then provide this as well.

RichardJActon commented 3 days ago

OK interesting I'm still a nix packaging noob - easy enough to extract that repo info from the renv.lock file but I'm not certain what to do with it after that...

I added a function to get the R version from the lock file and, for now, an option to filter out packages the source of which is not "Repository" to my fork.

b-rodrigues commented 3 days ago

You can pass github packages to rix() like so:

rix(
  r_ver = "4.2.1",
  r_pkgs = c("dplyr", "janitor"),
  git_pkgs = list(
    list(
      package_name = "housing",
      repo_url = "https://github.com/rap4all/housing/",
      commit = "1c860959310b80e67c41f7bbdc3e84cef00df18e"
    ),
    list(
      package_name = "fusen",
      repo_url = "https://github.com/ThinkR-open/fusen",
      commit = "d617172447d2947efb20ad6a4463742b8a5d79dc"
    )
  ),
  ide = "other",
  project_path = path_default_nix,
  overwrite = TRUE
)

rix() takes care of generating the correct Nix expression. So we need to extract this info about the Github packages from the renv.lock files to pass it down to rix().

RichardJActon commented 2 days ago

Ah cool I missed that.

The question is then what is the best interface for this? Processing the renv.lock file produces values that can be passed to 4 different arguments of rix(): r_ver, r_pkgs, git_pkgs, and local_r_pkgs. The latter 3 are all from processing the package information.

Following on with my current approach I would make 4 functions 1 for each argument, this seems cumbersome, but doesn't involve changing the behavior of rix() at all, so is nice for backwards compatibility.

Alternately an argument could be added to rix() to supply a path to an renv.lock file which automatically sets all these values based on the file, This seems nicer and more efficient.

Then what would be the desired behavior of this argument if it is set and one of the existing ones is also set? Should the existing arguments supersede the values supplied by the renv.lock file? and/or should some kind of exception/message be thrown if both are supplied?.

Maybe add 3 arguments to rix():

Adding use_renv_lock allows the default path to be set without defaulting to using renv.lock and preserving the current behavior. Alternatively to keep the number of arguments down we could go with leaving renv_lock_path unset by default and only use renv.lock if the user supplies the path, or go with something like renv_lock which will accept TRUE or a path and default to the current directory if TRUE is passed.

If allow_renv_overrides is TRUE r_ver, r_pkgs, git_pkgs, and local_r_pkgs can override values inferred from the renv.lock else setting use_renv_lock and any of these others throws an error saying to pick one or the other or allow overrides.

b-rodrigues commented 2 days ago

I think this should be a completely separate function called renv2nix() which uses rix() under the hood to generate the Nix expression. Packages should only be passed to r_pkgs unless they come from github, or are local (but does renv.lock list local packages?)

RichardJActon commented 2 days ago

OK sounds good - what do you think of that function having arguments which allow you to override things from the renv file or do you think it best to discourage this by design?

I believe renv can capture local packages - I tried it out a while ago when I was using a locally developed package in another project I think it just stored the path somewhere I'll check how it did it exactly.

b-rodrigues commented 2 days ago

I don't think we should override the renv.lock file, but perhaps we should generate the actual R call to rix() as well, so people could change that if they want

RichardJActon commented 2 days ago

As an option to facilitate debugging or as the default behavior? - I'd lean towards the former.

b-rodrigues commented 2 days ago

Agreed

RichardJActon commented 2 days ago

I've got a version of renv2nix() working in my fork, including an option to return the rix call it generates, it can handle github/gitlab remotes but I've not played with any of the other remote types.

When renv installs a local package it's expecting a path to a local source folder not a compressed built package like local_r_pkg seems to so not quite sure how to make those two play nice.

I need to write some tests to check it handles things OK when the inputs stray from the 'happy path' but the basics are there.

b-rodrigues commented 2 days ago

Looks very nice, that's amazing!

If you allow me a little pre-review:

Thank you once again, this looks really promising!

mihem commented 2 days ago

Sorry to interrupt your discussion. I just wanted to say that i think that this conversion from renv to rix is super important for many users I think. I am reading your book Bruno about reproducible pipelines in R (which is great and very much needed for reproducible data science in R ) and came across this issue. I tried rix on a small project and it worked great. But I have an existing project (where I used renv to increase reproducibility) and which has > 200 libraries. From my understanding I would need to manually write down all packages (minus dependencies)? Letting a function handle that sounds much better or is there an easier already existing way that I am missing? Using renv with docker this is easy, because you can just run renv::restore in the Dockerfile. But then building this image takes some time (and usually in the process of analysis I need to add several libraries sequentially and would need to rebuilt it several times then).

RichardJActon commented 1 day ago

I've added the method argument and factored out most of git_pkgs processing to another function. Because the remote type is listed in the renv entry for the package I factored out a function where you can assert the type of remote you are expecting but which defaults to using the one documented for a given package. It returns not the list you would give to git_pkgs but an item in the list that you would give to git_pkgs. I use a loop in renv2nix() to iterate over the packages from non-primary repository sources with this function and assemble the list to hand to git_pkgs.

@mihem all I'm really doing here with this "fast" renv2nix() is getting the list of all the r packages in an renv.lock file and passing this to the r_pkgs argument of rix(). It's only marginally more sophisticated than doing this:

rix(r_pkgs = names(jsonlite::read_json("renv.lock")$Packages))

because I've added the ability to detect and attempt to build packages recorded as coming from github and gitlab remotes. Its not checking that the versions of the packages installed from nix match those in the renv.lock file - I think that will take quite a lot more work.

This explicitly installs all of the transitive dependencies of the primary dependencies of your project. These would be installed anyway but the transitive dependencies could be implicit as they are represented in the nix derivations of the primary dependencies. So whilst you don't need to explicitly list them it doesn't really do any harm to do so. At least a I understand it.