jupyterhub / repo2docker

Turn repositories into Jupyter-enabled Docker images
https://repo2docker.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Support `install.R` files #24

Closed choldgraf closed 6 years ago

choldgraf commented 7 years ago

We should support R functionality. R handles dependencies differently from both Python and Julia, here are some thoughts from Carl:

Current proposal

Main to-do items

Notes

In an R package, the DESCRIPTION file plays the role of a requirements.txt in stating the dependencies, the minimal version needed, and where to get them (e.g. CRAN or an additional CRAN-type repo like Bioconductor).

This approach does not accommodate installing anything other than the most recent version of a package. (CRAN archives old sources, but unlike Python or Ruby gem distribution, CRAN is designed to provide binaries, and there is no guarantee that binaries will build from an old/archived source, so the default install does not directly support installing archived packages.)

If you just have a list of packages you want, I recommend something along the lines of what we do with rocker, e.g.

install2.r $(cat deps.txt)

Where deps.txt is just a list of package names you want to install. If these come from multiple repos (cran & bioconductor), just list those as arguments to -r:

install2.r -r "https://cran.rstudio.com" -r "https://bioconductor.org/packages/release" $(cat deps.txt)

If you want to install the same version each time, just use an MRAN snapshot of the appropriate date.
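As a rough illustration of the deps.txt approach above, here is a minimal Python sketch that assembles the install2.r command line from a package list and one or more repos. The function name `build_install2r_command` and the convention of skipping blank lines and `#` comments are assumptions made for this example; only the `-r` flag and the one-package-per-line idea come from the comment above.

```python
import shlex


def build_install2r_command(deps_text, repos):
    """Return an install2.r command line for the packages listed in deps_text."""
    # One package name per line; blank lines and '#' comments are skipped
    # (an assumed convention, not something install2.r itself defines).
    packages = [
        line.strip()
        for line in deps_text.splitlines()
        if line.strip() and not line.strip().startswith("#")
    ]
    args = ["install2.r"]
    for repo in repos:
        # Each repository is passed with its own -r flag.
        args += ["-r", repo]
    args += packages
    # Quote every argument so the result is safe to paste into a shell.
    return " ".join(shlex.quote(arg) for arg in args)


deps = "ggplot2\n\n# plotting helpers\ndplyr\n"
command = build_install2r_command(deps, ["https://cran.rstudio.com"])
print(command)
```

Building the argument list explicitly, rather than relying on shell substitution, makes it easier to validate package names before anything is run.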

sje30 commented 7 years ago

Why not simply allow the user to include an R script that will install the relevant packages? e.g. something like:

https://github.com/sje30/waverepo/blob/master/paper/waverepo_installs.R

choldgraf commented 7 years ago

we're trying to keep the repo setup as simple as possible - it seems like allowing for arbitrary R (or python/julia/whatever) code as a prerequisite for building is a bit much. It'd be better to have something list-like in a file (similar to how python and Julia handle it) but we haven't found a great solution yet. The other challenge is that we want to be able to enforce consistency for an image. If you used a script such as the one above, is there a way to get a specific version of a package?

yuvipanda commented 7 years ago

Yeah, the underlying model is that a git commit should always build to the same reproducible docker image no matter when it is built. Without using something like MRAN snapshots this doesn't work with R. Even with it I don't know how it works with installing packages from outside of it (like off GitHub).

sje30 commented 7 years ago

There is packrat, which might help: https://rstudio.github.io/packrat/

https://support.rstudio.com/hc/en-us/articles/219949047-Installing-older-versions-of-packages

shows the following syntax for older versions of packages:

install_version("ggplot2", version = "0.9.1", repos = "http://cran.us.r-project.org")

choldgraf commented 7 years ago

Just chatted with Carl. Here are some thoughts:

  1. It sounds like 95% of the "official" packages in R are covered either by Bioconductor or CRAN. These work in different ways as far as versioning.
    1. Bioconductor essentially follows a "release" model, similar to something like Ubuntu. There are no intermediate versions for individual packages, only the state of each package at a particular release point of Bioconductor.
    2. CRAN follows a dependency model that assumes that at any moment in time the most recent version of each package will play well with all of the other packages. It doesn't have a concept of "this package requires old_package version 1 and older package version 2.3beta". For this reason MRAN was created to build snapshots of CRAN on each day. That way you can say "give me the state of CRAN on August 1st, 2017".
  2. Packrat is not a repository, but is a tool that lets you sort of "collect" an R environment at a moment in time, and keep it for posterity so that it can be rebuilt exactly. It feels a bit of a heavy solution but it seems the most full-featured in terms of reproducibility.
  3. There's also a relatively new but useful-looking tool called ContainerIt. This will generate Dockerfiles from R scripts. So we might be able to use this in order to generate the Docker images that we could then feed through JupyterHub.

Either way, it seems like we sort of have a few options here:

  1. Support packrat. This will take some learning and development to figure out the right way to integrate it with the build.
  2. Support MRAN dates + Bioconductor versions.
  3. Allow people to specify a dependencies file, and our script will loop through each line in that file and do install.packages("<package-name>", type="source",dependencies=TRUE,repos="http://a.cran.mirror")
  4. Infer number 3 based on the calls to library(<packagename>) in the .R files that are in the repo. Maybe with ContainerIt.

I don't think these 4 are mutually exclusive. I'd propose that we do some combination of 3 and 4 first, and tell users that for now they can't assume things like specific version numbers until we figure out 1 and 2.
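Option 4 could be prototyped without ContainerIt by scanning R sources for library() and require() calls. The sketch below is one assumed way to do that with a regular expression; `find_r_packages` is a hypothetical helper, and it deliberately ignores harder cases such as `pkg::fun` references or calls inside comments.

```python
import re

# Matches library(pkg), library("pkg"), require(pkg, ...) and captures the name.
# R package names start with a letter and contain letters, digits and dots.
LIBRARY_CALL = re.compile(
    r"""\b(?:library|require)\(\s*["']?([A-Za-z][A-Za-z0-9.]*)["']?\s*[,)]"""
)


def find_r_packages(source):
    """Return the sorted, de-duplicated package names loaded in an R source string."""
    return sorted(set(LIBRARY_CALL.findall(source)))


example = """
library(ggplot2)
library("dplyr")
require(data.table, quietly = TRUE)
library(ggplot2)
x <- 1
"""
packages = find_r_packages(example)
print(packages)
```

The resulting list could then feed option 3, with one install.packages() call per inferred package.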

@cboettig does this sound correct to you? feel free to tell me where I'm wrong here :-)

cboettig commented 7 years ago

👍

FWIW, this generally works out of the box if someone just uses a rocker image with a version-specific tag, e.g. rocker/rstudio:3.4.1 (corresponding to R version 3.4.1). The CRAN repo is automatically set to the MRAN snapshot of the last date that version was current. Since 3.4.1 is the most recent R release today, this means a user of that image will get the latest copies of whatever packages they install. But if they run the same scripts a few years later, they will get those same copies, matching R 3.4.1. The user just uses the same install.packages() function they always do, and never needs to pay any attention to repos or versions.

On the Bioconductor side this is even simpler, assuming users install Bioconductor packages in the usual Bioconductor way, using biocLite. That script checks the R version being run and gets the right packages, so by locking the R version, a user locks the version for all biocLite packages, while never having to think about it. Our method of setting MRAN defaults in the docker images basically lets us duplicate this behavior on the CRAN side. All system libraries are pinned to stable apt sources (e.g. specifically named Debian releases instead of 'floating' aliases like testing or stable), so these will likewise be stable/replicable on re-builds of the stack years later.

Users that want a docker image that always has the latest version simply omit the version tag or specify latest, e.g. rocker/rstudio:latest. Any scripts deployed on that environment will always have access to the (built nightly) latest versions of all packages. (Likewise, rocker/rstudio:devel has the latest stable releases of packages but running on the nightly devel release of R, though this is mostly relevant to devs who are required to test packages on devel).

Users installing stuff from GitHub are mostly on their own as far as reproducibility goes (that's true everywhere) though packrat is one solution (a simpler solution is to include an @ hash or version tag in the install_github call in a script).

So my general recommendation to R users is just to run in a rocker environment; include the version tag when you want reproducibility, otherwise use latest, and don't worry too much. Users can add packrat or whatever on top, but the basic model is both pretty reproducible and pretty simple.

cboettig commented 7 years ago

Hey @choldgraf et al, you should definitely take a look at http://vincebuffalo.org/notes/2017/08/28/notes-on-anaconda.html, I think @vsbuffalo does a much better job explaining precisely the kind of problems I was trying to communicate with regards to the system libraries.

choldgraf commented 7 years ago

thanks for the link @cboettig - definitely agree with the points in that article. There's a healthy debate within the python community about the pros and cons of something like anaconda's package manager. I think we should emphasize in the documentation that it's important to keep the number of channels used to 1 unless more are absolutely necessary (and mention that with N > 1 channels, reproducibility may not be guaranteed)

yuvipanda commented 7 years ago

Me and @cboettig just talked about this!

  1. Specify the R version as a date in runtime.txt. So a value like r-2017-01-21 needs to be set there to 'trigger' R. This sets up a version of R that was current at that date, and sets the MRAN snapshot for that date as the default repo. We'll also set up an R kernel for Jupyter to discover. This R will also make sure we are installing in a user-owned library path so stuff can be installed there at any point.
  2. If there's an install.r executable script, that is then executed with the installed R. This lets people write R code that installs packages. This seems to be the most common & accepted way to install libraries.

We can deal with packrat later. This current solution seems like a good start to supporting the R community on binder/repo2docker.
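To make the runtime.txt convention concrete, here is a small Python sketch of how an `r-YYYY-MM-DD` value might be parsed and mapped to a snapshot repo. The helper names are hypothetical, and the URL layout (`https://mran.microsoft.com/snapshot/<date>`) is an assumption based on MRAN's dated-snapshot scheme, not something specified in this thread.

```python
import re
from datetime import date

# A runtime.txt value such as "r-2017-01-21" requests R with a dated snapshot.
RUNTIME_PATTERN = re.compile(r"^r-(\d{4})-(\d{2})-(\d{2})$")


def parse_r_runtime(runtime_text):
    """Return the snapshot date if the text requests R, otherwise None."""
    match = RUNTIME_PATTERN.match(runtime_text.strip())
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    # date() raises ValueError for impossible dates such as 2017-02-30.
    return date(year, month, day)


def mran_snapshot_url(snapshot_date):
    """Build the snapshot repo URL for a date (assumed URL layout)."""
    return f"https://mran.microsoft.com/snapshot/{snapshot_date.isoformat()}"


snapshot = parse_r_runtime("r-2017-01-21\n")
print(snapshot, mran_snapshot_url(snapshot))
```

Parsing into a real `date` object, rather than keeping the string, rejects malformed dates early, before any image build starts.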

sje30 commented 7 years ago

I think this looks great.

My only minor concern is where the two files install.r and runtime.txt might live. Would you look for them in just the top directory of the project? I'm just aware that the R packaging system is fussy about extra files being located in a package. So, if someone is using the root of an R package as the root of the repo project, I'm not sure those two files could live in the top-level dir without some fiddling.

yuvipanda commented 7 years ago

@sje30 Currently those two files can live in the top dir of the git repo, or inside a dir called 'binder' if you don't want to clutter your top level dir. Do you think that suffices?
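The lookup described here might be sketched as follows. `find_config_file` is a hypothetical helper, and checking `binder/` before the top-level directory, with a per-file fallback, is an assumption made for this example rather than a statement of how repo2docker resolves the two locations.

```python
import tempfile
from pathlib import Path


def find_config_file(repo_root, name):
    """Return the path to a config file in binder/ or the repo root, else None."""
    root = Path(repo_root)
    # Assumed precedence: binder/ first, then the top-level directory.
    for candidate in (root / "binder" / name, root / name):
        if candidate.is_file():
            return candidate
    return None


# Demonstrate with a throwaway repo layout.
with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp)
    (repo / "binder").mkdir()
    (repo / "binder" / "install.R").write_text('install.packages("ggplot2")\n')
    (repo / "runtime.txt").write_text("r-2017-01-21\n")

    found_install = find_config_file(repo, "install.R").relative_to(repo).as_posix()
    found_runtime = find_config_file(repo, "runtime.txt").relative_to(repo).as_posix()
    missing = find_config_file(repo, "environment.yml")

print(found_install, found_runtime, missing)
```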

yuvipanda commented 7 years ago

The runtime.txt convention comes from Heroku (https://devcenter.heroku.com/articles/python-runtimes), for those who are curious.

sje30 commented 7 years ago

thanks -- yes, top level or in binder/ sounds like it would be great; I think .Rbuildignore can be used to tell R to ignore those files during a build

taylorreiter commented 7 years ago

Hello all -- working with install.r file now.

Can this be changed to install.R?

Additionally, how do you feel about naming the file install.packages.R, which would be concordant with the install.packages() installation function inside of R?

cboettig commented 7 years ago

+1 for install.R.

I have never been wild about . in file names other than as a file-type extension.

Also, users can use other commands than install.packages, e.g. devtools::install_github("pkgname") or biocLite("pkgname") for BioConductor (which is preferable to using install.packages(repos=..) for BioConductor packages since it makes package installation stable wrt the R version.)

taylorreiter commented 7 years ago

@cboettig yes I see how this could cause confusion/not be totally accurate.

+1 for only install.R!

choldgraf commented 7 years ago

Updated the top-level comment with the current proposal. I think it's quite close!

One question:

what happens if we detect install.R without detecting an r-YYYY-MM-DD line in runtime.txt? Just take today's date? Return a helpful error message saying we need both?

yuvipanda commented 7 years ago

Thanks @choldgraf. Yep, helpful error message.
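A check along these lines could look like the Python sketch below. The function name `check_r_config` and the wording of the message are assumptions made for illustration; the thread only settles that an error should be reported when install.R is present without an `r-YYYY-MM-DD` line in runtime.txt.

```python
import re

R_RUNTIME = re.compile(r"^r-\d{4}-\d{2}-\d{2}$")


def check_r_config(has_install_r, runtime_text):
    """Return an error message, or None when the configuration is consistent.

    runtime_text is the content of runtime.txt, or None if the file is absent.
    """
    has_r_runtime = any(
        R_RUNTIME.match(line.strip())
        for line in (runtime_text or "").splitlines()
    )
    if has_install_r and not has_r_runtime:
        return (
            "Found install.R but no r-YYYY-MM-DD line in runtime.txt; "
            "both are needed to build an R environment."
        )
    return None


print(check_r_config(True, None))
print(check_r_config(True, "r-2017-01-21\n"))
```

Returning a message instead of silently defaulting to today's date keeps builds reproducible, which is the goal stated earlier in the thread.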

choldgraf commented 7 years ago

ok, top-level comment updated to reflect this

willingc commented 6 years ago

@choldgraf @yuvipanda Do we want to close this? If not, let's add a specific next action here: a bunch of good work has happened with docs and binder-examples, so I'm not sure what the next step is.

choldgraf commented 6 years ago

I think that we should pivot this issue to be specifically about adding an R build pack (e.g. supporting install.R or something like this). Will change title to reflect this!