Closed choldgraf closed 6 years ago
Why not simply allow the user to include an R script that will install the relevant packages? e.g. something like:
https://github.com/sje30/waverepo/blob/master/paper/waverepo_installs.R
we're trying to keep the repo setup as simple as possible - it seems like allowing for arbitrary R (or python/julia/whatever) code as a prerequisite for building is a bit much. It'd be better to have something list-like in a file (similar to how python and Julia handle it) but we haven't found a great solution yet. The other challenge is that we want to be able to enforce consistency for an image. If you used a script such as the one above, is there a way to get a specific version of a package?
Yeah, the underlying model is that a git commit should always build to the same reproducible docker image no matter when it is built. Without using something like MRAN snapshots this doesn't work with R. Even with it I don't know how it works with installing packages from outside of it (like off GitHub).
There is packrat, which might help: https://rstudio.github.io/packrat/
https://support.rstudio.com/hc/en-us/articles/219949047-Installing-older-versions-of-packages
shows the following syntax for older versions of packages:
install_version("ggplot2", version = "0.9.1", repos = "http://cran.us.r-project.org")
Just chatted with Carl. Here are some thoughts:
old_package version 1 and older package version 2.3beta". For this reason MRAN was created to build snapshots of CRAN on each day. That way you can say "give me the state of CRAN on August 1st, 2017".
Either way, it seems like we sort of have a few options here:
install.packages("<package-name>", type="source",dependencies=TRUE,repos="http://a.cran.mirror")
library(<packagename>)
in the .R files that are in the repo. Maybe with ContainerIt.
I don't think these 4 are mutually exclusive. I'd propose that we do some combination of 3 and 4 first, and tell users that for now they can't assume things like specific version numbers until we figure out 1 and 2.
@cboettig does this sound correct to you? feel free to tell me where I'm wrong here :-)
👍
FWIW, this generally works out of the box if someone just uses a rocker image with a version-specific tag, e.g. rocker/rstudio:3.4.1 (corresponding to R version 3.4.1). The CRAN repo is automatically set to the MRAN snapshot of the last date that version was current. Since 3.4.1 is the most recent R release today, this means a user of that image will get the latest copies of whatever packages they install. But if they run the same scripts a few years later, they will get those same copies, matching R 3.4.1. The user just uses the same install.packages() function they always do, and never needs to pay any attention to repos or versions.
On the Bioconductor side this is even simpler, assuming users install Bioconductor packages in the usual Bioconductor way, using biocLite. That script checks the R version being run and gets the right packages, so by locking the R version, a user locks the version for all biocLite packages, while never having to think about it. Our method of setting MRAN defaults in the docker images basically lets us duplicate this behavior on the CRAN side. All system libraries are pinned to stable apt sources (e.g. specifically named Debian releases instead of 'floating' aliases like testing or stable), so these will likewise be stable/replicable on re-builds of the stack years later.
Users that want a docker image that always has the latest version simply omit the version tag or specify latest, e.g. rocker/rstudio:latest. Any scripts deployed on that environment will always have access to the (built nightly) latest versions of all packages. (Likewise, rocker/rstudio:devel has the latest stable releases of packages but running on the nightly devel release of R, though this is mostly relevant to devs who are required to test packages on devel.)
Users installing stuff from GitHub are mostly on their own as far as reproducibility goes (that's true everywhere), though packrat is one solution (a simpler solution is to include an @ hash or version tag in the install_github call in a script).
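A minimal sketch of the "@ hash or version tag" idea, assuming the devtools package is available. The helper names pinned_spec and install_pinned are hypothetical, and the repository and tag are placeholders rather than real packages:

```r
# Hypothetical helper: build the "user/repo@ref" string that
# devtools::install_github() accepts, where ref is a tag or commit hash.
pinned_spec <- function(repo, ref) {
  paste0(repo, "@", ref)
}

# Hypothetical wrapper: install a GitHub package at a fixed ref so that
# re-running the script later fetches the same code.
install_pinned <- function(repo, ref) {
  devtools::install_github(pinned_spec(repo, ref))
}

# In an install script one might then write (placeholder repo and tag):
# install_pinned("someuser/somepkg", "v1.2.0")
```

A pinned ref only fixes the GitHub package itself; its CRAN dependencies would still come from whatever repo is configured.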
So my general recommendation to R users is just to run in a rocker environment; include the version tag when you want reproducibility, otherwise use latest, and don't worry too much. Users can add packrat or whatever on top, but the basic model is both pretty reproducible and pretty simple.
Hey @choldgraf et al, you should definitely take a look at http://vincebuffalo.org/notes/2017/08/28/notes-on-anaconda.html. I think @vsbuffalo does a much better job explaining precisely the kind of problems I was trying to communicate with regard to the system libraries.
thanks for the link @cboettig - definitely agree with the points in that article. There's a healthy debate within the python community about the pros and cons of something like anaconda's package manager. I think we should emphasize in the documentation that it's important to keep the number of channels used to 1 unless more are absolutely necessary (and mention that if you have N > 1, reproducibility may not be guaranteed).
Me and @cboettig just talked about this!
r-2017-01-21 needs to be set there to 'trigger' R. This sets up a version of R that was current at that date, and sets the MRAN snapshot for that date as the default repo. We'll also set up an R kernel for Jupyter to discover. This R will also make sure we are installing in a user-owned library path so stuff can be installed there at any point.
install.r executable script, that is then executed with the installed R. This lets people write R code that installs packages. This seems to be the most common & accepted way to install libraries.
We can deal with packrat later. This current solution seems like a good start to supporting the R community on binder/repo2docker.
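As a rough sketch of what a repo following this proposal might contain: the package names below are placeholders chosen for illustration, and the date is the one used in the example above.

```r
# install.r (spelled install.R later in the thread)
#
# A separate file, runtime.txt, would hold the single line:
#   r-2017-01-21
#
# No repos argument is given here: under the proposal, the default repo
# would already point at the MRAN snapshot for the date in runtime.txt.
install.packages(c("ggplot2", "dplyr"))
```

The idea is that the script reads like any ordinary R install script, with the snapshot date carrying the versioning.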
I think this looks great.
My only minor concern is where the two files install.r and runtime.txt might live. Would you look for them in just the top directory of the project? I'm just aware that the R packaging system is fussy about extra files being located in a package. So, if someone is using the root of an R package as the root of the repo project, I'm not sure those two files could live in the top-level dir without some fiddling.
@sje30 Currently those two files can live in the top dir of the git repo, or inside a dir called 'binder' if you don't want to clutter your top level dir. Do you think that suffices?
The runtime.txt convention comes from Heroku (https://devcenter.heroku.com/articles/python-runtimes), for those who are curious.
thanks -- yes, top level or in binder/ sounds like it would be great; I think .Rbuildignore can be used to tell R to ignore those files during a build
Hello all -- working with the install.r file now.
Can this be changed to install.R?
Additionally, how do you feel about naming the file install.packages.R, which would be concordant with the install.packages() installation function inside of R?
+1 for install.R.
I have never been wild about . in file names other than as a file-type extension.
Also, users can use other commands than install.packages, e.g. devtools::install_github("pkgname") or biocLite("pkgname") for Bioconductor (which is preferable to using install.packages(repo=..) for Bioconductor packages since it makes package installation stable wrt the R version).
@cboettig yes I see how this could cause confusion/not be totally accurate.
+1 for only install.R!
Updated the top-level comment with the current proposal. I think it's quite close!
One question: what happens if we detect install.R without detecting an r-YYYY-MM-DD line in runtime.txt? Just take today's date? Return a helpful error message saying we need both?
Thanks @choldgraf. Yep, helpful error message.
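To make the rule concrete, here is a small sketch of the check in R. It is illustrative only and not repo2docker's actual code; the function name parse_r_runtime and the error wording are assumptions.

```r
# Hypothetical helper: given the lines of runtime.txt, return the snapshot
# date from an r-YYYY-MM-DD line, or stop with a helpful message.
parse_r_runtime <- function(lines) {
  pattern <- "^r-([0-9]{4}-[0-9]{2}-[0-9]{2})$"
  hits <- grep(pattern, trimws(lines), value = TRUE)
  if (length(hits) == 0) {
    stop("install.R was found, but runtime.txt has no r-YYYY-MM-DD line; ",
         "add one (e.g. r-2017-01-21) to choose the R snapshot date.")
  }
  # An explicit format makes impossible dates come back as NA
  # instead of raising a less helpful error.
  snapshot <- as.Date(sub(pattern, "\\1", hits[1]), format = "%Y-%m-%d")
  if (is.na(snapshot)) {
    stop("runtime.txt has an r-YYYY-MM-DD line, but the date is not valid: ",
         hits[1])
  }
  snapshot
}

# Typical use would be: parse_r_runtime(readLines("runtime.txt"))
```

Failing loudly, rather than silently defaulting to today's date, keeps the "same commit, same image" property discussed earlier in the thread.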
ok, top-level comment updated to reflect this
@choldgraf @yuvipanda Do we want to close this? If not, let's add a specific next action here; a bunch of good work has happened with docs and binder-examples, so I'm not sure what the next step is.
I think that we should pivot this issue to be specifically about adding an R build pack (e.g. supporting install.R or something like this). Will change title to reflect this!
We should support R functionality. R handles dependencies differently from both Python and Julia; here are some thoughts from Carl:

Current proposal
runtime.txt to specify the R version as a date: r-YYYY-MM-DD. E.g. 2017-01-21.
install.R (must be executable) to execute with the installed R kernel (in the point above). This primarily lets people write R code that installs packages.
If install.R is given without an r-YYYY-MM-DD line in runtime.txt then we raise an error.

Main to-do items
install.R or postBuild or something.

Notes