ropensci / unconf17

Website for 2017 rOpenSci Unconf
http://unconf17.ropensci.org

An extremely lightweight packrat/checkpoint #74

Closed: karthik closed this 7 years ago

karthik commented 7 years ago

I teach a lot of workshops and it sucks to lose the first hour or so waiting for everyone to get all the packages installed, usually over the inevitable slow internet connection. All this makes me want to cry.

So, for a few months I've had a vision for a package, which I started in early April with some help/advice from @jimhester. I call it rrequire.

What it does is, essentially, a pip freeze for R.

It will read a folder full of scripts, Rmd files, etc., and parse out the packages they use from library() calls, require() calls, pkg::, and pkg:::. Parts of this exist elsewhere, and Jim has already helped me pull some of it out of his lookup package.
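
Since the repo is still private, here's a rough sketch of how I imagine that scraping step working (find_deps() and write_requirements() are hypothetical names, and the regexes are deliberately naive):

```r
# Hypothetical sketch, not the actual rrequire code: scan a folder of
# R/Rmd files for library()/require() calls and pkg::/pkg::: usage.
find_deps <- function(dir = ".") {
  files <- list.files(dir, pattern = "\\.(R|Rmd)$", recursive = TRUE,
                      full.names = TRUE, ignore.case = TRUE)
  deps <- lapply(files, function(f) {
    code <- paste(readLines(f, warn = FALSE), collapse = "\n")
    # library(pkg) and require(pkg), quoted or bare
    # (naive: also matches inside comments and strings)
    calls <- regmatches(code,
      gregexpr("(library|require)\\(['\"]?[[:alnum:].]+", code))[[1]]
    calls <- sub("^(library|require)\\(['\"]?", "", calls)
    # pkg:: and pkg::: qualified calls
    ns <- regmatches(code, gregexpr("[[:alnum:].]+:{2,3}", code))[[1]]
    c(calls, sub(":{2,3}$", "", ns))
  })
  sort(unique(unlist(deps)))
}

# Freeze installed versions, pip-style (pkg==version), to a requirements file
write_requirements <- function(pkgs, file = "requirements.txt") {
  versions <- vapply(pkgs, function(p) as.character(packageVersion(p)), "")
  writeLines(paste0(pkgs, "==", versions), file)
}
```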

The package should not take too long and I hope to find the time to finish it soon, but would love advice or collaborators if anyone is inclined to help (which means we can work on it at unconf).

What it will do:

So, why something so basic? I see this problem all the time. Everywhere. Even at rstudio::conf this year, I watched Charlotte's purrr workshop take ~40 minutes to get set up, with thumb drives being passed around. With this package, every single person will be able to follow along with the code and examples, and not get frustrated and demoralized.

I have a private repo at the moment (mostly because it doesn't yet install correctly and there are plenty of issues), but I can open it up if there is interest in collaborating.

Also, feedback welcome if this is a terrible idea or if I should reconsider the plan.

karthik commented 7 years ago

referencing issue #5

jcheng5 commented 7 years ago

@kevinushey can weigh in if he has a different perspective, but I think we want to take packrat in this direction too. Packrat is essentially a superset of what you're describing.

I think where we (alright, I) went wrong was going a little overboard in making things as isolated and as reproducible as possible; what probably should have been reserved for a "strict" mode instead was the default/only behavior. We create a private package library for every project; this gives you isolation but costs disk space. We stash the source packages for every direct or indirect package dependency in the project directory; this ensures that you can install your packages even if CRAN and GitHub go down permanently or if you're offline, but costs disk space/bandwidth, greatly slows down the process of getting started, and slows down installs by forcing them to all be from source (versus installing binaries from CRAN).

This is (probably?) the appropriate set of tradeoffs for one important packrat scenario (freezing dependencies for reproducibly deploying R code to production environments), but not so appropriate for the kind of teaching or individual-researcher scenarios that are relevant to most end users. Like you say, something more lightweight would be better.

If I were you I'd be tempted to start a totally new package too, but it might be worth talking about whether some minor changes to packrat could do the trick. In my ideal world, packrat should make it easy to choose whatever set of tradeoffs you want. For example, I spend a lot of time running problematic user code from shiny's mailing list or Stack Overflow, and often it requires packages I've never heard of. I definitely want to use a private library, and I want all the dependencies to install automatically, but I don't particularly care about version numbers or reproducibility.

MilesMcBain commented 7 years ago

👍 One extra thing to chuck in: I don't know if it's just an Aus thing, but academics rocking Windows work laptops always seem to have a setup where their home folder is on a synced network drive. The default location for their R library goes there and causes all manner of problems; everything might have been working fine till they moved off campus to attend a workshop.

Typical complaints are "but I installed this, why can't it find it" or "it just keeps installing but never works". Guaranteed, I will spend the start of every workshop directing people to this Stack Overflow question and updating their lib path. The next step is reinstalling all their packages 😥...

So the ability to detect this situation and direct people to the remedy, or even automate it, would be a real workshop time-saver.
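
For what it's worth, the automated remedy could be as small as this sketch (fix_libpath() is hypothetical, the UNC-path check is a heuristic, and the local-library location is just an assumption):

```r
# Hypothetical helper: if the primary library lives on a UNC network share
# (\\server\share or //server/share), create and prepend a local library.
fix_libpath <- function(local_lib = file.path(Sys.getenv("LOCALAPPDATA"),
                                              "R", "library")) {
  if (grepl("^\\\\\\\\|^//", .libPaths()[1])) {
    dir.create(local_lib, recursive = TRUE, showWarnings = FALSE)
    .libPaths(c(local_lib, .libPaths()))  # local library now takes priority
    message("Primary library switched to ", local_lib)
  }
}
```

This wouldn't catch a network drive mapped to a letter like H:, so a real implementation would need a smarter check.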

jcheng5 commented 7 years ago

BTW, another solution we use a lot is preconfigured RStudio Server accounts for workshops. This works great... unless wifi goes down. As it did at rstudio::conf. Repeatedly.

But like I said, when it works, it works great. I don't know that anyone has posted instructions, though, on how to run this kind of setup effectively: in particular, how to provision user accounts and get credentials to people if you don't know who's coming ahead of time.

stephlocke commented 7 years ago

I posted a Docker-based solution which handles the user creds plus the rest of the setup:

https://itsalocke.com/r-training-environment/

jimhester commented 7 years ago

Bioconductor uses Amazon Machine Images (AMIs) with RStudio and Bioconductor packages pre-installed for the same purpose as well.

MilesMcBain commented 7 years ago

@stephlocke !!! That is so useful, thank you!

noamross commented 7 years ago

Can requirements.txt basically be put into DESCRIPTION? I'm mixed on how much package structure I want to force into projects, but once Remotes broke the wall that DESCRIPTION always has to be CRAN-compliant, it made me realize it could be useful for a lot of things. So you could have the packages in Imports/Suggests/Depends carry not only minimum versions, but exact versions (or versions with MRAN dates). Over in #5 I'm liking the idea of DESCRIPTION having a field like ProjectType: Make/remake/etc.
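
To make that concrete, a hypothetical project DESCRIPTION might look like the sketch below. Note this is deliberately not CRAN-compliant, and exact-version (==) constraints aren't enforced by standard install tooling today; the point is just what the file could encode:

```
Package: myworkshop
ProjectType: Make
Imports:
    dplyr (== 0.5.0),
    ggplot2 (== 2.2.1)
Suggests:
    purrr (== 0.2.2)
Remotes:
    tidyverse/glue
```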

kevinushey commented 7 years ago

Here's a brain dump of the various issues that we've seen in Packrat -- some might help motivate decisions you make here...

  1. It's not only the package's own version that you need to worry about, but the versions of dependent packages that it was built against. This matters for things like LinkingTo: (what headers did this package compile against?) and occasionally for cached closures. Although issues of this form are rare, I do recall a big mess around reference classes where cached reference classes in certain packages no longer worked after the methods package was updated. Unfortunately, it's typically not possible to know what versions of dependent packages were used when a package was built...

  2. Headaches amplify on Windows. Just getting Rtools installed can often cost a lot of time, and a number of R packages require external libraries only available on the CRAN build servers (and so, would fail when attempting to build from sources locally). In other words, without package binaries (or access to package binaries), on Windows you might be hosed no matter what you try.

  3. Bioconductor makes things more complicated, especially when you have users who are mixing packages from different versions of Bioconductor, or mixing release and development versions of packages, and so on. Although Bioconductor is at heart just a collection of versioned CRAN-like repositories, there's still some fussing around with the BiocInstaller package that needs to happen for this to work. [I'm guessing this might be out of scope for what you're proposing, but it's worth keeping in mind]

  4. Compile times can skyrocket when Rcpp / stringi are involved, and Rcpp / stringi are almost always involved. There are also rare (but not rare enough) cases where packages that normally compile fine on most machines will fail because someone either has a super old version of gcc, or a super new version of gcc, or some stale CXXFLAGS hanging around somewhere that break package compilation in a weird way.

  5. Speaking about compilation, many R packages require certain system libraries, and fail to compile / build when those libraries are not available. This implies that your tool might need some way of encoding that particular packages depend on certain system libraries, and perhaps some strategies for resolving those dependencies.

  6. We made the mistake of being too strict about where a package can be restored from in Packrat. For example, if you install an R package from sources locally, then Packrat assumes that it should not attempt to restore that package from CRAN -- even if the same version of that package is available (either as the current release, or in the CRAN archives). Point being, depending on how 'strict' you want to be, you likely want to be somewhat lax in terms of 'where' a package can come from during installation.

  7. Early iterations of packrat also made the mistake of only recording packages in the user library. This is problematic for a number of reasons -- first being that users can (and do) install packages into the system library, and second being that Recommended R packages, while available by default on Windows / macOS R installations, may not be available on Linux. (See the sketch after this list.)

  8. Some packages might have been installed with custom configure arguments / configure variables. There should ideally be some way to encode what these should be in the (user-generated or machine-generated) requirements.txt.
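
On point 7, a quick base-R illustration of why the user library alone isn't enough; R typically consults several library paths:

```r
# R consults multiple libraries: user, site, and system.
.libPaths()                            # all library paths, in search order
ip <- installed.packages(.libPaths())  # packages across every library
split(rownames(ip), ip[, "LibPath"])   # which package is installed where
```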

tl;dr: I think asking students to install packages from sources for a workshop environment will inevitably end in heartbreak. There are basically two ways out of this, IMHO:

That said, I do think a package like rrequire could have enormous value; I just also think that, no matter what, an 'install from sources' strategy is going to cause hard-to-solve problems.

karthik commented 7 years ago

Wow, thanks for all the really useful input. Will digest and respond in detail.

> BTW, another solution we use a lot is preconfigured RStudio Server accounts for workshops.

This is what we've been using for rOpenSci workshops since 2014. It works really, really well, and I even have printed strips of logins ready to go. However, the mission of various training workshops (including Data and Software Carpentry, and I imagine Jenny's STAT 545) is that everyone has everything working on their own machine, no matter how big the struggle. That way they take everything with them to continue their learning.

Someone who cannot yet do basic data manipulation or visualization is not going to spin up AMIs or use Docker images.

@kevinushey Thank you! This is super helpful and I'm in awe of all the work that has gone into packrat. I love the idea of providing binaries, but see that this isn't trivial.

karthik commented 7 years ago

> I just also think that, no matter what, an 'install from sources' strategy is going to cause hard-to-solve problems.

The last bit (sources available on a local room-only network) is an idea I tacked on. I see the value mostly in having people install from CRAN/GitHub using the requirements file, over a strong internet connection, before they arrive at a workshop. The more we can do to have people hit the ground running, the smoother workshops go.

I still plan to use RStudio servers where possible, but for some workshops, we really need people to have the entire setup locally for them to continue their work.

@jcheng5 I am all for improving packrat if that obviates the need for a package like this!

cboettig commented 7 years ago

Great thread. Super excited to hear, @jcheng5 @kevinushey, that packrat is looking at these things; that's precisely why I've never been able to use packrat in my own work (it is too strict, and I work frequently on Linux platforms with no binaries on CRAN, meaning packrat was constantly re-installing a lot of my package suite from source, which is unworkable). Still, I think there's huge value in the packrat manifest idea: a scripted way to specify a particular environment of packages with greater precision than one can manage by just using a DESCRIPTION file.

As an aside, for the reproducibility angle I've found it generally simplest and sufficiently reliable to just point my default CRAN mirror at the appropriate MRAN snapshot date. Sure, this only works for packages on CRAN, and yes, it assumes you're not recreating environments where some packages were up to date with CRAN at the time and some were years out of date, but that latter situation is dubious anyway. We've also used this MRAN approach in the versioned R Docker images in rocker: https://github.com/rocker-org/rocker-versioned.
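
The whole trick fits in one line (the snapshot date below is just an example):

```r
# Pin the default CRAN repo to an MRAN snapshot date.
options(repos = c(CRAN = "https://mran.microsoft.com/snapshot/2017-05-26"))
install.packages("dplyr")  # resolves against CRAN as it stood on that date
```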

Of course, the reproducibility issues that packrat or MRAN address are somewhat orthogonal to @karthik's workshop question, where I think the only concern is to quickly install a recent environment. In practice I've found R already does this rather well if I can just restrict my instruction to some core suite like the tidyverse, though it sounds like others with more experience have seen that crash and burn. (I could see 40-minute installs with headaches for novice users on Linux systems, but thought that one should be pretty clean out of the box on Windows and Mac?)

Obviously installing from sources / github packages is another problem entirely, and I agree with the sentiment of @kevinushey that there be dragons. If installing packages from source is going to be expected, I think it requires some instruction on the process and potentially hairy details -- I think leaving a workshop where you got everything installed and working locally but have no idea how to do it again is just a recipe for trouble down the road.

A nice feature of Docker in workshops is that it can usually be run on the student's own machine (older PCs can be trouble); and with RStudio and linked volumes it feels a lot more native than some remote AMI. Obviously download sizes can be large, and launching RStudio into a browser from Docker can be pretty alien (though it's gone smoothly in my classes). This doesn't teach students anything about managing their own local installs, but it does teach them a strategy that can be reproduced on a colleague's machine or a cloud server. It also avoids telling students to upgrade any and all packages they might currently have installed on their local system, which can break their existing work.

rmflight commented 7 years ago

I seem to recall someone started a package that would create a local CRAN instance that could essentially be served from a USB stick, and would also let you select which packages you wanted to have available. It would probably be very useful for this type of problem, coupled with a script to generate a list of the packages that are needed.

Unfortunately, I've forgotten the name of the project, but I seem to recall @hrbrmstr commenting on it as well, maybe he recollects it??

benmarwick commented 7 years ago

@rmflight, miniCRAN is a pkg like that. I've used it for the situation @karthik described (and wrote a little about it here; my setup and install scripts are here).
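
Roughly, the miniCRAN workflow looks like the sketch below (written from memory of the API, so treat the argument details as approximate; the E:/ path is just a placeholder for the USB stick):

```r
# Build a CRAN-like repo of just the workshop packages plus their
# dependencies, then serve it from a USB stick.
library(miniCRAN)
pkgs <- pkgDep("tidyverse", repos = "https://cloud.r-project.org")
makeRepo(pkgs, path = "E:/miniCRAN", type = "win.binary",
         repos = "https://cloud.r-project.org")

# Learners then point install.packages() at the stick:
install.packages("tidyverse", repos = "file:///E:/miniCRAN",
                 type = "win.binary")
```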

More recently, I found this wouldn't work with the sf pkg on OSX, so I ditched miniCRAN and simply had learners install the pkg binaries for their platform from a USB drive (which worked great, except for the one Linux user in that group). I wrote about this here; the script is here.

That second method is about as bare-bones as possible (and easily allows for non-CRAN pkgs), since it requires no contributed pkgs. But there's definitely scope to make the process more user-friendly, saving time on solving the inevitable install errors. I especially love @karthik's idea of a check (e.g. for out-of-date pkgs) and an error log. Could be a role here for the glue pkg!

haozhu233 commented 7 years ago

Assuming the internet is all right, maybe we can let the instructor put a list of package names (basically the requirements.txt) in a gist (or a similar place) and then have an install_gist() function to facilitate installation? Or, if we want to go further, the instructor can put that gist link behind a bit.ly and we have an install_bitly() function that does the jumping?
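
One possible shape for that idea (install_gist() is hypothetical; nothing here exists yet): fetch a plain-text package list from a raw gist URL and install whatever is missing.

```r
# Hypothetical: read a one-package-per-line list from a URL and install
# only the packages that aren't already present.
install_gist <- function(url) {
  pkgs <- trimws(readLines(url, warn = FALSE))
  pkgs <- pkgs[nzchar(pkgs)]
  needed <- setdiff(pkgs, rownames(installed.packages()))
  if (length(needed)) install.packages(needed)
  invisible(needed)
}

# e.g. install_gist("https://gist.githubusercontent.com/<user>/<id>/raw/requirements.txt")
```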

jcheng5 commented 7 years ago

TIL about the pkgsnap package by @gaborcsardi; it seems much lighter-weight than packrat (but it snapshots your entire library, not just what's being used in your project).
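
Usage, as I remember the API (the function and argument names may be slightly off, so check the README):

```r
# devtools::install_github("MangoTheCat/pkgsnap")
library(pkgsnap)
snap(to = "packages.csv")       # record installed packages and their versions
restore(from = "packages.csv")  # reinstall those exact versions later
```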

hadley commented 7 years ago

We've also talked a bit about moving packrat to my team and putting some more resources (i.e. @jimhester & @gaborcsardi 😉) behind it.