jupyterhub / repo2docker

Turn repositories into Jupyter-enabled Docker images
https://repo2docker.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Investigate methods of making R builds faster #412

Closed choldgraf closed 4 years ago

choldgraf commented 6 years ago

I recently spoke with @karthik, who mentioned that our R builds (with install.packages) seem to be going really slowly. There could be a couple of problems, which I'll list here:

It's possible to install some R packages on Ubuntu much faster by installing pre-built binaries. We could recommend this in the documentation for specifying R packages and such...

relevant blog post: http://dirk.eddelbuettel.com/blog/2017/12/13/

old points:

1. mybinder.org may not have enough RAM, which is causing the build to be really slow for certain packages (like the tidyverse). Apparently many R packages have intermediate steps during install that use multiple gigs of RAM.
2. We aren't using some binary packages even though they are available. repo2docker seems to be building everything from source, even though for some packages there are binaries out there. We could investigate to see if this is an option!

betatim commented 6 years ago

From talking to R users it seems that binary packages for Linux aren't a thing with CRAN?! Pointers and contributions would be welcome, as it is pretty frustrating to see all these packages being built from source over and over again.

https://www.rdocumentation.org/packages/utils/versions/3.5.1/topics/install.packages mentions binary packages (search for/scroll down to the "Binary packages" heading) but also says they don't exist for Linux. This seems to be the reason repo2docker compiles everything from scratch. Not sure how we fix that, or why there isn't an equivalent of the manylinux wheels for Python packages. Overall I think more R expertise would be welcome ;)

karthik commented 6 years ago

Here is a relevant post: http://dirk.eddelbuettel.com/blog/2017/12/13/

betatim commented 6 years ago

Thanks. We will need to update our instructions to R users to install APT packages instead of R packages, and potentially check where those packages get installed and whether that requires any more gymnastics with the environment variables for R's library search paths.
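
For illustration, a minimal sketch (not from this thread) of how to check where packages end up; the paths shown are typical Debian/Ubuntu defaults and may differ per image:

```r
# Inspect R's library search path; apt's r-cran-* packages typically land in
# a different directory than packages installed with install.packages().
.libPaths()
#> [1] "/usr/local/lib/R/site-library"   # common default for install.packages()
#> [2] "/usr/lib/R/site-library"         # where Debian/Ubuntu r-cran-* packages go
#> [3] "/usr/lib/R/library"              # base and recommended packages

# Confirm a given package is actually visible on the search path:
find.package("dplyr")
```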

choldgraf commented 6 years ago

@betatim +1 to that - I think we should add an example (maybe r-fast?) and also update the documentation here: https://repo2docker.readthedocs.io/en/latest/howto/languages.html

betatim commented 6 years ago

https://github.com/betatim/dockerfile-r/tree/use-apt-instead (diff) seems to work, but Shiny is now broken. I assume there are some dependencies it needs that are part of the tidyverse or something?

ryanlovett commented 6 years ago

I've also dealt with slow R-focused image builds. Using distro packages helps but not everything is packaged. When building from source, you need to make sure that the dependencies aren't automatically built from source as well, even if they are already installed as distro packages.

https://github.com/berkeley-dsep-infra/datahub/blob/staging/deployments/datahub/image/install.R

I set upgrade_dependencies = FALSE in devtools::install_github to prevent that.

Unrelated, I also rely on the read-only repositories in the "cran" org to help pin the version for reproducibility. It's often easier than finding the upstream repo.
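
A sketch of both patterns (package names and versions are only illustrative; newer devtools/remotes renamed the argument to `upgrade`):

```r
library(devtools)

# Install from the upstream repo without rebuilding dependencies that are
# already present (e.g. installed as distro packages):
install_github("tidyverse/dplyr", upgrade_dependencies = FALSE)

# Pin a specific release for reproducibility via the read-only mirrors in the
# GitHub "cran" org, which carry a tag per CRAN release:
install_github("cran/dplyr@0.7.8", upgrade_dependencies = FALSE)
```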

betatim commented 6 years ago

Another thing to check and document: the Ubuntu package versions will not correspond to the MRAN date set in runtime.txt.

Chris, can we update the top comment with a to-do list of these points/things to check?

All these caveats make me wonder if we should leave using Ubuntu packages to get faster installs as an "expert thing you can do if you know you can do it". For most repositories there should be a lot more launches of already-built images than builds. With that in mind, and a goal of reproducibility, it might be better to go for slow but simple?

karthik commented 6 years ago

The disconnect with MRAN is an issue. The big challenge, though, is that the tidyverse is a very popular package and most Binder use cases will require it. I agree that it would be critical to make the speed versus reproducibility tradeoff clear.

Slow builds are totally fine, but I was having issues with build failures on a real-world use case (6-8 medium to large(ish) packages).

choldgraf commented 5 years ago

@betatim I think that if the result is a build failure due to simply trying to build the tidyverse + something small-ish, then we need to change something to make this a more tenable option for people. Another, more complicated option would be to go back to using a Rocker image with some of the bigger packages pre-installed (or we could just recommend that people do this)

betatim commented 5 years ago

Does someone have a link to a repository/configuration that fails to build? That would make it easier to investigate why exactly it fails. If it is the amount of RAM required, we should move the discussion to mybinder.org-deploy and, with our mybinder.org hats on, discuss increasing the RAM.

Building things like tidyverse straight from master takes a while but does succeed. This is about the biggest package I know.

I'd be hesitant to use Rocker, as we would end up with two different base images, which would make it tricky to mix buildpacks :-/

choldgraf commented 5 years ago

Actually, @SylvainCorlay made a good point that we could also recommend that people install from the conda-forge repositories...

cboettig commented 5 years ago

I'm not really up to speed with Python package management, but I don't believe this is really a problem that is in any way unique to R. I think it would be immensely difficult for CRAN to provide pre-built binaries for all linux distributions for all packages. Even within a single linux distribution like Debian, the maintainers don't manage to provide pre-built binaries for all CRAN packages, and those that are provided can often lag behind the CRAN versions (which after all are updated continuously, unlike the Bioconductor R packages).

More importantly, I think that mixing and matching prebuilt binaries can lead to really undesirable situations (e.g. http://vincebuffalo.org/notes/2017/08/28/notes-on-anaconda.html). It is really hard to know what versions of which external libraries different binaries were built against, whether they will be compatible, etc. I would never recommend R users install from conda-forge because of these issues.

Docker gives us a really nice way of creating a consistent and transparent build process with control over our libraries and versions by pre-building things from source, and then we just pull down the Docker image as the 'ultimate' pre-built binary. In this way, it's really easy to know which version of jags, libgdal, and other critical libraries are being linked, which compiler versions are used and so forth.

I think it's strongly preferable to define the environment you want in your base Docker images rather than try to pull in binaries built under more opaque settings outside of your control. Of course that means taking responsibility for maintaining that stack, which may not be ideal -- that's the purpose of the rocker images in the first place: they already do exactly this.

I do recognize @betatim's point about mixing buildpacks, but at least this way you can do so in a way that is more transparent, rather than nesting virtualization inside virtualization. The rocker images are built on vanilla Debian images -- the versioned stack builds on the stable Debian release with no additional apt repositories, so you have a very predictable and standard set of libraries to build against.

Okay, many apologies for the rant; you probably know these details much better than I do. I've just seen a lot of thorny install and runtime issues, both with Python and R, that arise from chimeric combinations of different prebuilt binaries and virtualized environments.

choldgraf commented 5 years ago

Thanks for the input @cboettig - I'm trying to synthesize the main ideas from your post: are you suggesting that R users with repo2docker/Binder would be better off using a Dockerfile and sourcing their image from the Rocker images?

The biggest challenge with this is that repo2docker's goal is to let users "start from scratch" and only add the packages they want in an explicit fashion. Moreover we are intentionally treating anything that requires knowledge of Docker as "advanced" and probably not suitable for most users.

It sounds like our potential solutions are:

  1. Tell users to compile packages during the build (as we do now), which takes up extra time and has some constraints with RAM/CPU per @karthik's first post
  2. Tell users to install from either APT or Anaconda, which may be faster but obscures some of the lower level dependency versioning in unpredictable ways
  3. Tell users to use a Dockerfile that sources a Rocker image and give them instructions on how to get this working with repo2docker, which is sort of a hack around the whole point of repo2docker (which is to avoid using Dockerfiles) :-)

Am I missing something?

betatim commented 5 years ago

I'd challenge the assumption in 1) that some dependencies don't build because of limits on mybinder.org. I asked a while ago for examples and so far no one has pointed us to one so I'd change it to "Tell users to compile packages during the build (as we do now), which takes up extra time".

I think it is still unclear why there aren't more binary packages for R on Linux, given that for Python packages (which can require a compiler during install) there are binary packages for Linux (https://github.com/pypa/manylinux) and conda provides binaries that work on most Linux systems. Working on this seems like something that would be very useful to the wider R community.

ryanlovett commented 5 years ago

There are binaries available from the c2d4u PPA. I asked Michael Rutter a while back if he had considered making daily snapshots and he said it "would be a great service, I just don't have time to create such a thing at this point." If CRAN were to mirror this PPA I wonder if MRAN would snapshot it along with everything else.

cboettig commented 5 years ago

I agree with @betatim that it's not obvious to me that there is an issue here in the first place with just letting packages install from source. Does mybinder.org cache the builds? I think it does, at least when I send it a Dockerfile (e.g. I see it build my image slowly the first time, but the next time I click the button it's fast, so no problem). That said, a good test case to create a slow build might be:

install.packages(c("tidyverse", "dplyr", "ggplot2", "sf", "rstan", "rjags"), dep=TRUE)

(dep=TRUE installs suggested / build-time dependencies as well).

I don't follow Tim's point about there not being lots of binaries for Linux -- there are binaries available for almost all packages for most of the popular distros. It's just that Linux binaries for R are usually distributed through the distro's standard package managers, while I think Tim is referring to Python binaries that are platform-agnostic and distributed through some other means. It is probably just my ignorance here, but I don't see why that would be preferable; it sounds like it could be awfully difficult to get working in all cases.

My 2c is that the current setup that relies on install.packages() for R users is pretty good for most users and should remain the default. And I think the fallback solution for edge cases should be a Dockerfile.

I think the other alternatives just create more trouble than they are worth -- folks may use the approach when it's not necessary and install.packages() would do, and run into opaque issues that are hard to debug, while most real edge cases may still fail.

I have never liked the idea of inventing yet another custom solution such as an apt.txt (or the endless config setups you see in various CI platforms and other projects in this space for specifying dependencies). The resulting behavior of such approaches depends on details that are not part of the config but are set upstream, and are opaque to me as a user. They are hard to reason about, hard to get help on, and often fragile in the long run. I know writing a Dockerfile sounds advanced, but to most users one config file is just as opaque and scary as the next, and there's a lot more help out there on writing Dockerfiles.

choldgraf commented 5 years ago

My original comments about super long and/or memory-limited builds in R were just from @karthik telling me his attempts at building a tidyverse repo weren't working... maybe that isn't reproducible, though?

(as a friendly aside, while I enjoy the back-and-forth about the merits of linux binaries etc for the R community, can we keep conversation in this thread to helping us decide what to recommend for R users in repo2docker? :-) )

karthik commented 5 years ago

@choldgraf I can give you a reproducible example next week. Still takes a very long time to build (as you witnessed once).

betatim commented 5 years ago

Does mybinder.org cache the builds?

Yes - for each commit that is launched we build it once, and if the build succeeds, the next time you click the link you get the already-built image and don't incur the build-time delay.

Building the tidyverse from master on mybinder.org takes so long that I close my browser (the build continues) and come back later.

My point about binaries was that I didn't quite understand why install.packages has access to binaries for OSX and Windows, but not for linux. From some of the comments it sounded like the reason was technical, which is why I brought up manylinux as an example that works. From the latest comments it sounds like historically(?) people started using the package managers of their distro, which would explain it.
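
For context, a hedged illustration of the mechanism: install.packages() picks between source and binary via the pkgType option, which CRAN builds of R typically set to a binary-capable value on macOS and Windows but leave at "source" on Linux:

```r
getOption("pkgType")
#> "source"  on a typical Linux build of R
#> "both"    on CRAN builds for macOS and Windows (prefer binary, fall back to source)

# Forcing a binary install on Linux fails, since CRAN publishes no Linux binaries
# (exact wording of the error varies by R version):
# install.packages("dplyr", type = "binary")
#> Error: type 'binary' is not supported on this platform
```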

cboettig commented 5 years ago

why install.packages has access to binaries for OSX and Windows, but not for linux

ah, I see your point. To me it's always been the first part of this that is more surprising and more unusual. My experience with just about every other language has been that not providing platform-specific binaries is the default.

I think R core decided relatively early on that many Mac and Windows users might not have the C & Fortran compilers and dynamic system libraries installed, or know how to install them, so it has long provided binaries for those platforms to make R easier to adopt (and I suspect this contributed greatly to R's adoption). For the Linux users, they probably thought "hey, you use Linux, you know how to do these things (or ask your sysadmin)."

As you know, pre-building static binaries for packages that link external libraries can be a huge challenge -- no more dynamic linking; everything has to be packaged with the binary, which also has different implications for software licenses etc. There are often packages that get stuck a few versions behind in the Mac and/or Windows builds for various reasons, sometimes for years. A small volunteer team keeps these binaries going, sometimes with some pretty clever/convoluted config files. Until the advent of things like Ubuntu snaps or Docker, I was unaware of anything else that tried to tackle distribution/platform-agnostic binaries.

choldgraf commented 5 years ago

@karthik it sounds like the main question to answer here is "does it just take a really long time, or does it never build?". @betatim seems to think it's the former, though we should figure out if it's the latter in some (common) use cases.

karthik commented 5 years ago

I'm totally fine with it taking a long time, especially since it will be cached for future runs. My issue was that a couple of real-world examples (not just the tidyverse) were never building.

yuvipanda commented 5 years ago

@cboettig for Python, the recommendation is to not install anything from apt, since those packages go out of date very quickly. Distributing binaries that work well is a PITA, but the Python community has actually done a really awesome job with https://github.com/pypa/manylinux, and most installs from PyPI bring in binaries on Linux too now. This is how we sidestep this problem with requirements.txt, for example.

@choldgraf @karthik could either of you provide a reproducible example? We want to make sure R users can use Binder in a first-class way, and will try to fix this.

cboettig commented 5 years ago

@yuvipanda Thanks! This does a good job of showing differences in approaches in different communities.

Of course you know this, but because I think it could confuse other readers here: I think it is very misleading to say "not install anything from apt, since those go out of date very quickly." apt is as up to date as the repos you choose -- if your apt/sources.list points only at some LTS distribution that was released 3 or 4 years ago, things are way out of date. If you use something more current, like debian:testing, debian:unstable, or you add some community PPAs that provide nightly builds, you can get something much more current. Michael Rutter in particular maintains a widely used PPA for most (but not all) CRAN packages that is quite up to date, and the R spatial community depends heavily on the ubuntugis PPA for more recent binaries.

The manylinux approach is very interesting, and definitely a different way to go from providing a PPA (and somewhat more general). It looks to me like the compromise there is that everything is compiled with the ancient compilers of CentOS 5 in order to be compatible. Some R users don't like that the rocker/versioned stack builds the latest versions with debian:stable compilers, which are modern by comparison -- which is another reason we have the separate debian:testing-based stack (and of course you can just add the above-mentioned PPAs to most recent Ubuntu systems). I only mention this to underscore the point I think we can all agree on: providing pre-built binaries is a fundamentally hard problem with inevitable trade-offs. I'm all for ameliorating the situation for the user, but I'm perpetually wary that we will get ourselves into trouble with overly general statements or by trivializing these challenges.

yuvipanda commented 5 years ago

I'm with you all the way, @cboettig - especially on 'binaries are hard'!

To qualify my statement about apt, I'll say 'do not install Python packages with apt'. Unlike the R community, which seems to provide up-to-date builds of many libraries on apt in a timely fashion, Python library upgrades are a lot more scattered & unpredictable. For example, the most bleeding-edge version of the popular Django package on Debian sid is almost 18 months out of date. There are no PPAs that are generally known to provide known-good builds of Python packages. And there is no pressure for improvement here, since manylinux1 exists.

I think @karthik and @choldgraf's original issue is that builds from install.R are too slow & time out. Recommending people install R packages from apt whenever possible is one solution. However, builds should never really time out, so that's a bug in mybinder.org / binderhub that we'd love to get to the bottom of - completely separate from preferred methods of installing packages.

cboettig commented 5 years ago

Yup, :100:% on the same page.

Re slow builds, I've just mocked up https://github.com/cboettig/r with an install.R file that should take a while to build (I think over an hour on most machines, and some installs like rstan may fail to compile if the machine doesn't have at least a gig or so of free memory). I adjusted the Binder links and queued a build on Binder; let's see what happens.

karthik commented 5 years ago

Thanks for this @yuvipanda and @cboettig

I'll come up with a couple of examples and update here tomorrow.

cboettig commented 5 years ago

@yuvipanda note that I get something that looks like an error but isn't one if I sleep my machine and re-open it. The Binder web page favicon changes to an error icon, and the log shows: "Error: Failed to connect to event stream". Refresh the page and all looks well again (the log is still chugging away compiling R packages from source... about 53 minutes in now), but a user could mistake this for a build error.

nuest commented 5 years ago

I have played around a bit with source vs. apt installation in the context of https://github.com/jupyter/repo2docker/pull/457. Though I can't contribute much beyond confirming that "pre-built binaries are hard", it shows the additional challenge of binary R packages installed by repo2docker vs. source packages installed by a user, combined with Docker build caching. I did not get it to work; it might be a stupid error on my part...

https://gist.github.com/nuest/8beca3b75bba97f107a314798879a2fc

cboettig commented 5 years ago

My example mentioned above built successfully (albeit without sf installed, since I hadn't updated the MRAN date). Not sure how long it took to build (certainly over an hour), but this morning when I clicked the button it launched from the cached version, nice and spiffy-like.

choldgraf commented 5 years ago

@cboettig would you say that the build you triggered is a "typical" workflow for an R user? Not "the whole stack they ever use" but "a reasonable stack to reproduce an analysis"? If so, then we should really find a way to make people not wait an hour for that to build :-/

karthik commented 5 years ago

a reasonable stack to reproduce an analysis

That is exactly what I was trying to do at Numfocus. I have a few ideas I'd like to discuss (and demo). Are you and @yuvipanda around for a chat next week at BIDS?

cboettig commented 5 years ago

@choldgraf "typical" will vary widely of course, but it's realistic or even small for large spatial analysis.

betatim commented 5 years ago

PPA support

Currently adding PPAs is not possible: apt.txt doesn't support it and postBuild doesn't run as root. I don't recall why we made it like this; I just remember that we have discussed it before. Might be worth a trip to the archives.

Pre-installing some packages via a different mechanism, which can then be re-installed depending on exactly which command users run, feels like a road to lots of maintenance burden and user confusion :-/ I think when considering this we should measure it against telling users to use (pre-made) Dockerfiles that use APT packages or rocker base images. There we'd be building on something that already has adoption and would "only" have to point users in the right direction.

I am with Carl that having to wait the first time isn't a big deal on a hosted platform like mybinder.org. Where it becomes annoying is when you use repo2docker locally. To speed up local r2d use there are some ideas being implemented that could help: #461, #478 and #410.

yuvipanda commented 5 years ago

I am making a binder-examples repo that uses the rocker base images here: https://github.com/binder-examples/rocker

This should be much faster than the install.R method for this specific use case.

I've also opened https://github.com/rocker-org/binder/pull/30 to add better shiny support to the rocker/binder image.

LennertSchepers commented 5 years ago

My example mentioned above built successfully (albeit without sf installed, since I hadn't updated the MRAN date). Not sure how long it took to build (certainly over an hour), but this morning when I clicked the button it launched from the cached version, nice and spiffy-like.

I just tried the @cboettig example (https://mybinder.org/v2/gh/cboettig/r/master?urlpath=rstudio), but it looks like the sf package is still not installed, maybe because sf depends on quite a few spatial system libraries (e.g. GDAL, GEOS, ...)?

Nevertheless, having never used docker, I found it very easy to copy this small Dockerfile (linking to rocker/binder) to my repository: https://github.com/rocker-org/binder/blob/master/binder/Dockerfile. Now sf and all other packages work like a charm. I'm impressed by Binder, thank you!

cboettig commented 5 years ago

@LennertSchepers that's a good point -- using an install.R as in my little example (which is just a modified fork of binder-examples/r) is a bit misleading since the R function install.packages() does not throw an ERROR when a package fails to install.
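
One way to make such failures loud in an install.R (a sketch, not the approach used in the example repo; the package list is illustrative): check afterwards that everything requested is actually installed and stop() if not, so the image build aborts instead of silently continuing.

```r
pkgs <- c("tidyverse", "sf", "rstan")   # illustrative package list
install.packages(pkgs)

# install.packages() only warns on failure, so verify the result explicitly;
# stop() makes Rscript exit non-zero, which fails the Docker build step.
missing <- setdiff(pkgs, rownames(installed.packages()))
if (length(missing) > 0) {
  stop("failed to install: ", paste(missing, collapse = ", "))
}
```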

Like you, I also use the rocker-org/binder Dockerfiles as a base image instead for most of my binder projects, since the spatial system libraries are all pre-installed there.

cboettig commented 5 years ago

Also, on a related note: it would probably be preferable to list dependencies in a DESCRIPTION file and use devtools::install_deps() in the install.R script, rather than calling install.packages() directly. I think this should error properly when packages fail to install, and it has the added advantage of not re-installing any package that is already installed, which can make for much shorter build times. It also means that dependencies (and possibly other metadata) are listed in a more standard, easily parsed file format (DESCRIPTION files follow the Debian control file format). I've updated my fork to test that approach...
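
A sketch of what that might look like (the DESCRIPTION contents below are only an illustration, not the contents of the fork mentioned above):

```r
# Given a DESCRIPTION file at the repository root along the lines of:
#
#   Package: myanalysis
#   Version: 0.0.1
#   Imports: dplyr, ggplot2, sf
#   Suggests: rmarkdown
#
# an install.R only needs one call; packages already installed are skipped.
devtools::install_deps(".", dependencies = TRUE)
```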

betatim commented 5 years ago

DESCRIPTION files should get recognised by repo2docker already: https://repo2docker.readthedocs.io/en/latest/config_files.html#description-install-an-r-package

betatim commented 5 years ago

Ref: #716 for instant rebuilds