opensafely-core / r-docker

Docker image for running R scripts in OpenSAFELY

Rebuild R image #121

Closed bloodearnest closed 1 year ago

bloodearnest commented 1 year ago

We cannot currently rebuild the R image from scratch. We have to add on to the existing R image we have.

This prevents various improvements, and means the R image is a special snowflake compared to our other images.

remlapmot commented 1 year ago

If helpful, I have done this in 3 branches using 2 different approaches.

(in these I previously used some MRAN/Microsoft snapshot URLs but sadly Microsoft are discontinuing that service at the end of this month, news here)

In each case you can run the master-branch-name.sh script, e.g. master-branch-01.sh, to build. Depending on the speed of your internet connection, branches 01 and 02 run in under 20 minutes because the packages are prebuilt.

bloodearnest commented 1 year ago

Hi Tom.

We have an approach that seems to be working based on using renv to build specific versions of libraries. I'm currently running a test build of a new image against all OpenSAFELY R code (it will take a while...) to make sure it doesn't break anything.
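For readers unfamiliar with renv, pinning specific library versions can look roughly like the sketch below. The package names and versions are hypothetical placeholders, not the actual pins used for the image, and the install step is switched off by default because it needs network access and renv itself.

```r
# Hypothetical pins: these package names and versions are placeholders only.
pins <- c(dplyr = "1.0.7", ggplot2 = "3.3.5")

# renv::install() accepts "package@version" strings.
specs <- paste0(names(pins), "@", pins)
print(specs)

# Flip to TRUE to actually install (needs network access and renv).
run_install <- FALSE
if (run_install && requireNamespace("renv", quietly = TRUE)) {
  renv::install(specs)
}
```

Keeping the pins in one named vector makes it easy to review which versions an image was built with.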

My approach used renv in a similar way to your renv branch, AFAICT, except:

Like your branch, we're also switching from 18.04 to 20.04 as the underlying series, mainly because 18.04 is nearly EOL. This does mean some of the underlying system libraries have changed slightly, but the R libraries are all the same.

I'll work on getting a PR up, and I'd love your feedback on it!

Once we've switched and have a stable base to work from, we can work through some of the other issues you've called out.

We want to move away from using a single :latest version of all our runtime images, and towards explicit versions, e.g. run: r:4.0 ... or run: r:4.2. When we do that, we'll potentially be in a position to switch to using pre-built archives, and use more of rocker's tooling to build images.

remlapmot commented 1 year ago

Sounds good Simon, I'm happy to look.

I assume from that package name being specified at 4.0.5, that will bump the version of R from 4.0.2 to 4.0.5. In general I guess that it's good to be at the end of a patch series. Posit/RStudio only provide end of patch series versions of R (in addition to the current version) in their posit.cloud environment. Or was that a typo?

Another reason it's good to do this, is that although the tidyverse/Posit/RStudio policy is for their packages to work with the last 5 minor releases of R - which usually equates to 5 years - there are more packages on CRAN by other teams starting to require R version 4.1.0 because I think that's when the native pipe was introduced to R (|> as opposed to magrittr/dplyr pipe %>%). I ran update.packages(ask = FALSE) in the container and it only failed to update 1 package - Gmisc - due to that package using the native pipe. So it would be good to have a subsequent tagged version using at least R 4.1.0.
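A small sketch of why a package using the native pipe fails on R 4.0.x: code containing |> does not even parse before R 4.1.0. The helper names below (has_native_pipe, parses) are made up for illustration.

```r
# TRUE if the given R version is new enough for the native pipe (R >= 4.1.0).
has_native_pipe <- function(version = getRversion()) {
  as.package_version(version) >= "4.1.0"
}

# TRUE if `code` parses under the running R; the code is never evaluated.
parses <- function(code) {
  !inherits(try(parse(text = code), silent = TRUE), "try-error")
}

has_native_pipe("4.0.5")   # FALSE
has_native_pipe("4.2.0")   # TRUE

# Depends on the running R: parses on 4.1.0 or later, fails to parse on 4.0.x.
parses("c(1, 2) |> sum()")
```

The pipe expression is kept inside a string so that the script itself still loads on an older R.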

bloodearnest commented 1 year ago

Ok, PR is here!

https://github.com/opensafely-core/r-docker/pull/123

bloodearnest commented 1 year ago

> Sounds good Simon, I'm happy to look.
>
> I assume from that package name being specified at 4.0.5, that will bump the version of R from 4.0.2 to 4.0.5. In general guess that it's good to be at the end of a patch series. Posit/RStudio only provide end of patch series versions of R (in addition to the current version) in their posit.cloud environment. Or was that a typo?

Yes, this is deliberate, to bring us up to date with the latest 4.0 release. This should be a backwards-compatible upgrade; it didn't seem to cause any issues in testing, and it is easy enough to roll back if we need to.

> Another reason it's good to do this, is that although the tidyverse/Posit/RStudio policy is for their packages to work with the last 5 minor releases of R - which usually equates to 5 years - there are more packages on CRAN by other teams starting to require R version 4.1.0 because I think that's when the native pipe was introduced to R (|> as opposed to magrittr/dplyr pipe %>%). I ran update.packages(ask = FALSE) in the container and it only failed to update 1 package - Gmisc - due to that package using the native pipe. So it would be good to have a subsequent tagged version using at least R 4.1.0.

Yep.

I'd like to publish an r:4.2 image, with the same set of libraries, but at their latest versions. Then OpenSAFELY users can opt in to that by using r:4.2 in their project.yaml.

But we'll need to do that as a series of steps. We'd probably try to take a different approach, using pre-built CRAN packages rather than building from source.

remlapmot commented 1 year ago

great thanks indeed Simon

(I have teaching stress on Tuesday, so it might take me until Wednesday to have a look at the PR.)

It would be great to make pre-built binary CRAN packages - to do that you need to make what is called a CRAN-like repository. For my own interest and also because Iain mentioned this a few months ago I wrote a blog post about how to do that for Linux binary packages

https://remlapmot.github.io/post/2022/make-linux-binary-cran-like-repo/
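As a rough illustration of what a CRAN-like repository is (this is not taken from the blog post), the sketch below builds a throwaway repo with base R only. It uses a source tarball of a made-up package called demo; a repo of Linux binaries would need the built packages placed in the repo instead, and the exact layout for that is not shown here.

```r
# Throwaway repo location; src/contrib is the standard source-package path.
repo <- file.path(tempdir(), "repo")
contrib <- file.path(repo, "src", "contrib")
dir.create(contrib, recursive = TRUE, showWarnings = FALSE)

# Minimal source tree for a hypothetical package named "demo".
pkg_parent <- file.path(tempdir(), "pkgsrc")
dir.create(file.path(pkg_parent, "demo"), recursive = TRUE, showWarnings = FALSE)
writeLines(
  c("Package: demo", "Version: 0.1.0", "Title: Demo Package",
    "Description: A placeholder package.", "License: MIT"),
  file.path(pkg_parent, "demo", "DESCRIPTION")
)

# Bundle it as demo_0.1.0.tar.gz inside the repo.
old_wd <- setwd(pkg_parent)
utils::tar(file.path(contrib, "demo_0.1.0.tar.gz"), files = "demo",
           compression = "gzip", tar = "internal")
setwd(old_wd)

# Write the PACKAGES index files that make the directory CRAN-like.
n_indexed <- tools::write_PACKAGES(contrib, type = "source")
index <- read.dcf(file.path(contrib, "PACKAGES"))
print(index[, c("Package", "Version")])
```

Once the index exists, the directory can be served over HTTP or used through a file:// URL as the repos argument to install.packages().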

I know of 2 organisations which have publicly available CRAN-like repos with Linux binary packages - the Posit/RStudio package manager

https://packagemanager.posit.co/client/#/repos/2/overview

which makes prebuilt binaries available for Bionic, Focal, and Jammy (as well as several other distros - it's incredibly impressive, as there are snapshots as well)
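Pointing R at those binaries might look like the sketch below. The URL pattern and the user agent string are my understanding of the package manager's setup instructions, so treat both as assumptions and check them against its Setup page; ppm_repo is a made-up helper name.

```r
# Hypothetical helper: build a repo URL for a distro and a snapshot.
# `snapshot` is "latest" or a date such as "2023-06-01" (assumed URL pattern).
ppm_repo <- function(distro = "focal", snapshot = "latest") {
  sprintf("https://packagemanager.posit.co/cran/__linux__/%s/%s", distro, snapshot)
}

# Assumption: the service uses the user agent to decide whether to serve
# binaries, so it should name the R version and platform.
options(
  repos = c(CRAN = ppm_repo("focal")),
  HTTPUserAgent = sprintf(
    "R/%s R (%s)", getRversion(),
    paste(getRversion(), R.version$platform, R.version$arch, R.version$os)
  )
)

print(getOption("repos"))
```

Using a dated snapshot instead of latest would give repeatable builds, which fits the goal of being able to rebuild the image from scratch.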

and the other is the R4PI project (which is actually run by one of the Posit/RStudio developers and uses the same technique)

https://r4pi.org/

The R4PI GitHub org is here

https://github.com/r4pi

I think the build scripts for its CRAN-like repo are in this repo:

https://github.com/r4pi/pkg_builder

Its two CRAN-like repos for the Pi are available from

https://pkgs.r4pi.org/
https://pkgs.r4pi.org/armv7l/index.html
https://pkgs.r4pi.org/aarch64/index.html