rocker-org / rocker-versioned2

Run current & prior versions of R using docker. rocker/r-ver, rocker/rstudio, rocker/shiny, rocker/tidyverse, and so on.
https://rocker-project.org
GNU General Public License v2.0

Dockerfile update (e.g. CRAN URL) improvement #162

Status: closed by eitsupi 3 years ago

eitsupi commented 3 years ago

Currently, the CRAN URL is configured manually, and there have been several misconfigurations in the past (#127, #141). (By the way, the CRAN URL for 4.0.5 is currently set to May 19, 2021, but since 4.1.0 was released on May 18, wouldn't it be better to reset it to May 17?)

I think the following steps to determine the CRAN URL can be automated, and GitHub Actions may be used to automatically update the URL.

  1. Detect the R release date.
  2. Detect the Ubuntu LTS's codename.
  3. Find the CRAN URL closest to the R release date.

We can check the release date of R by referring directly to the R SVN repository with the rversions::r_versions function.

$ docker run --rm -it rocker/tidyverse Rscript -e "tail(rversions::r_versions())"
    version                date                nickname
118   4.0.1 2020-06-06 07:05:16          See Things Now
119   4.0.2 2020-06-22 07:05:19        Taking Off Again
120   4.0.3 2020-10-10 07:05:24 Bunny-Wunnies Freak Out
121   4.0.4 2021-02-15 08:05:13       Lost Library Book
122   4.0.5 2021-03-31 07:05:15         Shake and Throw
123   4.1.0 2021-05-18 07:05:22         Camp Pontanezen
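
The date arithmetic implied above can be sketched in a few lines of Python. This is only an illustration, assuming (from the 4.0.5 / 4.1.0 example earlier in this comment) that each version is frozen on the day before the next release; the name freeze_dates is hypothetical and not part of rversions or of this repository.

```python
from datetime import date, timedelta

# Release dates copied from the rversions output above (time of day dropped).
R_RELEASES = [
    ("4.0.4", date(2021, 2, 15)),
    ("4.0.5", date(2021, 3, 31)),
    ("4.1.0", date(2021, 5, 18)),
]


def freeze_dates(releases):
    """Map each R version to the last day before the next release.

    Assumption: a version is frozen one day before its successor is released.
    The newest version has no successor yet, so it maps to None.
    """
    ordered = sorted(releases, key=lambda item: item[1])
    frozen = {}
    for (version, _), (_, next_release) in zip(ordered, ordered[1:]):
        frozen[version] = next_release - timedelta(days=1)
    frozen[ordered[-1][0]] = None
    return frozen


if __name__ == "__main__":
    for version, frozen_on in freeze_dates(R_RELEASES).items():
        print(version, frozen_on)
```

With the data above, 4.0.5 maps to 2021-05-17, which matches the date suggested at the top of this issue.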

The release dates and codenames of Ubuntu can be found in the /usr/share/distro-info/ubuntu.csv file included with Ubuntu.

$ docker run --rm -it rocker/tidyverse Rscript -e "tail(read.csv('/usr/share/distro-info/ubuntu.csv'))"
     version          codename  series    created    release        eol
29     18.10 Cosmic Cuttlefish  cosmic 2018-04-26 2018-10-18 2019-07-18
30     19.04       Disco Dingo   disco 2018-10-18 2019-04-18 2020-01-18
31     19.10       Eoan Ermine    eoan 2019-04-18 2019-10-17 2020-07-17
32 20.04 LTS       Focal Fossa   focal 2019-10-17 2020-04-23 2025-04-23
33     20.10    Groovy Gorilla  groovy 2020-04-23 2020-10-22 2021-07-22
34     21.04     Hirsute Hippo hirsute 2020-10-22 2021-04-22 2022-01-22
   eol.server    eol.esm
29
30
31
32 2025-04-23 2030-04-23
33
34
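
Picking the LTS codename for a given date from that table could look roughly like the following. It is a sketch, assuming that LTS rows are the ones whose version string ends in "LTS" (as in the output above); the embedded CSV keeps only the columns used here, and latest_lts_series is a hypothetical helper name.

```python
import csv
import io
from datetime import date

# A few rows in the layout shown above; only the columns used here are kept.
UBUNTU_CSV = """version,codename,series,release
19.10,Eoan Ermine,eoan,2019-10-17
20.04 LTS,Focal Fossa,focal,2020-04-23
20.10,Groovy Gorilla,groovy,2020-10-22
21.04,Hirsute Hippo,hirsute,2021-04-22
"""


def latest_lts_series(csv_text, on_date):
    """Return the series name of the newest LTS released on or before on_date."""
    candidates = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        released = date.fromisoformat(row["release"])
        if row["version"].endswith("LTS") and released <= on_date:
            candidates.append((released, row["series"]))
    if not candidates:
        raise ValueError(f"no LTS release on or before {on_date}")
    return max(candidates)[1]


if __name__ == "__main__":
    # R 4.1.0 was released on 2021-05-18 according to the rversions output.
    print(latest_lts_series(UBUNTU_CSV, date(2021, 5, 18)))
```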

The existence of the CRAN URL can be checked by using the pak::repo_ping function included in the development version of the pak package.

$ docker run --rm -it rocker/tidyverse Rscript -e "install.packages('pak', repos = 'https://r-lib.github.io/p/pak/dev/'); pak::repo_ping(cran_mirror = 'https://packagemanager.rstudio.com/cran/__linux__/focal/2021-05-17')"
Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)
trying URL 'https://r-lib.github.io/p/pak/dev/src/contrib/pak_0.1.2.9001_R4-0_x86_64-pc-linux-musl.tar.gz'
Content type 'application/gzip' length 9575936 bytes (9.1 MB)
==================================================
downloaded 9.1 MB

* installing *binary* package ‘pak’ ...
* DONE (pak)

The downloaded source packages are in
        ‘/tmp/RtmpVm1wPh/downloaded_packages’
Repository summary:                          source
CRAN          @ packagemanager.rstudio.com     ✔      (1.1s )
BioCsoft      @ bioconductor.org               ✔      (677ms)
BioCann       @ bioconductor.org               ✔      (683ms)
BioCexp       @ bioconductor.org               ✔      (838ms)
BioCworkflows @ bioconductor.org               ✔      (943ms)

I was able to use GitHub Actions to automatically update the CRAN URLs in my repository, which consolidated the management of the URLs into a single file. https://github.com/eitsupi/r-ver/pull/57/files

cboettig commented 3 years ago

thanks, yeah, automating the updates for a new release would be brilliant. interested in prepping PR for this?

Fixed date to 2021-05-17

eitsupi commented 3 years ago

Yes, of course I would like to contribute with PR. I think I can provide the following files at this time.

However, I don't yet know how to use that file to generate the stacks JSON files, so the CRAN URLs etc. would need to be copied manually from the generated file into the stack files. Is that OK?

cboettig commented 3 years ago

@eitsupi very nice. I don't think it would be all that tricky to go from your R script generating the versions and CRAN URL to one that updates the stack files?

Still, our current stack file pattern, which requires a new .json file to be created at each new release, is rather cumbersome, with a lot of duplication. It would be preferable for a stack file to simply have an array for the different R versions, instead of a new stack file for each version. This merely needs a good way of distinguishing between env vars etc. that remain fixed over each new release vs those like the CRAN URL that need to be updated. Of course the R script that generates the Dockerfiles from the stacks would need updating to the new syntax. In that manner, a new release would be fully automated while also streamlining the config file situation a little more.

I think that's quite do-able but haven't had a chance to carve out the time!

eitsupi commented 3 years ago

@cboettig I agree that it is best to separate the variable part of the stack file from the fixed part, and when upgrading, update only the one file that contains the variable.

Probably the easiest way is to pass the values as build args to ARG instructions in the Dockerfile at docker build time, but that's different from the current build system, which has a separate Dockerfile for each image. I also don't know if DockerHub supports such a build system; we can use GitHub Actions to build it (if it doesn't time out...).
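
As a rough illustration of the build-arg approach, the command line for one image could be assembled from a dictionary of variables. This is a sketch only: docker_build_command is a hypothetical helper, and it assumes a Dockerfile that declares matching ARG names such as R_VERSION and CRAN.

```python
def docker_build_command(image, tag, build_args, context="."):
    """Return an argv list for `docker build` with one --build-arg per variable."""
    command = ["docker", "build", "--tag", f"{image}:{tag}"]
    for name, value in sorted(build_args.items()):
        command += ["--build-arg", f"{name}={value}"]
    command.append(context)
    return command


if __name__ == "__main__":
    args = {
        "R_VERSION": "4.0.5",
        "CRAN": "https://packagemanager.rstudio.com/cran/__linux__/focal/2021-05-17",
    }
    print(" ".join(docker_build_command("rocker/r-ver", "4.0.5", args)))
```

Returning a list rather than a single string means the result can be handed to subprocess.run without shell quoting concerns.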

First of all, we need to be able to generate the variables automatically, so let's focus on core-4.0.0.json. The variable parts of that file are the following.

"TAG": "4.0.0"
"FROM": "ubuntu:20.04"
"R_VERSION": "4.0.0"
"CRAN": "https://packagemanager.rstudio.com/cran/__linux__/focal/291"
"FROM": "rocker/r-ver:4.0.0"
"S6_VERSION": "v2.0.0.1"
"RSTUDIO_VERSION": "1.3.959"
"FROM": "rocker/rstudio:4.0.0"
"FROM": "rocker/tidyverse:4.0.0"
"CTAN_REPO": "http://www.texlive.info/tlnet-archive/2020/06/05/tlnet"

Currently, the things I can't generate using the above procedure and have to set manually are S6_VERSION, RSTUDIO_VERSION and CTAN_REPO. (Also, ubuntu:20.04 should be set to ubuntu:focal and CRAN should use a date-based format)
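
The split between fixed and per-release values discussed here could be sketched as below. The layout is hypothetical and much simpler than the real stack files: FIXED holds values that stay the same across releases, RELEASES is the per-version array, and render_stack, UBUNTU_SERIES and CRAN_DATE are names invented for this example.

```python
import json

# Hypothetical: values that stay fixed across releases.
FIXED = {"S6_VERSION": "v2.0.0.1"}

# Hypothetical per-release array holding only the values that change.
RELEASES = [
    {"R_VERSION": "4.0.5", "UBUNTU_SERIES": "focal", "CRAN_DATE": "2021-05-17"},
]

RSPM_BASE = "https://packagemanager.rstudio.com/cran/__linux__"


def render_stack(release, fixed=FIXED):
    """Combine one release entry with the fixed values into a flat dict."""
    version = release["R_VERSION"]
    series = release["UBUNTU_SERIES"]
    stack = {
        "TAG": version,
        "FROM": f"ubuntu:{series}",
        "R_VERSION": version,
        "CRAN": f"{RSPM_BASE}/{series}/{release['CRAN_DATE']}",
    }
    stack.update(fixed)
    return stack


if __name__ == "__main__":
    for release in RELEASES:
        print(json.dumps(render_stack(release), indent=2))
```

With this shape, a new R release only adds one entry to RELEASES, while the fixed values are edited in a single place.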

It seems that CTAN_REPO is a date-based URL that is generated daily, so it can be generated automatically. I think S6_VERSION and RSTUDIO_VERSION can easily be generated automatically by getting the release information from GitHub, so I'll check how to do it.

Update: RStudio 1.4.1106 has a release date of 2021-02-11 on GitHub, but the binary seems to have been released on 2021-03-02, so the date available on GitHub cannot be used.

I don't think it would be all that tricky to go from your R script generating the versions and CRAN URL to one that updates the stack files?

I thought there would be more parts that I would need to configure manually, but it certainly looks like the core stack file can be generated automatically. Looking at other definitions, the only other variables I could find were CUDA_VERSION and NCCL_VERSION, which are used in ml-cuda. I do not know where these two come from.

cboettig commented 3 years ago

Thanks.

I also don't know if DockerHub supports such a build system; we can use GitHubActions to build it (if it doesn't time out...).

Starting with 4.x / versioned2 we stopped using the DockerHub automated builds in the versioned stack anyway, since they do not support large numbers of tags; we literally could not add more versions to the rocker/r-ver automated build. So our builds are already local and/or GitHub Actions. There, build timeout is not usually an issue (particularly with RSPM binaries), but network failures on deploy to DockerHub and image sizes are a real issue; runners are too small to build the ml stack and some others.

Currently, the things I can't generate using the above procedure and have to set manually are S6_VERSION, RSTUDIO_VERSION and CTAN_REPO.

We've actually kept S6_VERSION mostly locked, upgrading as needed only -- it hasn't always been safe to upgrade S6_VERSION without sufficient testing, since changes there can change the way the init config files work. Automating the CTAN repo to the date-based archive snapshots should be fine; note that the archive should only be used for frozen images and not the current release, since it has limited capacity. There are scripts in littler and in install_rstudio.sh, I think, that show getting the latest R version (not from GitHub release info).

I thought there would be more parts that I would need to configure manually, but it certainly looks like the core stack file can be generated automatically [...] only other variables I could find were CUDA_VERSION and NCCL_VERSION,

Yeah, happy to start with core stack. The ML stack variables are fixed based on the corresponding locks in nvidia/cuda, but all of that is still a bit of a work in progress and looks like we can do some more sync-ing up there.

eitsupi commented 3 years ago

Starting with 4.x / versioned2 we stopped using the DockerHub automated builds in the versioned stack anyway, since they do not support large numbers of tags, literally could not add more versions to rocker/r-ver automated build. So our builds are already local and/or GitHub Actions (where build timeout is not usually an issue (particularly with RSPM binaries), but network failures on deploy to DockerHub and image sizes are a real issue; runners are too small to build the ml stack and some others.

I was wondering why the images were being updated when GitHub Actions was not running; does that mean it was being built locally? Thank you for all your hard work.

If we only consider building with GitHub Actions, I think we can use a multi-stage build and a build matrix to build all images with a single Dockerfile and a json file with variables. (You may need to split your workflow into multiple workflows to get around the GitHub Actions limitations, but in any case, you'll have very few files to maintain.)

We've actually kept S6_VERSION mostly locked, upgrading as needed only -- it hasn't always been safe to upgrade S6_VERSION without sufficient testing since changes there can change the way the init config files work.

OK, it looks like S6_VERSION should be hard-coded and updated manually.

Automating the CTAN repo to the date-based archive snapshots should be fine, note the archive should only be used for frozen images and not current release since it has limited capacity.

As with CRAN, I will make sure that the latest version is set to a dedicated value.
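
A small sketch of that rule for CTAN_REPO, using the archive URL pattern from the core-4.0.0.json excerpt earlier in the thread. The function name ctan_repo is hypothetical, and the mirror used for the current release is left as a parameter because its value is not given in this thread.

```python
from datetime import date

CTAN_ARCHIVE = "http://www.texlive.info/tlnet-archive"


def ctan_repo(frozen_on, current_mirror):
    """Return the dated archive URL for frozen images.

    frozen_on is None for the current release, which should not use the
    capacity-limited archive; current_mirror is returned in that case.
    """
    if frozen_on is None:
        return current_mirror
    return f"{CTAN_ARCHIVE}/{frozen_on:%Y/%m/%d}/tlnet"


if __name__ == "__main__":
    # Placeholder mirror URL, for illustration only.
    placeholder = "https://example.org/tlnet"
    print(ctan_repo(date(2020, 6, 5), placeholder))
    print(ctan_repo(None, placeholder))
```

For 2020-06-05 this reproduces the CTAN_REPO value listed for 4.0.0 above.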

There's scripts in littler and in install_rstudio.sh I think that show getting the latest R version (not from GitHub release info).

Since the date used for a variable is only fixed once the next version of R is released, I think we need to generate the variables for at least the last two versions. In other words, we need to get the latest RStudio version as of a past R release date. (We could look up the latest RStudio version only when the script happens to run on the R release date itself, but that is fragile: if that day is missed, the value would not be captured correctly.)

I would like to try to determine the RStudio version from the GitHub release information and use the previous version if the binary has not been released yet. The disadvantage of this method is that the RSTUDIO_VERSION of the old images may be rewritten when the RStudio binary is released, but I don't think this will be a problem since it is hard to imagine that RStudio does not support the last two versions of R.
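
The fallback described here could be sketched as follows. It is an illustration only: pick_rstudio_version is a hypothetical name, the release list would come from GitHub release information, and binary_available stands in for whatever check confirms that the installer can actually be downloaded. The "previous-release" entry and its date are placeholders, not real data.

```python
from datetime import date


def pick_rstudio_version(releases, cutoff, binary_available):
    """Pick the newest usable RStudio version for a given cutoff date.

    releases: iterable of (version, tagged_on) pairs.
    Returns the newest version tagged on or before cutoff whose binary is
    available, falling back to older versions, or None if nothing matches.
    """
    eligible = [item for item in releases if item[1] <= cutoff]
    for version, _ in sorted(eligible, key=lambda item: item[1], reverse=True):
        if binary_available(version):
            return version
    return None


if __name__ == "__main__":
    releases = [
        ("previous-release", date(2021, 1, 1)),  # placeholder entry
        ("1.4.1106", date(2021, 2, 11)),         # tag date mentioned in this thread
    ]
    # Simulate the situation where the 1.4.1106 binary is not yet published.
    not_yet = lambda version: version != "1.4.1106"
    print(pick_rstudio_version(releases, date(2021, 2, 20), not_yet))
```

In the simulated case the function falls back to the older entry; once the binary is published, the same call would return 1.4.1106, which is the rewrite of old images mentioned above.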

cboettig commented 3 years ago

Since the date used for the variable is fixed after the next version of R is released, I think we need to generate the variable for at least the last two versions

right, good call. yeah this sounds fine, I don't think overwriting the old RSTUDIO_VERSION should be an issue, though hopefully most of the time it will be the same version as previously(?)

does that mean it was being built locally?

Correct. runs on one of my servers so could be automated by CRON job easily but is not at the moment. Ideally we'd still build from GH-Actions cron jobs as much as possible (i.e. at least for the smaller images), so it would be great for an automated pipeline to bump the versions on the gh-actions config files too (or maybe better, switch those files over to using methods from the Makefile so they can just call something like make core-latest instead)

eitsupi commented 3 years ago

Correct. runs on one of my servers so could be automated by CRON job easily but is not at the moment. Ideally we'd still build from GH-Actions cron jobs as much as possible (i.e. at least for the smaller images)

Thank you very much for using your own server to build images! It's very easy to set up scheduled execution in GitHub Actions, so it seems like a good idea to do so right away, especially for the daily "devel" build. Just write something like this. https://github.com/eitsupi/r-ver/blob/cc7f15dbcb644c5b73e7534159f1a6c309576db3/.github/workflows/docker-build-push.yml#L1-L8

it would be great for an automated pipeline to bump the versions on the gh-actions config files too

Updating the workflow definition file itself is good, but I think an easier way is to generate a matrix based on an external json file from the workflow and reference the variables in the workflow. This can be achieved as follows. https://github.com/eitsupi/r-ver/blob/cc7f15dbcb644c5b73e7534159f1a6c309576db3/.github/workflows/docker-build-push.yml#L19-L30

The dynamic matrix generated from json is described in the following post. https://github.blog/changelog/2020-04-15-github-actions-new-workflow-features/
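
To make the idea concrete, the JSON that such a workflow step emits could be produced by a script along these lines. This is a sketch with assumed names: build_matrix and the keys r_version and cran are hypothetical, and the input layout is not necessarily that of the linked versions file. The workflow would then read the printed string through the fromJSON expression when defining its matrix.

```python
import json


def build_matrix(versions):
    """Turn a list of per-release dicts into a GitHub Actions matrix object."""
    include = [
        {"r_version": item["R_VERSION"], "cran": item["CRAN"]}
        for item in versions
    ]
    return {"include": include}


if __name__ == "__main__":
    versions = [
        {
            "R_VERSION": "4.0.5",
            "CRAN": "https://packagemanager.rstudio.com/cran/__linux__/focal/2021-05-17",
        },
    ]
    # A single-line JSON string is the easiest form to pass between jobs.
    print(json.dumps(build_matrix(versions)))
```

Because the matrix is derived from the data file, adding a release changes only that file and not the workflow definition.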

By the way, I've never used GNU Make, so I may need to study it......

cboettig commented 3 years ago

Yup, cron triggers have been on the to-do list for a while, https://github.com/rocker-org/rocker-versioned2/issues/13, but never pulled the trigger since most of those gh-actions builds haven't been all that stable. Starting with the devel build on CRON makes :100: sense though!

Your build-matrix looks great, very clever! (will take me a while to quite wrap my head around it though). Definitely seems the way to go, would streamline the gh-actions setup a lot.

eitsupi commented 3 years ago

My new script is now able to generate RSTUDIO_VERSION automatically. (Because of GitHub API rate limits, it can't be called anonymously many times in a row.)

https://github.com/eitsupi/r-ver/blob/816fbdb74da5bdcd457d8288cf8337aa0e761ed6/buildargs/versions.json

I'll try to convert it to stack files this weekend.

eitsupi commented 3 years ago

I'm closing this issue because we've achieved the original goal of automatically updating the CRAN URL for rocker/r-ver Dockerfile. I have created a new issue #181 that covers automation more broadly.

eitsupi commented 2 years ago

The disadvantage of this method is that the RSTUDIO_VERSION of the old images may be rewritten when the RStudio binary is released, but I don't think this will be a problem since it is hard to imagine that RStudio does not support the last two versions of R.

Note: This actually occurred with the RStudio version 2022.02.2+485 released today. (#433)