technologiestiftung / flusshygiene-opencpu-base

Base image with all our dependencies for the opencpu kwb-f/fhpredict api
MIT License
Question: Should we pin the versions of these packages? #4

Closed ff6347 closed 5 years ago

ff6347 commented 5 years ago

@hsonne As the title says: currently we take the latest version on every build. IMHO we should pin these packages to specific versions. If you agree, can you provide the syntax and the versions, or create a pull request or a new branch for it?

https://github.com/technologiestiftung/flusshygiene-opencpu-base/blob/09c87948bcaa2a22f4486fe9dbb9038b5689bcac/Dockerfile#L18

hsonne commented 5 years ago

@fabianmoronzirfas, please try the following:

RUN R -e "install.packages('remotes'); \
          remotes::install_version('curl', version = '4.0'); \
          remotes::install_version('fs', version = '1.3.1'); \
          remotes::install_version('httr', version = '1.4.1'); \
          remotes::install_version('lubridate', version = '1.7.4'); \
          remotes::install_version('raster', version = '2.8-19'); \
          remotes::install_version('Rcpp', version = '1.0.2'); \
          remotes::install_version('rstanarm', version = '2.18.2'); \
          remotes::install_version('sf', version = '0.7-4'); \
          remotes::install_version('sp', version = '1.3-1');"

I used the following R commands to create the main part of the above code:

pkgs <- c('rstanarm', 'sf', 'fs', 'raster', 'sp', 'lubridate', 'httr', 'Rcpp', 'curl')
installed <- installed.packages()
columns <- c("Package", "Version")
versions <- installed[rownames(installed) %in% pkgs, columns]
cat(paste(collapse = "\n", sprintf(
  "remotes::install_version('%s', version = '%s');", 
  versions[, "Package"], versions[, "Version"]
)))

ff6347 commented 5 years ago

fs is missing in the list

hsonne commented 5 years ago

Interestingly, the package was not installed on my computer. I updated the list (see above). Maybe you should run the code that I provided from within the container to get the version numbers that were actually installed so far (as my packages may not all be in their most current version).

ff6347 commented 5 years ago

The build with the pinned versions passes in 30 minutes. https://github.com/technologiestiftung/flusshygiene-opencpu-base/pull/6/checks?check_run_id=212976872#step:6:11566

ff6347 commented 5 years ago

@hsonne After changing the install to the versioned install you suggested, the build of https://github.com/technologiestiftung/flusshygiene-opencpu-fhpredict-api takes pretty long again. Are you sure this is the right way to get the packages by version?

ff6347 commented 5 years ago

Do we need to provide the build = TRUE flag?

hsonne commented 5 years ago

@fabianmoronzirfas I assume that it takes so long because we make one install_version() call per package. The function installs all dependencies, so dependencies that are shared by more than one package may be installed multiple times. I found another solution that I would like you to test:

RUN R -e "tarballs <- c( \
  'Rcpp_1.0.2.tar.gz', \
  'curl_4.0.tar.gz', \
  'fs_1.3.1.tar.gz', \
  'httr_1.4.1.tar.gz', \
  'lubridate_1.7.4.tar.gz', \
  'raster_3.0-2.tar.gz', \
  'remotes_2.1.0.tar.gz', \
  'rstanarm_2.18.2.tar.gz', \
  'sf_0.7-7.tar.gz', \
  'sp_1.3-1.tar.gz' \
); \
urls <- paste0('https://cran.r-project.org/src/contrib/', tarballs); \
install.packages(urls, repos = NULL, type = 'source')"

The listed versions are the most current versions that are found on "https://cran.r-project.org/src/contrib/". Unfortunately, the files are moved to "https://cran.r-project.org/src/contrib/Archive/<package>" once there is a newer version of <package>. So maybe we should always use the most recent archived version. For that case, the installation instructions are:

RUN R -e "paths <- c( \
  'Rcpp/Rcpp_1.0.1.tar.gz', \
  'curl/curl_3.3.tar.gz', \
  'fs/fs_1.3.0.tar.gz', \
  'httr/httr_1.4.0.tar.gz', \
  'lubridate/lubridate_1.7.3.tar.gz', \
  'raster/raster_2.9-23.tar.gz', \
  'remotes/remotes_2.0.4.tar.gz', \
  'rstanarm/rstanarm_2.18.1.tar.gz', \
  'sf/sf_0.7-6.tar.gz', \
  'sp/sp_1.2-7.tar.gz' \
); \
urls <- paste0('https://cran.r-project.org/src/contrib/Archive/', paths); \
install.packages(urls, repos = NULL, type = 'source')"

For the moment, I prefer to go with the most recent versions of the first solution above, otherwise I have to "downgrade" the version requirements of our packages kwb.dwd and fhpredict...
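
The two CRAN locations described above (the current source directory and the per-package Archive directory) follow a fixed naming pattern, so both candidate URLs for a pinned version can be derived from the package name and version alone. The following is a minimal sketch of that pattern; `cran_source_urls` is a hypothetical helper name, not part of any existing tool, and it only prints the URLs without checking that they exist.

```shell
#!/bin/sh
# cran_source_urls is a hypothetical helper. It prints the two places a
# source tarball can live on CRAN: the current directory first, then the
# per-package Archive directory.
cran_source_urls() {
  pkg="$1"
  version="$2"
  base="https://cran.r-project.org/src/contrib"
  printf '%s/%s_%s.tar.gz\n' "$base" "$pkg" "$version"
  printf '%s/Archive/%s/%s_%s.tar.gz\n' "$base" "$pkg" "$pkg" "$version"
}

# Usage: list both candidate locations for sf 0.7-7.
cran_source_urls sf 0.7-7
```

A build step could try the first URL and fall back to the second, which would keep a pinned version installable after CRAN moves the tarball to the Archive.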

ff6347 commented 5 years ago

I think you misunderstood me. We do the install on the image in this repo, which serves as the base.

Then we use this image as the base for the API image.

I do all the installs here, and when I build the image for the API it seems that the installs run again, as if the packages were not present.

ff6347 commented 5 years ago

@hsonne ☝️

mrustl commented 5 years ago

Maybe checking the workflow used in the R package containerit is helpful: https://github.com/o2r-project/containerit

ff6347 commented 5 years ago

@mrustl thanks for the hint. containerit uses rocker images, and rocker uses these scripts from littler. It does not look like there is any version management in there.

I think rocker is more mature than opencpu when it comes to the Docker setup, but for the time being we need to work with the opencpu image as the base.

mrustl commented 5 years ago

No version management, but a function for installing specific versions from CRAN, similar to @hsonne's proposal:

https://github.com/o2r-project/containerit/blob/b3834aae91dc922e78a069f28b0c221993202c66/tests/testthat/test_package_with-versions.R#L9

https://o2r.info/containerit/reference/versioned_install_instructions.html

ff6347 commented 5 years ago

So I guess the solution is in here -->. Too much R for me :-)

hsonne commented 5 years ago

Running the following code...

containerit:::versioned_install_instructions(
  pkgs = data.frame(name = "sf", version = "0.7-7")
)

... results in:

[[1]]
An object of class "Run"
Slot "exec":
[1] "Rscript"

Slot "params":
[1] "-e"                                        "versions::install.versions('sf', '0.7-7')"

So, the Dockerfile generated by the containerit package will contain calls to install.versions() from the versions package to install packages at a specific version. According to the documentation of that function, it can be given all package names and version strings at once, so we could try the following:

RUN R -e "install.packages('versions'); versions::install.versions( \
  pkgs = c('curl', 'fs', 'httr', 'lubridate', 'raster', 'remotes', 'Rcpp', 'rstanarm', 'sf', 'sp'), \
  versions = c('4.0', '1.3.1', '1.4.1', '1.7.4', '3.0-2', '2.1.0', '1.0.2', '2.18.2', '0.7-7', '1.3-1') \
);"

However, I do not see why this should be different from using remotes::install_version()...

mrustl commented 5 years ago

I would recommend using the CRAN snapshot timemachine (MRAN) maintained by Microsoft:

https://stackoverflow.com/questions/39312260/use-checkpoint-and-a-mran-snapshot-as-cran-mirror-with-travis-ci

In total, the flusshygiene R packages are KWB's top 3 R packages by dependency count, each with 100 or more recursive R package dependencies (see: https://github.com/KWB-R/pkgmeta/issues/3):

package               n_dependencies  n_recursive_dependencies
kwb.flusshygiene.app  12              109
fhpredict             13              107
kwb.flusshygiene      9               100

Dependencies are invitations for other people to break your package. -- Josh Ulrich, private communication

http://dirk.eddelbuettel.com/blog/2018/02/28/ http://www.tinyverse.org/

mrustl commented 5 years ago

Rocker also uses MRAN so CRAN R package versions are fixed by date (based on the container build date!)

https://github.com/rocker-org/rocker-versioned/blob/master/r-ver/3.6.0/Dockerfile#L113


  ## install packages from date-locked MRAN snapshot of CRAN
  && [ -z "$BUILD_DATE" ] && BUILD_DATE=$(TZ="America/Los_Angeles" date -I) || true \
  && MRAN=https://mran.microsoft.com/snapshot/${BUILD_DATE} \

(from: https://hub.docker.com/r/rocker/r-ver/dockerfile)

ff6347 commented 5 years ago

Thanks. I'm testing this right now in PR #8

ff6347 commented 5 years ago

I guess doing a BUILD_DATE=$(TZ="Europe/Berlin" date -I) is not what we want. This sets the date to the current date of the build if the env variable does not exist. I added it as a --build-arg instead.

ff6347 commented 5 years ago

But… the safeguard for a non-existent env variable is actually smart. The install fails if the variable is missing, but without throwing an error. It just ends without installing the packages.
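
The fallback discussed here can be sketched as a small shell helper. This is only an illustration of the rocker-style pattern quoted earlier in the thread, not the code used in this repo: `resolve_build_date` is a hypothetical name, and the default time zone is an assumption. It uses BUILD_DATE when it is set and otherwise falls back to today's date, with a warning, because the fallback means the image is no longer pinned to a fixed snapshot.

```shell
#!/bin/sh
# resolve_build_date is a hypothetical helper: print BUILD_DATE if it is set,
# otherwise print today's date in the given time zone (default: Europe/Berlin).
resolve_build_date() {
  tz="${1:-Europe/Berlin}"
  if [ -n "${BUILD_DATE:-}" ]; then
    printf '%s\n' "$BUILD_DATE"
  else
    # Falling back means the build is not pinned to a fixed snapshot.
    echo "BUILD_DATE not set, falling back to today's date" >&2
    TZ="$tz" date +%Y-%m-%d
  fi
}

# Usage: build the snapshot URL from the resolved date.
BUILD_DATE="$(resolve_build_date)"
MRAN="https://mran.microsoft.com/snapshot/${BUILD_DATE}"
echo "$MRAN"
```

The warning goes to stderr so that it shows up in the build log without ending up in the URL.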

ff6347 commented 5 years ago

hm @mrustl @hsonne Any idea why the install is failing?

 > install.packages(c("remotes", "rstanarm", "sf", "fs", "raster", "sp", "lubridate", "httr", "Rcpp", "curl"), repo = 'https://mran.microsoft.com/snapshot/' );
Installing packages into '/usr/local/lib/R/site-library'
(as 'lib' is unspecified)
Warning: unable to access index for repository https://mran.microsoft.com/snapshot/src/contrib:
  cannot open URL 'https://mran.microsoft.com/snapshot/src/contrib/PACKAGES'
> 
> 
Warning message:
packages 'remotes', 'rstanarm', 'sf', 'fs', 'raster', 'sp', 'lubridate', 'httr', 'Rcpp', 'curl' are not available (for R version 3.6.1)

https://github.com/technologiestiftung/flusshygiene-opencpu-base/pull/8/checks?check_run_id=228084450#step:6:779

mrustl commented 5 years ago

The MRAN url is wrong (missing DATE)! https://mran.microsoft.com/snapshot/src/contrib/PACKAGES (https://github.com/technologiestiftung/flusshygiene-opencpu-base/pull/8/checks?check_run_id=228084450#step:6:783)

Instead of e.g.: https://mran.microsoft.com/snapshot/2019-09-19/src/contrib/PACKAGES
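
Because install.packages() only emits a warning when the repository index cannot be reached, a missing date is easy to overlook in the build log. One way to catch it earlier is to validate the date before the URL is handed to R. The sketch below assumes a YYYY-MM-DD date; `snapshot_url` is a hypothetical helper name, and the check covers the format only, not whether a snapshot exists for that day.

```shell
#!/bin/sh
# snapshot_url is a hypothetical helper: print the MRAN snapshot URL for a
# date, or return a non-zero status if the date is not in YYYY-MM-DD form.
snapshot_url() {
  case "${1:-}" in
    [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9])
      printf 'https://mran.microsoft.com/snapshot/%s\n' "$1"
      ;;
    *)
      echo "invalid or missing snapshot date: '${1:-}'" >&2
      return 1
      ;;
  esac
}

# Usage: only continue to the R install step when the date is valid.
if url="$(snapshot_url "${BUILD_DATE:-}")"; then
  echo "installing from $url"
else
  echo "refusing to install without a snapshot date" >&2
fi
```

In a Dockerfile RUN step, the else branch would typically exit with a non-zero status so that the build stops instead of finishing without the packages.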

ff6347 commented 5 years ago

Okay. So it's just an issue with adding the date

mrustl commented 5 years ago

Yep

mrustl commented 5 years ago

Cool. Seems to work now, but @fabianmoronzirfas there is no need to "secure" the date:

install.packages(c("remotes", "rstanarm", "sf", "fs", "raster", "sp", "lubridate", "httr", "Rcpp", "curl"), repo = 'https://mran.microsoft.com/snapshot/***' )

https://github.com/technologiestiftung/flusshygiene-opencpu-base/commit/ace1c21010be1a92f88b2a629e1cde239314a5d5/checks#step:6:771

ff6347 commented 5 years ago

Yeah. I actually hardcoded it into the Dockerfile for now; everything else was failing… 😭

I didn't want to secure it. I wanted to pass it in from the outside as a variable so we can easily update it.
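
Passing the date in from the outside can be sketched with a Docker build argument. This assumes the Dockerfile declares `ARG BUILD_DATE` before the install step, so that the value is visible to RUN as an environment variable. The thread does not show why this approach failed in the workflow, so this is only the general pattern; `build_command` and the image tag are hypothetical. The helper prints the command rather than running it.

```shell
#!/bin/sh
# Assumption: the Dockerfile contains `ARG BUILD_DATE` before the step that
# installs the R packages. build_command and the image tag are hypothetical.
build_command() {
  snapshot_date="$1"
  image_tag="$2"
  printf 'docker build --build-arg BUILD_DATE=%s -t %s .\n' "$snapshot_date" "$image_tag"
}

# Usage: print the build command for a fixed snapshot date.
build_command 2019-09-19 opencpu-base:2019-09-19
```

Updating the pinned snapshot then only means changing the date passed on the command line (or in the CI workflow), not editing the Dockerfile.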

ff6347 commented 5 years ago

The build is now faster again. Takes 3min 40sec here on GH. I'll close this one.