rocker-org / rocker-versioned2

Run current & prior versions of R using docker. rocker/r-ver, rocker/rstudio, rocker/shiny, rocker/tidyverse, and so on.
https://rocker-project.org
GNU General Public License v2.0

Add information on recommended method of installing Bioconductor packages #292

Open pmoris opened 2 years ago

pmoris commented 2 years ago

I'm having some difficulties modifying the rocker tidyverse base image with Bioconductor packages. I've written up my goal, approach and problems more extensively in an issue on littler's github page (https://github.com/eddelbuettel/littler/issues/93), because I thought there was something strange going on with the --repository flag of the install2.r script, although that turned out to be a more low-level issue that can happen when mixing repos and thus has nothing to do with littler itself.

I'm posting this issue here however because I hope that the rocker community can provide some guidance on how to tackle the things I'd like to do, i.e. install bioconductor in (rocker) docker and have the build fail when something goes wrong. I believe that this information could be useful for other users and could be included on the Rocker Project's guide on extending the images.

Very briefly, here are my findings and struggles:

I believe that this issue is not tied to which specific repositories are being used (RSPM or the default Bioconductor ones) and that it could be worthwhile to highlight it somewhere in rocker's guide on modifying and extending the images. E.g. warning users that Bioconductor installs made by calling BiocManager::install() can fail silently, and advising them to verify that using install2.r with all the individual Bioconductor sub-repositories does what they intend it to do.

Am I going about this the wrong way perhaps? I guess I can just forget about pinning a specific repository and just keep track of my images as the unit I need to store for reproducibility? But then again, rocker images for previous versions of R also pin repo URLs, so that seems to be the intended approach. Any other advice or insight into how I can better handle these installations is highly appreciated!

EDIT: I've cross-posted this to the Bioconductor repository as well, since the same kind of addition to their documentation would be useful imo: https://github.com/Bioconductor/bioconductor_docker/issues/38

cboettig commented 2 years ago

Thanks, this mostly sounds accurate. The key distinction here is that, in my understanding, BioC packages are already frozen to the annual R version, much like Ubuntu and other Linux distros do with their default repositories. I believe the bioc installer selects the appropriate repository based on the R version, so the rocker-versioned approach here is basically to leave well enough alone. As we already freeze the R version, the corresponding BioC repo should be determined from that.

Let me know if that makes sense or if I'm missing something. (I'm not an active user of many BioC packages, so I could easily be missing something in my understanding here!)

Agree :100: that we ought to improve the docs about this in any case

eitsupi commented 2 years ago

i.e. install bioconductor in (rocker) docker and have the build fail when something goes wrong.

How about modifying the installBioc.r script to be like the install2.r script in order to make the build fail when the installation fails?

https://github.com/rocker-org/rocker-versioned2/blob/889a33b959a319f59acc634563f8d8eca8abbac0/scripts/bin/install2.r#L81-L84
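
As a rough illustration of that idea (not the actual install2.r code, which is written in R), a Dockerfile step could scan the install log for the warnings R typically emits on a failed install and exit non-zero. The function name install_log_ok, the log path and the warning patterns below are all assumptions made for this sketch.

```shell
#!/bin/sh
# Hypothetical helper: return non-zero if an R install log contains
# warnings that usually indicate a failed package installation.
# The patterns are assumptions based on typical R warning text.
install_log_ok() {
    log_file="$1"
    # A missing or unreadable log is treated as an error, not as success.
    if [ ! -r "$log_file" ]; then
        echo "Cannot read log file: $log_file" >&2
        return 2
    fi
    if grep -E -q 'had non-zero exit status|(is|are) not available|download of package .* failed' "$log_file"; then
        echo "Package installation appears to have failed; see $log_file" >&2
        return 1
    fi
    return 0
}

# Example use in a RUN step (commented out so sourcing has no side effects):
#   Rscript -e 'BiocManager::install("sangeranalyseR")' 2>&1 | tee /tmp/bioc.log
#   install_log_ok /tmp/bioc.log || exit 1
```

Scanning a log is cruder than handling the warning inside R, so this is only a stopgap for cases where the installer script itself cannot be changed.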

pmoris commented 2 years ago

Thanks both for your replies!

How about modifying the installBioc.r script to be like the install2.r script in order to make the build fail when the installation fails?

If that is possible, I'd be stoked! It would solve the major problem I'm facing and also make the behaviour of the script more consistent with not just install2.r, but also installGithub.r!

The key distinction here is that, in my understanding, BioC packages are already frozen to the annual R version, much like Ubuntu and other Linux distros do with their default repositories. I believe the bioc installer selects the appropriate repository based on the R version, so the rocker-versioned approach here is basically to leave well enough alone. As we already freeze the R version, the corresponding BioC repo should be determined from that.

That does indeed make sense! I'm quite new to bioconductor myself (or rather, I've never had the need to delve into the way it manages packages), so here's what I've gathered just now:

In any case, from what I can tell, BiocManager (and BiocVersion) seem to work just fine regardless of whether the bioconductor or the RSPM repository is being used. I.e., users can install a desired version of bioconductor (and will be warned when they try to use a version that is incompatible with the available version of R), and the different repository URLs (BioCSoft, BioCAnn, etc.) will be adjusted automatically (using the repository URL prefix that is set by options("BioC_mirror")).

So all of that seems to work as intended and I agree with your "leave well enough alone" assessment ;) Apologies for writing out this wall of text, but at the very least it helped me get a better grip on things.

Since these specific peculiarities are pretty much unique to bioconductor, I understand that it's a bit difficult to gauge how much of it needs to be documented by the rocker project as opposed to by bioconductor though... Perhaps the fact that BiocManager is installed in tidyverse, but that the default repository is retained, alongside a warning on how best to install BioC packages, could be worthwhile additions?

nick-youngblut commented 1 year ago

The README currently states:

Please install R packages from source using the install.packages() R function or the install2.r script, and use apt only to install necessary system libraries (e.g. libxml2). Do not use apt install r-cran-* to install R packages.

It would be helpful to add info here on Bioconductor package installation (e.g., installBioc.r).

It would also be helpful to include information on how to install Bioconductor packages when installBioc.r is not in the PATH for the rocker image (e.g., r-ver:4.2.1).

cboettig commented 1 year ago

Thanks @nick-youngblut ! PR's always welcome, we're a community-driven project.

nick-youngblut commented 1 year ago

PR's always welcome, we're a community-driven project.

I can see why you'd like help, given how much of a pain writing documentation can be, but asking for help with documentation from those that are currently looking for the documentation seems like it will lead to documentation edits that do not incorporate best-practices, as defined by the software developers. For instance, I'm currently trying the following:

RUN install2.r --ncpus 2 --error \
    argparse ape dplyr tidyr BiocManager && \
  R -e 'BiocManager::install("sangeranalyseR")' && \
  rm -rf /tmp/downloaded_packages

...but I don't know if it will work (the build is still running) or if it follows best-practices. If it does work, I can create a PR with an updated README, but I'm guessing the person(s) reviewing the PR will just have to heavily edit the changes.

cboettig commented 1 year ago

Hey @nick-youngblut , thanks! yup, a PR is a great way for us as a community to discuss these things! This is not just because I am too lazy to update the readme, but because that discussion process of issues and PRs usually gets us to a better point that meets the needs of other users, and is also easier for other developers and community members to chime in.

I agree with you that installBioc.r is probably the best choice for most users, and we should probably start by documenting that more clearly!

Like you note, that's not so helpful since unlike install2.r or installGithub.r, it's not sym-linked onto the default PATH. These helper utilities are part of littler, so it's available in $R_HOME/site-library/littler/examples/installBioc.r -- and we should probably symlink it in https://github.com/rocker-org/rocker-versioned2/blob/master/scripts/setup_R.sh#L78 I think.
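
Until the script is linked by default, the symlink could be added in a derived image along these lines. This is only a sketch, assuming littler is installed under $R_HOME/site-library (as in the path above) and that /usr/local/bin is on the PATH; the function name link_littler_script is made up for illustration and is not part of littler or rocker.

```shell
#!/bin/sh
# Hypothetical helper: symlink one littler example script into a bin directory.
# Arguments: script name, littler examples directory, target bin directory.
link_littler_script() {
    script_name="$1"
    examples_dir="$2"
    bin_dir="$3"
    src="${examples_dir}/${script_name}"
    # Refuse to create a dangling link if the script is not where we expect it.
    if [ ! -f "$src" ]; then
        echo "Not found: $src" >&2
        return 1
    fi
    ln -sf "$src" "${bin_dir}/${script_name}"
}

# Example (in a Dockerfile RUN step; paths are assumptions):
#   link_littler_script installBioc.r "${R_HOME}/site-library/littler/examples" /usr/local/bin
```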

eddelbuettel commented 1 year ago

So I was intrigued to see how far r2u could get in helping, given its partial BioConductor support (and of course famously complete CRAN support). I fired up the eddelbuettel/r2u:jammy container (to be ported to Rocker "soon") and did

# first command an echo of yours, installs in a few (single) seconds
install.r  argparse ape dplyr tidyr BiocManager
# then I tried this, which came back with a loooong list of packages so I Ctrl-C'ed out
#Rscript -e 'bspm::disable(); BiocManager::install("sangeranalyseR")'
# instead this installed all available build-deps
# (I had edited the '' and , out of the return from the stopped attempt)
install.r sys bitops bit colorspace askpass zlibbioc RCurl GenomeInfoDbData bit64 blob memoise plogr isoband farver labeling munsell curl openssl BH fs rappdirs pixmap sp RcppArmadillo BiocGenerics S4Vectors IRanges XVector GenomeInfoDb crayon RSQLite DBI plyr fastmatch igraph quadprog gtable httpuv mime xtable fontawesome htmltools sourcetools later promises fastmap commonmark bslib cachem ellipsis ggplot2 scales httr viridisLite base64enc htmlwidgets RColorBrewer lazyeval crosstalk jquerylib anytime sass zip evaluate tinytex xfun yaml highr ade4 segmented bookdown Biostrings DECIPHER reshape2 phangorn sangerseqR gridExtra shiny shinydashboard shinyjs data.table plotly DT zeallot excelR shinycssloaders ggdendro shinyWidgets openxlsx rmarkdown knitr BiocStyle logger
# then I could just do -- which was quick
Rscript -e 'bspm::disable(); BiocManager::install("sangeranalyseR")'

Now all is good:

> library(sangeranalyseR)                                                                                
Loading required package: stringr                   
Loading required package: ape                       
Loading required package: Biostrings                                                                     
Loading required package: BiocGenerics
[.... lots and lots omitted ...]
Loading required package: logger
Welcome to sangeranalyseR
> 

It uses current packages, not the 'versioned' stack, so it may not be of interest to you. But we can get a lot of BioC installed quickly, which is still of interest to some.

nick-youngblut commented 1 year ago

Apparently, my attempt above does not work. I was able to install the sangeranalyseR package via R -e 'BiocManager::install("sangeranalyseR")', and the docker image build completed successfully. However, when I try to load the R package in my R script within the image, I get the error:

Error in library("sangeranalyseR") : 
  there is no package called ‘sangeranalyseR’

...so it appears that the bioconductor package is not installed in the correct libPath. My libPaths when calling the R script:

"/usr/local/lib/R/site-library"
"/usr/local/lib/R/library"

I cannot find the "installed" sangeranalyseR package anywhere in the docker image. The following returns nothing:

find / -iname "sangeranalyseR" 2> /dev/null

...and the package is definitely not in /usr/local/lib/R/site-library/.

The entire docker file that I'm using:

FROM ubuntu:20.04
FROM rocker/r-ver:4.2.1

# Install OS dependencies
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && \
    apt-get upgrade -y && \
    apt-get install -y \
      build-essential

# Install R dependencies
RUN install2.r --ncpus 2 --error \
    argparse ape dplyr tidyr ggplot2 purrr furrr data.table tidytable BiocManager && \
  R -e 'BiocManager::install("sangeranalyseR")' && \
  rm -rf /tmp/downloaded_packages

# CMD
CMD ["/bin/bash", "-c", "R --version"]

eddelbuettel commented 1 year ago

:man_shrugging:

What I showed you was real. I just used eddelbuettel/r2u:jammy as the base. It does not have those .libPaths(). If you are in a different environment you need to debug what is different.

(I also tried to throw a quick demo Dockerfile together (just as I had already done once today) but that balked, as @enchufa2 and I currently have an issue with bspm where it is not as smoothly falling over from some packages not in the repo. Your implied laundry list of packages is really long. It worked for me interactively; in building a Dockerfile it balked. Sorry. r2u is real though: I encourage you to play a little. We have 20k CRAN packages, and about 240 BioC. So you can go a long way.)

eddelbuettel commented 1 year ago

Well sure, if you use rocker/r-ver then none of this applies. I tried to say so in my first message.

cboettig commented 1 year ago

@nick-youngblut I suspect your installation isn't succeeding due to missing system libraries that you'll need to list in your Dockerfile (might be apt-get install -y zlib1g-dev libxml2-dev libglpk-dev). r2u does this magically :magic_wand: , but r-ver does not. You could use a more downstream image built on r-ver that includes more of these dependencies by default, though.

Recall that R does not throw an error when install.packages() fails; it only emits a warning. (Note that, like install2.r, installBioc.r provides the --error flag to alter this behavior, which is imperfect but usually best in Dockerfiles.)
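
Because of this, it can help to check explicitly that the package landed in one of the library directories before the build continues. A minimal sketch, assuming the library paths reported earlier in this thread (/usr/local/lib/R/site-library and /usr/local/lib/R/library); the function name r_pkg_installed is hypothetical and not part of littler or rocker.

```shell
#!/bin/sh
# Hypothetical check: succeed only if <pkg>/DESCRIPTION exists in one of the
# given library directories (every installed R package has a DESCRIPTION file).
# Arguments: package name, then one or more library directories.
r_pkg_installed() {
    pkg="$1"
    shift
    for lib in "$@"; do
        if [ -f "${lib}/${pkg}/DESCRIPTION" ]; then
            return 0
        fi
    done
    echo "R package '${pkg}' not found in: $*" >&2
    return 1
}

# Example (in a Dockerfile RUN step, after the install command):
#   r_pkg_installed sangeranalyseR /usr/local/lib/R/site-library /usr/local/lib/R/library || exit 1
```

A check like this makes the build stop at the layer where the install went wrong, rather than surfacing later as "there is no package called ..." at run time.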

cboettig commented 1 year ago

For instance, this Dockerfile works for me: (though it does take 330 seconds to build)

FROM rocker/verse

# Install R dependencies
RUN install2.r --ncpus 2 --error \
    argparse ape dplyr tidyr ggplot2 purrr furrr data.table tidytable BiocManager && \
  $R_HOME/site-library/littler/examples/installBioc.r --error sangeranalyseR && \
  rm -rf /tmp/downloaded_packages

nick-youngblut commented 1 year ago

Thanks @eddelbuettel and @cboettig for all of the help! ...and thanks @cboettig for test-building a dockerfile that works 🚀

@cboettig , is your use of $R_HOME/site-library/littler/examples/installBioc.r the current best-practice that I should include in my PR to update the docs?

though it does take 330 seconds to build

FYI: it took 1384 sec to build the RUN install2.r ... layer on my M1 macbook, and the image is 1845.62 MB

eddelbuettel commented 1 year ago

r2u does this magically

Not really. r2u relies on binaries and has them for all of CRAN. I did build sangeranalyseR from source because that one is not among the ~ 240 BioC binaries in r2u.

There are also some BioC folks already using / poking at r2u so you could ask on the BioC slack or lists too for best practices.

As for installBioc.r, I have several dozen scripts in that littler directory including half a dozen installation helpers. We don't promote all into the path but maybe should. Easy enough for you to add too.

eddelbuettel commented 1 year ago

So for completeness, now after dinner, with the following Dockerfile

FROM eddelbuettel/r2u:jammy

## depends per https://www.bioconductor.org/packages/release/bioc/html/sangeranalyseR.html
RUN install.r argparse stringr ape Biostrings DECIPHER reshape2 phangorn gridExtra \
    shiny shinydashboard shinyjs data.table plotly DT zeallot excelR shinycssloaders ggdendro \
    shinyWidgets openxlsx rmarkdown knitr seqinr BiocStyle logger BiocManager

## now our main target
RUN Rscript -e 'bspm::disable(); BiocManager::install(c("sangerseqR", "sangeranalyseR"))'

we install in 64 seconds.

eitsupi commented 1 year ago

FYI: it took 1384 sec to build the RUN install2.r ... layer on my M1 macbook, and the image is 1845.62 MB

Arm64 platform does not support binary installation of CRAN packages, so installation takes longer. https://rocker-project.org/images/versioned/r-ver.html#overview

nick-youngblut commented 1 year ago

Arm64 platform does not support binary installation of CRAN packages, so installation takes longer.

I ran the build for linux/amd64:

docker buildx build --push --platform linux/amd64 -t ${ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com/${IMAGE_NAME}:${IMAGE_VERSION} ${IMAGE_NAME}