Open Crosita opened 7 months ago
From discussion at 2024-03-21 team meeting:
@dgkf which one of these images do we want to fork/build from? imo r-minimal would be easiest to justify.
Another option could be a rocker based image
This is what I use for my regular work environment (obviously pruning out the radian/RS pieces for a non-interactive runner)
FROM rocker/r-ver:4.3.1
ENV S6_VERSION=v1.21.7.0
ENV RSTUDIO_VERSION=stable
ENV PATH=/usr/lib/rstudio-server/bin:$PATH
# install key dependencies of certain packages that could be installed later
# (the earlier RUN that called apt-get install with no package names was a no-op and is dropped)
RUN apt-get update \
    && export DEBIAN_FRONTEND=noninteractive \
    && apt-get -y install --no-install-recommends \
        apt-utils dialog libpq5 openssh-client openssh-server \
        wget libxml2-dev libsecret-1-dev libsodium-dev \
        libssl-dev imagemagick libmagick++-dev \
        libgit2-dev libssh2-1-dev zlib1g-dev librsvg2-dev \
        libudunits2-dev libfontconfig1-dev libfreetype6-dev \
        gdal-bin proj-bin libgdal-dev libproj-dev libgmp3-dev \
        libmpfr-dev libzmq3-dev cmake build-essential \
        glpk-utils libglpk-dev glpk-doc libtbb2 htop \
        libpoppler-cpp-dev curl libharfbuzz-dev libfribidi-dev \
        python3-setuptools python3-pip
# install tinytex
RUN wget -qO- "https://yihui.org/tinytex/install-bin-unix.sh" | sh
# install radian via python and pip3
RUN pip3 install radian
# Copy the updated rocker installation scripts
RUN wget -N -P /rocker_scripts/ https://raw.githubusercontent.com/rocker-org/rocker-versioned2/master/scripts/install_rstudio.sh
RUN wget -N -P /rocker_scripts/ https://raw.githubusercontent.com/rocker-org/rocker-versioned2/master/scripts/init_set_env.sh
RUN wget -N -P /rocker_scripts/ https://raw.githubusercontent.com/rocker-org/rocker-versioned2/master/scripts/init_userconf.sh
RUN wget -N -P /rocker_scripts/ https://raw.githubusercontent.com/rocker-org/rocker-versioned2/master/scripts/pam-helper.sh
RUN wget -N -P /rocker_scripts/ https://raw.githubusercontent.com/rocker-org/rocker-versioned2/master/scripts/install_pandoc.sh
# Set permissions on installation scripts
RUN chmod +x /rocker_scripts/install_rstudio.sh
RUN chmod +x /rocker_scripts/init_set_env.sh
RUN chmod +x /rocker_scripts/init_userconf.sh
RUN chmod +x /rocker_scripts/pam-helper.sh
RUN chmod +x /rocker_scripts/install_pandoc.sh
# Install rocker studio
RUN /rocker_scripts/install_rstudio.sh
RUN /rocker_scripts/install_pandoc.sh
EXPOSE 8787
CMD ["/init"]
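For the non-interactive runner mentioned above, a pruned variant might look roughly like the sketch below. It is only an illustration: the subset of system libraries is picked from the list above as an assumption about what a runner would need, and the default command is a placeholder.

```dockerfile
# Sketch only: a pruned, non-interactive variant of the Dockerfile above.
FROM rocker/r-ver:4.3.1

# System libraries for building common packages from source.
# Illustrative subset of the list above; trim or extend as needed.
RUN apt-get update \
    && DEBIAN_FRONTEND=noninteractive apt-get -y install --no-install-recommends \
        libxml2-dev libssl-dev libgit2-dev libssh2-1-dev zlib1g-dev \
        libfontconfig1-dev libfreetype6-dev libharfbuzz-dev libfribidi-dev \
        cmake build-essential curl \
    && rm -rf /var/lib/apt/lists/*

# No RStudio Server, radian, tinytex, or EXPOSE: the runner only executes scripts.
# Placeholder default command; override it at run time with the script to execute.
CMD ["R", "--version"]
```

Dropping the RStudio and radian layers also removes the need for the S6_VERSION, RSTUDIO_VERSION, and PATH settings.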
My only concern with r-minimal is that we’d have to juggle system libraries.
To avoid having to manage the governance of system libraries, I was thinking we might use something like debian-gcc-release, although I can’t tell just from the dockerhub page what system libraries that comes with. At least for our POC, I’m hoping we can defer to a community image that tries to replicate the CRAN systems as closely as possible. I think I remember reading that that was a goal of some of the R-Hub containers.
We’ll also probably want a Windows solution, since that’s the OS we know health authorities (namely the FDA) use.
Whether we want to manage a container that accommodates some industry objectives in the future remains to be seen. So far any feedback we’ve gotten about a community reference image has ranged from enthusiastic to, at worst, very vague reluctance (i.e. too dissimilar from what would be used within an organization). I’d prefer to avoid engineering too heavily around it until we can get more targeted feedback.
> This is what I use for my regular work environment
That’s awesome! I’ve been running a similar rocker/r-ver image as a drop-in replacement for my system R install for a few years too. Pre-rig it was an amazing resource for quickly testing on R-devel just by dropping into a different container. At some point I need to try out rig to see if I can be swayed away from my container setup.
Agree. I was poking around the other containers and finally got to the master template for debian. It has listed all the sys libs they install.
Also note they import debian:testing as their base layer, which is a curious choice :)
@dgkf if we do go with Docker Hub to host this, then we would need to open an org in that environment to map to a GitHub repo.
> Also note they import debian:testing as their base layer, which is a curious choice :)
Yes... we'd definitely want to change that at the very least.
Also check out the r-hub/containers repo. Just based on activity, I think development has moved here. I don't see a debian image, but they do have an ubuntu-gcc12 (that has a more sensible base layer of an LTS Ubuntu release), which is built with this packages.ubuntu-gcc12 set of system libraries.
There's a really nice dashboard of all the dependencies (OS, R capabilities, compilers, system libraries, etc) here, which is a pretty amazing asset for communicating reproducibility in our initial pilots if this is an acceptable direction.
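If the r-hub image is an acceptable direction, building on it could be sketched roughly as follows. The image reference is an assumption and should be checked against the r-hub/containers repo; the /opt/image-info path is a hypothetical location for recording what the base layer ships.

```dockerfile
# Sketch only. The image reference below is an assumption; confirm the
# published name in the r-hub/containers repo and pin a tag or digest
# instead of relying on "latest".
ARG BASE_IMAGE=ghcr.io/r-hub/containers/ubuntu-gcc12:latest
FROM ${BASE_IMAGE}

# Record what the base layer provides so it can be cited in pilot documentation.
# dpkg-query lists the installed system packages with their versions.
RUN mkdir -p /opt/image-info \
    && cp /etc/os-release /opt/image-info/os-release \
    && dpkg-query -W -f='${Package} ${Version}\n' > /opt/image-info/system-packages.txt
```

Keeping the base image in a build argument makes it easy to swap in a different community image without touching the rest of the file.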
@wiligl - Would be great to get your input
Great input! Please find my comments below acknowledging that I am less technically involved than others and, therefore, I may have misunderstood and misrepresented some concepts. Happy to discuss further!
The first "proof of concept" reference image should focus on a minimal set of validated ("approved") R packages. I would, therefore, use an existing user-friendly, cross-platform (Windows, Linux/Debian or Linux/Ubuntu) image that includes R and RStudio with fixed version numbers (not latest) for both. The image config file can then be adapted to install the minimal set of validated R packages. Using an RStudio/R image based on the RStudio Package Manager has the advantage of using binary packages, which do not require system libraries for building under Linux. Binary packages are also used for installation under Windows. If the R packages database (i.e., validated package info) and the R packages repository (i.e., package code and binaries) are available, the RStudio/R image just needs to set the default repository to our validated repository (at least for the packages concerned). At present, I would prefer a rocker image over an rhub image because I am more familiar with Docker and have tested the rocker/rstudio image myself.
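A minimal sketch of that idea, assuming a validated repository exists and is passed in at build time. VALIDATED_REPO and VALIDATED_PACKAGES are hypothetical build arguments, not existing wg deliverables, and the 4.3.1 tag is simply reused from the Dockerfile earlier in the thread.

```dockerfile
# Sketch only: fixed R/RStudio version plus a hypothetical validated repository.
FROM rocker/rstudio:4.3.1

# Hypothetical build arguments, supplied with --build-arg at build time.
ARG VALIDATED_REPO
ARG VALIDATED_PACKAGES=""

# Fail early if no repository URL was supplied.
RUN test -n "${VALIDATED_REPO}"

# Make the validated repository the default for every R session by
# appending to the site-wide profile ("R RHOME" prints the R home directory).
RUN echo "options(repos = c(VALIDATED = '${VALIDATED_REPO}'))" \
    >> "$(R RHOME)/etc/Rprofile.site"

# Install the space-separated package list from the default repository.
RUN Rscript -e 'pkgs <- scan(text = Sys.getenv("VALIDATED_PACKAGES"), what = "", quiet = TRUE); if (length(pkgs)) install.packages(pkgs)'
```

Because the option is appended after anything the base image writes to Rprofile.site, it takes precedence over the image's own repository setting.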
Regarding the questions raised above:
Hope this helps!
Refs:
[1] https://rocker-project.org/use/extending.html "Posit Package Manager (Formerly “RStudio Package Manager”, RSPM) provides binary R packages for specific Linux distributions"
[2] https://learn.microsoft.com/en-us/windows/wsl/about
This tutorial describes how to install RStudio Server on a Windows 11 host system by installing it as a Docker image in Ubuntu 22.04 LTS, running as a guest system in a virtual machine on the Windows Subsystem for Linux (WSL2). This way the complete functionality of the Linux operating system will be available for RStudio in the web browser (e.g., MS Edge) under Windows 11.
There were a few interesting presentations at PharmaSUG 2024, two in particular from members of the R Submission wg. One important bit from Pilot 4 is that, together with the FDA, they want to try a container for submissions containing R. The container would have either shiny (reviewers app) or Workbench, plus whatever packages and content are part of the sponsor submission.
A second interesting rumor (I have not seen any confirmation) was that the FDA would accept Docker rather than the original idea of Podman.
This raises the interesting question of whether our wg image can be the standard image for use in submissions. From an industry perspective, I see some direct benefits.
I use the term validated in quotes simply because it is each individual organization that uses the R packages that decides what validated means in its context (ICH definition of validated). If our wg approach is to provide independent documentation that organizations can accept as documented evidence for their needs, the wg deliverables could greatly reduce the burden on each organization. Add to that packages validated for a container that can be used as part of a submission, and that in itself is the golden reference we have been discussing.
Considering the above, I would like to add to @wiligl's comments.
Image 1 is really only created when the wg adopts a new R release. Image 2 is used in standard workflows and Image 3 is on a release schedule. All uses of the images can be automated.
HTH
References
Piloting into the Future: Publicly available R-based Submissions to the FDA (https://www.pharmasug.org/proceedings/2024/SS/PharmaSUG-2024-SS-344.pdf)
Experimenting with Containers and webR for Submissions to FDA in the Pilot 4 (https://www.pharmasug.org/proceedings/2024/SS/PharmaSUG-2024-SS-376.pdf)
@mmengelbier The stacked image approach makes sense and follows the rocker strategy. However, maintaining multiple images will add work, so I am wondering whether image 1+2+3 as one big image would suffice.
@wiligl, technically the end user usually only sees the "last" image that represents the functionality that they want. It really comes down to efficiency and documentation, and I am fully assuming that with any image and package "validation" we are intending to share documentation that could be used by Sponsors and agencies as evidence of compliance. If we do not share that documentation, we are not really adding any benefit because each Sponsor and agency would have to do that anyway.
The rocker approach makes sense because there are different "end-user" images for R, i.e. Workbench, shiny, etc. So build R once and use everywhere.
If we perform all tasks within one big image, that would mean we would recreate all the documentation every time: the build documentation, any "validation" documentation, and the documentation that would qualify the install. We would also rerun some time-consuming tasks that may be unnecessary. As we would also be performing qualified installs (Installation Qualification/IQ) and validation in the same run, the resulting documentation would have to clearly demonstrate the different tasks.
If we instead create Image 1 as a minimal image with just R, this would only be done once a release is adopted. The build is also very easy to document, as it would only have to document the build steps and result.
Image 2 is Image 1 with the wg tools to test. Given that wg + community tooling will most probably be updated more often than R, any subsequent image used in submissions should not retain the wg tools. This is also probably just a simple IQ.
Image 2 is then used as the source to perform and document testing for each individual package or package cohort. There would be no need to document the build, as it was already completed in generating Image 1, or the install of the wg tools, as that is covered by building Image 2.
If the wg agrees, Image 1 or 2 can then be used to create an Image 3 where all the "validated" packages are installed along with interfaces such as Workbench and/or shiny. It could make sense to do Image 3-workbench, Image 3-shiny and Image 3-workbench+shiny, knowing that Image 3 is literally automated installs with corresponding IQ documentation. Whether to use Image 1 or 2 would come down to whether wg tools and utilities would be included in Image 3+.
From experience, it is actually far easier and much more efficient to do multiple images rather than one big image. From a compliance point of view, I personally advocate for multiple images with simplified documentation rather than a more complex single large image.
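The stacked approach described above could be sketched as build targets in a single Dockerfile. This is an assumption-laden illustration: the stage names, the /opt/iq output path, and the WG_TOOLS and VALIDATED_PACKAGES build arguments are all hypothetical, and the package lists written out are only a stand-in for real IQ documentation.

```dockerfile
# Sketch only: three stacked images expressed as build targets.
# Build one with, e.g.: docker build --target image1 -t <tag> .

# Image 1: R only. Rebuilt when the wg adopts a new R release.
FROM rocker/r-ver:4.3.1 AS image1
RUN mkdir -p /opt/iq \
    && R --version > /opt/iq/image1-r-version.txt

# Image 2: Image 1 plus the wg test tooling (hypothetical package list).
FROM image1 AS image2
ARG WG_TOOLS=""
RUN Rscript -e 'pkgs <- scan(text = Sys.getenv("WG_TOOLS"), what = "", quiet = TRUE); \
    if (length(pkgs)) install.packages(pkgs); \
    write.csv(installed.packages()[, c("Package", "Version")], "/opt/iq/image2-packages.csv", row.names = FALSE)'

# Image 3: built from Image 1 here so the wg tools are not retained.
# Use "FROM image2" instead if the wg decides to keep them.
FROM image1 AS image3
ARG VALIDATED_PACKAGES=""
RUN Rscript -e 'pkgs <- scan(text = Sys.getenv("VALIDATED_PACKAGES"), what = "", quiet = TRUE); \
    if (length(pkgs)) install.packages(pkgs); \
    write.csv(installed.packages()[, c("Package", "Version")], "/opt/iq/image3-packages.csv", row.names = FALSE)'
# An interface layer (Workbench and/or shiny) would be added here,
# one target per variant.
```

Each target only documents its own step, so rebuilding Image 3 on a release schedule does not require redoing or re-documenting the Image 1 build.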
HTH