pharmaR / regulatory-r-repo-wg

Package consensus for regulated industries
https://pharmar.github.io/regulatory-r-repo-wg

REQ Reference images #77

Open Crosita opened 7 months ago

Crosita commented 7 months ago
dgkf commented 7 months ago

From discussion at 2024-03-21 team meeting:

yonicd commented 7 months ago

@dgkf which one of these images do we want to fork/build from? imo r-minimal would be easiest to justify.

yonicd commented 7 months ago

Another option could be a rocker based image

This is what I use for my regular work environment (obviously pruning out the radian/RS pieces for a non-interactive runner)

FROM rocker/r-ver:4.3.1

ENV S6_VERSION=v1.21.7.0
ENV RSTUDIO_VERSION=stable
ENV PATH=/usr/lib/rstudio-server/bin:$PATH

# install key dependencies of certain packages that could be installed later
RUN apt-get update \
    && export DEBIAN_FRONTEND=noninteractive \
    && apt-get -y install --no-install-recommends \
    apt-utils dialog libpq5 openssh-client openssh-server \
    wget libxml2-dev libsecret-1-dev libsodium-dev \
    libssl-dev imagemagick libmagick++-dev \
    libgit2-dev libssh2-1-dev zlib1g-dev librsvg2-dev \
    libudunits2-dev libfontconfig1-dev libfreetype6-dev \
    gdal-bin proj-bin libgdal-dev libproj-dev libgmp3-dev \
    libmpfr-dev libzmq3-dev cmake build-essential \
    glpk-utils libglpk-dev glpk-doc libtbb2 htop \
    libpoppler-cpp-dev curl libharfbuzz-dev libfribidi-dev \
    python3-setuptools python3-pip

# install tinytex
RUN wget -qO- "https://yihui.org/tinytex/install-bin-unix.sh" | sh

# install radian via python and pip3
RUN pip3 install radian

# Copy the updated rocker installation scripts
RUN wget -N -P /rocker_scripts/ https://raw.githubusercontent.com/rocker-org/rocker-versioned2/master/scripts/install_rstudio.sh
RUN wget -N -P /rocker_scripts/ https://raw.githubusercontent.com/rocker-org/rocker-versioned2/master/scripts/init_set_env.sh
RUN wget -N -P /rocker_scripts/ https://raw.githubusercontent.com/rocker-org/rocker-versioned2/master/scripts/init_userconf.sh
RUN wget -N -P /rocker_scripts/ https://raw.githubusercontent.com/rocker-org/rocker-versioned2/master/scripts/pam-helper.sh
RUN wget -N -P /rocker_scripts/ https://raw.githubusercontent.com/rocker-org/rocker-versioned2/master/scripts/install_pandoc.sh

# Set permissions on installation scripts
RUN chmod +x /rocker_scripts/install_rstudio.sh
RUN chmod +x /rocker_scripts/init_set_env.sh
RUN chmod +x /rocker_scripts/init_userconf.sh
RUN chmod +x /rocker_scripts/pam-helper.sh
RUN chmod +x /rocker_scripts/install_pandoc.sh

# Install RStudio Server and pandoc using the rocker scripts
RUN /rocker_scripts/install_rstudio.sh
RUN /rocker_scripts/install_pandoc.sh

EXPOSE 8787

CMD ["/init"]
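
For a non-interactive runner, the pruning mentioned above could look roughly like the sketch below. It is only an illustration under stated assumptions: the system library subset is picked arbitrarily from the list above, and the final command is a placeholder rather than an agreed entry point.

```dockerfile
# Hypothetical non-interactive variant of the image above: same pinned base,
# but without radian, RStudio Server, or the /init process.
FROM rocker/r-ver:4.3.1

# Illustrative subset of the system libraries listed above; the right set
# depends on which packages the runner has to build.
RUN apt-get update \
    && export DEBIAN_FRONTEND=noninteractive \
    && apt-get -y install --no-install-recommends \
    libxml2-dev libssl-dev libgit2-dev zlib1g-dev curl \
    && rm -rf /var/lib/apt/lists/*

# No EXPOSE and no CMD ["/init"]: the container runs one command and exits.
# Placeholder command; a real runner would call its own script here.
CMD ["Rscript", "--version"]
```
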
dgkf commented 7 months ago

My only concern with r-minimal is that we’d have to juggle system libraries.

To avoid having to manage the governance of system libraries, I was thinking we might use something like debian-gcc-release, although I can’t tell just from the dockerhub page what system libraries that comes with. At least for our POC, I’m hoping we can defer to a community image that tries to replicate the CRAN systems as closely as possible. I think I remember reading that that was a goal of some of the R-Hub containers.

We’ll also probably want a Windows solution, since that’s the OS we know health authorities (namely the FDA) use.

Whether we want to manage a container that accommodates some industry objectives in the future remains to be seen. So far any feedback we’ve gotten about a community reference image has ranged from enthusiastic to, at worst, very vague reluctance (i.e., too dissimilar from what would be used within an organization). I’d prefer to avoid engineering too heavily around it until we can get more targeted feedback.

> This is what I use for my regular work environment

That’s awesome! I’ve been running a similar rocker/r-ver image as a drop-in replacement for my system R install for a few years too. Pre-rig it was an amazing resource for quickly testing on R-devel just by dropping into a different container. At some point I need to try out rig to see if I can be swayed away from my container setup.

yonicd commented 7 months ago

Agree. I was poking around the other containers and finally got to the master template for debian. It lists all the system libraries they install.

Also note they import debian:testing as their base layer, which is a curious choice :)

yonicd commented 7 months ago

@dgkf if we do go with Docker Hub to host this, then we would need to open an org there to map to a GitHub repo.

dgkf commented 7 months ago

> Also note they import debian:testing as their base layer, which is a curious choice :)

Yes... we'd definitely want to change that at the very least.

Also check out the r-hub/containers repo. Just based on activity, I think development has moved there. I don't see a debian image, but they do have an ubuntu-gcc12 image (with a more sensible base layer of an LTS Ubuntu release), which is built with this packages.ubuntu-gcc12 set of system libraries.

There's a really nice dashboard of all the dependencies (OS, R capabilities, compilers, system libraries, etc) here, which is a pretty amazing asset for communicating reproducibility in our initial pilots if this is an acceptable direction.

Crosita commented 6 months ago

@wiligl - Would be great to get your input

wiligl commented 6 months ago

Great input! Please find my comments below acknowledging that I am less technically involved than others and, therefore, I may have misunderstood and misrepresented some concepts. Happy to discuss further!

The first "proof of concept" reference image should focus on a minimal set of validated ("approved") R packages. I would, therefore, use an existing user-friendly, cross-platform (Windows, Linux/Debian or Linux/Ubuntu) image including R and RStudio with fixed version numbers (not latest) for both. The image config file can then be adapted to install the minimal set of validated R packages. Using an RStudio/R image based on the RStudio Package Manager has the advantage of using binary packages, which do not require system libraries for building under Linux. Binary packages are also used for installation under Windows. If the R packages database (i.e., validated package info) and the R packages repository (i.e., package code and binaries) are available, the RStudio/R image just needs to set the default repository to our validated repository (at least for the concerned packages). At present, I would prefer a rocker image over an rhub image because I am more familiar with Docker and have tested the rocker/rstudio image myself.

Regarding the questions raised above:

  1. Operating system:
    • Windows 10 Enterprise edition: seems to be the current corporate and regulatory standard. Please note that Windows 10/11 also includes the Windows Subsystem for Linux (WSL) [2], which may allow running original Linux applications (not tested). If so, our group may only need to provide a Linux image.
    • Debian 12 ("Bookworm", LTS): assuming that Debian-based distributions, such as Ubuntu, will be compatible.
  2. System libraries: This is a hard problem for which there is no perfect, validated solution. It seems the RStudio Package Manager works with community-maintained rules to install required system dependencies. It seems basing the package repository on binary packages will avoid the problem of system libraries. If building from source is preferred, I would use existing solutions like the RStudio Package Manager.
  3. Virtualization: Docker technology seems to be the convention.
  4. Image as Code vs Binary: At the moment, I would prefer code, which is easier to distribute and maintain (but with installation of binary packages).
  5. Rocker vs Rhub images: I prefer the rocker/rstudio image, which I have tested. Rhub not tested so far. Please note that RStudio Server installations with Docker may cause conflicts on port number 8787 with existing RStudio Server installations.
  6. Set of packages: Minimal for starters. I would set the package repo to the group's pharma package repo as default, so the user can decide which packages to install.
  7. Number and type of images: This will be a compromise between the resources of the group and the target audience. Development of R is clearly driven in a Linux environment, but Windows is most commonly used. I would use Debian 12 LTS as base image and would test whether it can also be used with the Windows Subsystem for Linux (WSL). If not, a Windows 10 Enterprise image will be required.
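
As a rough illustration of points 5 and 6, a pinned rocker/rstudio image could be pointed at a validated repository along the lines of the sketch below. The repository URL is a made-up placeholder (no such repository is named in this thread), and appending to Rprofile.site is just one possible way to set the default.

```dockerfile
# Fixed version tag rather than "latest".
FROM rocker/rstudio:4.3.1

# Hypothetical URL standing in for the group's validated package repository.
ARG VALIDATED_REPO=https://example.org/validated-repo

# Append a repos option to the site-wide R profile so install.packages()
# consults the validated repository first and keeps the image's existing
# repositories as fallback.
RUN echo "options(repos = c(validated = '${VALIDATED_REPO}', getOption('repos')))" \
    >> "$(R RHOME)/etc/Rprofile.site"
```

If port 8787 is already taken on the host, publishing the container on a different host port (for example `docker run -p 8788:8787 ...`) should avoid the conflict mentioned in point 5.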

Hope this helps!

Refs:
[1] https://rocker-project.org/use/extending.html "Posit Package Manager (Formerly “RStudio Package Manager”, RSPM) provides binary R packages for specific Linux distributions"
[2] https://learn.microsoft.com/en-us/windows/wsl/about

wiligl commented 6 months ago

This tutorial describes how to install RStudio Server on a Windows 11 host system by installing it as a Docker image in Ubuntu 22.04 LTS, which runs as a virtual machine (guest system) on the Windows Subsystem for Linux (WSL2). This way the complete functionality of the Linux operating system will be available for RStudio in the web browser (e.g., MS Edge) under Windows 11.

Link: https://wilmarigl.de/?p=911

mmengelbier commented 5 months ago

There were a few presentations at PharmaSUG 2024 that were interesting. Two in particular from members of the R Submission wg. One important bit from Pilot 4 is that, together with the FDA, they want to try a container for submissions containing R. The container would either have shiny (reviewers app) or Workbench plus whatever packages and content that is part of the sponsor submission.

A second interesting rumor (not seen any confirmation) was that the FDA would accept Docker rather than the original idea of podman.

This raises the interesting question of whether our wg image could be the standard image for use in submissions. From an industry perspective, I see some direct benefits.

  1. There is one industry community standard container image as base for submissions that could be agreed with FDA and other regulatory authorities.
  2. Our wg concept of validated packages could be included in the standard image, fully keeping in mind that there will be unused packages and there will be packages missing. But the reduced additional effort would benefit everyone.
  3. The list of wg “validated” packages could be based on contributor submissions instead of us in the wg trying to predict. That would especially be true for stats packages. But, I think we could all produce a list of around 200 packages we would expect.

I use the term validated in quotes simply because it is each individual organization using the R packages that decides what validated means in its context (ICH definition of validated). If our wg approach is to provide independent documentation that organizations can accept as documented evidence for their needs, the wg deliverables could greatly reduce the burden on each organization. Add to that the fact that the packages are validated for a container that can be used as part of a submission, and that in itself is the golden reference we have been discussing.

Considering the above, I would like to add to @wiligl comments.

  1. Looking across organizations, definitely Windows something and Linux. Unfortunately, Linux is a bit of a challenge. Over the last 6 months I have seen a convergence on Ubuntu and Rocky, but I can also see an argument for Debian. Given that most Sponsors are on Linux with a few on Windows, I think this will be a natural discussion at some point.
  2. There is no way to avoid system dependencies on packages. They will always be tied to the package level for quite some time to come. Also, keeping around unused system libraries has been a cause for producing incorrect results.
  3. See above note on the FDA perspective.
  4. From experience, source on Linux and binary installs on Windows. I would prefer source on both to avoid the headache of toolchain issues, which can be a pain to spot and resolve.
  5. Given the need for documented builds of an image if we want to support the R submission wg, it is probably for our wg or a separate wg to do validated builds.
  6. As noted above, I agree to start small. We could divide packages into two lists: those that are lower risk and would only need the tests provided with the package to run (thinking of the tidyverse packages), and those more critical packages where additional testing may be required. The latter is a bit sticky as it usually requires input from experts.
  7. The number of images would be at least 3 if we use the following strategy.
    • Image 1 is a qualified build with R, with qualified simply meaning the build process is well documented and reproducible
    • Image 2 is based on image 1 and contains the wg tooling to perform “validation” runs on packages. Image 2 is used as the “runtime” to validate a package, generate its documentation, etc.
    • Image 3 contains wg validated packages with the addition of shiny and workbench, both using qualified installs along with the principles of image 1.

Image 1 is really only created when the wg adopts a new R release. Image 2 is used in standard workflows and Image 3 is on a release schedule. All uses of the images can be automated.
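
One way to picture the three-image strategy is a multi-stage Dockerfile, sketched below under loud assumptions: the stage names are invented for illustration, rocker/r-ver merely stands in for a qualified R build, and the packages installed in each stage are placeholders, not an agreed wg selection.

```dockerfile
# Image 1: the R build. A real "qualified" build would document its own build
# steps; the pinned community image here is only a stand-in.
FROM rocker/r-ver:4.3.1 AS image1

# Image 2: Image 1 plus the tooling used to run package "validation".
# "remotes" is a placeholder for whatever wg tooling is chosen.
FROM image1 AS image2
RUN Rscript -e 'install.packages("remotes")'

# Image 3: the end-user image with "validated" packages and interfaces.
# It is based on image1 here so that the tooling from image2 is not carried over.
FROM image1 AS image3
RUN Rscript -e 'install.packages("shiny")'
```

Each image can then be built separately with `docker build --target image1 .` (or `image2`, `image3`), so Image 1 only has to be rebuilt when the R version changes.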

HTH

References:
Piloting into the Future: Publicly available R-based Submissions to the FDA (https://www.pharmasug.org/proceedings/2024/SS/PharmaSUG-2024-SS-344.pdf)
Experimenting with Containers and webR for Submissions to FDA in the Pilot 4 (https://www.pharmasug.org/proceedings/2024/SS/PharmaSUG-2024-SS-376.pdf)

wiligl commented 5 months ago

@mmengelbier The stacked image approach makes sense and follows the rocker strategy. However, maintaining multiple images will add work; I am wondering whether images 1+2+3 as one big image would suffice.

mmengelbier commented 5 months ago

@wiligl, technically the end user usually only sees the "last" image that represents the functionality that they want. It really comes down to efficiency and documentation, and I am fully assuming that with any image and package "validation" we are intending to share documentation that could be used by Sponsors and agencies as evidence of compliance. If we do not share that documentation, we are not really adding any benefit because each Sponsor and agency would have to do that anyway.

The rocker approach makes sense because there are different "end-user" images for R, i.e. Workbench, shiny, etc. So build R once and use everywhere.

If we perform all tasks within one big image, that would mean we would recreate all the documentation every time: the build documentation, any "validation" documentation, and the documentation that would qualify the install. We would also rerun some time-consuming tasks that may be unnecessary. As we would also be performing qualified installs (Installation Qualification/IQ) and validation in the same run, the resulting documentation would have to clearly demonstrate the different tasks.

If we instead create Image 1 as a minimal image with just R, this would only be done once a release is adopted. The build is also very easy to document as it would only have to document the build steps and result.

Image 2 is Image 1 with the wg tools to test. Given that wg + community tooling will most probably be updated more often than R, any subsequent image used in submissions should not retain the wg tools. This is also probably just a simple IQ.

Image 2 is then used as the source to perform and document testing for each individual package or package cohort. There would be no need to document the build, as that was already completed in generating Image 1, or the installation of the wg tools, as that is covered by building Image 2.

If the wg agrees, Image 1 or 2 can then be used to create an Image 3 where all the "validated" packages are installed along with interfaces such as Workbench and/or shiny. It could make sense to do Image 3-workbench, Image 3-shiny and Image 3-workbench+shiny, knowing that Image 3 is literally automated installs with corresponding IQ documentation. Whether to use Image 1 or 2 would come down to whether wg tools and utilities would be included in Image 3+.

From experience, it is actually far easier and much more efficient to do multiple images rather than one big image. From a compliance point of view, I personally advocate for multiple images with simplified documentation rather than the more complex single large images.

HTH