rocker-org / rocker

R configurations for Docker
https://rocker-project.org
GNU General Public License v2.0

Alternative Methods of Extending Images? #550

Open emstruong opened 7 months ago

emstruong commented 7 months ago

Hello,

I was looking at the guide for extending Rocker images and noticed that apt-get update and apt-get install... are necessary for installing certain packages. I was wondering whether it would be feasible for one or more Rocker images to simply have all the possible system libraries already installed?

Please correct me if I'm wrong, but my understanding is that requiring apt-get operations in the Dockerfile risks making the Docker images non-reproducible if the latest version of a system library changes in some way. And if the system library changes in a major way, then the non-reproducibility could be quite severe...

eddelbuettel commented 7 months ago

One usage pattern is to build (or extend) a container at the time you want to snapshot, tag it appropriately, and then in the future access that container image, instead of hoping to remake it identically in the future.
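The snapshot pattern described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the image name `myproject`, the date-based tag scheme, and the `run` helper are all hypothetical and not from this thread. The script prints the commands by default and only executes them when `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Sketch: build once, tag with the snapshot date, and archive the image
# instead of relying on rebuilding it identically later.
set -e

IMAGE="myproject"   # hypothetical image name
SNAPSHOT_DATE="${SNAPSHOT_DATE:-$(date +%Y-%m-%d)}"
SNAPSHOT_TAG="${IMAGE}:${SNAPSHOT_DATE}"
ARCHIVE="${IMAGE}_${SNAPSHOT_DATE}.tar"

# Print each command by default; set DRY_RUN=0 to actually execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Build from the Dockerfile in the current directory and tag the result.
run docker build -t "$SNAPSHOT_TAG" .

# Write the tagged image to a tar archive that can be stored with the project.
run docker save -o "$ARCHIVE" "$SNAPSHOT_TAG"

# Later, restore and use the stored image rather than rebuilding:
#   docker load -i "$ARCHIVE"
#   docker run --rm -it "$SNAPSHOT_TAG"
```

Pushing the tagged image to a registry would serve the same purpose as the tar archive; which one fits better depends on how long the image needs to be kept and where.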

emstruong commented 7 months ago

> One usage pattern is to build (or extend) a container at the time you want to snapshot, tag it appropriately, and then in the future access that container image, instead of hoping to remake it identically in the future.

While true, my feeling is that long-term storage of container images could be non-trivial, whereas most people can be expected to keep the Dockerfile (+ renv) that they use.

Is there a warning that images should be stored or else reproducibility may be broken in the future?

cboettig commented 7 months ago

Rocker's "versioned stack" builds on Ubuntu LTS for all system libraries, while rocker/r-base builds on the rolling debian:testing. Please note that this results in substantial differences in how system libraries behave. A rolling tag like "testing" in Debian implies that system libraries are being regularly updated. This is entirely different from how system libraries are updated in codenamed releases of almost any Linux distro, including Debian or Ubuntu LTS releases. These releases receive "patches", i.e. bug fixes and security patches, but not "new features": they don't receive major new versions of software. Ubuntu 22.04 is going to have the same base version of gcc that it came out with in April 2022, and it will be at that version for the next 10 years.

Obviously there is some wiggle room, in that what is a bug fix to one user may be a breaking change to another, but by and large the ability to provide stable distributions of software that do not create breaking changes is precisely what Linux distributions have done for some three decades now, and they are pretty good at it. The difference between a "patch" that isn't expected to break anything and an "update" is essentially the whole reason there are Linux releases.

(I think this concept has become somewhat lost, as most users today are familiar with dependency management from the perspective of CRAN, PyPI, or conda, where there is no notion of a "distribution" and individual packages are updated on a constantly rolling basis. It is worth noting that Bioconductor does follow the Linux release model, where all packages in Bioconductor update their version at precisely the same date with each twice-yearly release. When a maintainer submits an update, it isn't sent to the community as soon as it is approved; it is instead "scheduled" for the next release.)

This isn't the same as being bitwise frozen: the release will still receive "security patches". So for those whose definition of 'reproducible' is "I need to be able to reproduce the precise behavior, including vulnerabilities and bugs", the only solution is to store the image and never rebuild it.

Note that including 'every possible system library' would not only be unreasonably large, it is also not technically possible. For instance, the repositories frequently contain different versions of software that are incompatible, so installing one system library can cause another to be uninstalled.

eitsupi commented 7 months ago

If you need an image that includes almost all system libraries, you could use the R-universe or R-hub builder images. If you pin these by hash (the image digest) rather than by tag, you should get the same versions of the system libraries every time.
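To make the "hash, not tag" advice concrete, here is a minimal sketch of a check that flags image references that are not pinned by digest. The function names `is_pinned` and `check_dockerfile` are hypothetical helpers written for this illustration, and the FROM parsing is deliberately naive: it assumes the image is the second word on the line, so lines using options such as `--platform` are not handled.

```shell
#!/bin/sh
# The digest of an image you have already pulled can be looked up with, e.g.:
#   docker inspect --format '{{index .RepoDigests 0}}' IMAGE
# and then used in a Dockerfile as:  FROM name@sha256:<digest>

# Succeed if the reference is pinned by digest (name@sha256:...),
# fail if it relies on a mutable tag.
is_pinned() {
  case "$1" in
    *@sha256:?*) return 0 ;;
    *)           return 1 ;;
  esac
}

# Report every FROM line in a Dockerfile whose image is not digest-pinned.
# Returns non-zero if at least one such line is found.
check_dockerfile() {
  status=0
  while read -r keyword image _rest; do
    case "$keyword" in
      FROM|from)
        if ! is_pinned "$image"; then
          echo "not pinned: $image"
          status=1
        fi
        ;;
    esac
  done < "$1"
  return "$status"
}
```

Running `check_dockerfile Dockerfile` in a project directory would then list any base images that could silently change between builds.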

emstruong commented 7 months ago

@cboettig That makes sense to me. So if I understand what you're saying, for Rocker's versioned stack it's a mix of likely not a concern and not possible to solve, right?

Either way, maybe the documentation could be updated with a note regarding why storing the image might be necessary?