nuest / ten-simple-rules-dockerfiles

Ten Simple Rules for Writing Dockerfiles for Reproducible Data Science
https://doi.org/10.1371/journal.pcbi.1008316
Creative Commons Attribution 4.0 International

Comments about rule 2: "Build upon existing images" #97

Open sdettmer opened 2 years ago

sdettmer commented 2 years ago

comments about rule 2: "Build upon existing images"

From my point of view this is a clear no-go. Normally we cannot know the state of an existing image – there have been cases where people manually intercepted build processes or even manually unpacked, patched, and repacked distributions.

The state of an image cannot be reliably verified except by building it yourself, which essentially means: do not use existing images, but build everything yourself from locally available data.

vsoch commented 2 years ago

Again, I will kindly disagree. There is a core set of base images (e.g., centos, ubuntu) provided by the primary maintainers that are updated with security patches, and using them is much better practice than "rolling your own", which at best would produce the same thing.

At least for Singularity recipes I have a small plot: https://singularityhub.github.io/singularity-catalog/bases/ and we can see this practice is followed.
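For what it's worth, the two positions are not mutually exclusive: an official base image can be pinned by its content digest instead of a mutable tag, so a later rebuild either gets exactly the same base layer or fails loudly. A minimal sketch (the digest below is a placeholder, not a real one):

```dockerfile
# Pinning by immutable digest instead of the mutable tag "ubuntu:22.04".
# The digest here is a placeholder; obtain the real one after pulling with
#   docker inspect --format '{{index .RepoDigests 0}}' ubuntu:22.04
FROM ubuntu@sha256:0000000000000000000000000000000000000000000000000000000000000000
```

This does not guarantee the registry will serve that digest forever, but it does guarantee you will never silently get a different base than the one you tested.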

sdettmer commented 2 years ago

@vsoch Thank you for your quick reply.

I'm afraid you only believe that these images are reproducible; in fact they might have been changed (such as by adding security packages) or were built using apt install (picking up whatever happened to be available that day). If you build the same Dockerfile, you might get different results, such as a security-fixed package (for a flaw impossible to exploit in your environment) that introduces a small new bug (breaking your application). Either the input is guaranteed to be exactly the same, or the build is not reproducible.
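To illustrate the point (package name and version below are hypothetical): even pinning exact package versions in the Dockerfile only narrows the problem, because distribution archives eventually drop superseded versions, so a local mirror or archived package set is still needed for a byte-identical rebuild:

```dockerfile
FROM ubuntu:22.04

# Pinning an exact version constrains what "apt install" resolves to,
# but this version string is hypothetical and will eventually vanish
# from the archive once it is superseded; without a local mirror the
# build then simply fails instead of silently drifting.
RUN apt-get update && apt-get install -y --no-install-recommends \
        curl=7.81.0-1ubuntu1 \
    && rm -rf /var/lib/apt/lists/*
```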

Of course there are other requirements, such as updating to include security fixes, and surely in many cases the old results will never need to be reproduced. But when, for example, someone wants to verify in ten years why a result was incorrect, exactly the same content is needed, bit for bit - maybe a well-hidden bug somewhere led to the wrong result.

Of course reproducibility has a price, and often it is high. For example, when using images from external maintainers, every image used must be stored locally.
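Storing images locally is straightforward in practice; a sketch using standard docker CLI commands (image name and filenames are illustrative):

```shell
# Archive the exact image used, so it can be reloaded years later
# without depending on the registry still serving the same bytes.
docker pull ubuntu:22.04
docker save ubuntu:22.04 -o ubuntu-22.04.tar
sha256sum ubuntu-22.04.tar > ubuntu-22.04.tar.sha256

# Later: verify the archive is untampered, then restore it.
sha256sum -c ubuntu-22.04.tar.sha256
docker load -i ubuntu-22.04.tar
```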

Let's assume an officially maintained image was attacked and contained a backdoor. This backdoor leads to a wrong result of the container operation and to an invalid conclusion in some research. To analyze whether the invalid conclusion was caused by bad scientific practice or even data manipulation, someone could redo the processing. In the meantime the maintainers have surely removed the backdoor - of course they have, what else could be expected. By this, the cause of the wrong result is removed and the container now produces the correct result, different from before (i.e. not reproducing the original), and the researcher may get into trouble because some may think the invalid conclusion was fabricated to look better in publications.

vsoch commented 2 years ago

I don’t actually care if they are perfectly reproducible - it’s almost guaranteed they are slightly different. However, if my supply chain is secure (a work in progress, but registries will care soon, with SBOMs etc.) and my container is tested and works as I need it to, that is a successful outcome.