nuest / ten-simple-rules-dockerfiles

Ten Simple Rules for Writing Dockerfiles for Reproducible Data Science
https://doi.org/10.1371/journal.pcbi.1008316
Creative Commons Attribution 4.0 International

comments about rule 8: "Make the image one-click runnable" #103

Open sdettmer opened 2 years ago

sdettmer commented 2 years ago

comments about rule 8: "Make the image one-click runnable"

To be reproducible, the exact data needs to be part of the image. It could then simply be processed during the build, which is what reproducibility is all about: it does not matter when something is built or processed, the result is always exactly the same. If someone still needs to click, it is not maximum automation, and if there is something to click on, the environment is probably not well suited for reproducible builds. So it is possible to automate until no clicks are needed at all, because the results are already there; this is most reproducible. For new data, a new image (or result) can be created automatically.
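The approach described above can be sketched as a Dockerfile that copies the exact input data in and runs the analysis at build time, so the finished image already contains the results (all image, script, and path names here are hypothetical):

```dockerfile
# Hypothetical sketch: bake data, code, and results into the image.
FROM python:3.11-slim

# Copy the exact input data and the analysis code into the image.
COPY data/ /analysis/data/
COPY run_analysis.py /analysis/

# Run the analysis during `docker build`, so the results are already
# inside the image and nothing needs to be clicked or run afterwards.
WORKDIR /analysis
RUN python run_analysis.py --input data/ --output results/
```

With this layout, `docker build` itself is the reproducible computation; anyone who rebuilds from the same Dockerfile and data obtains the same `results/` directory.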

vsoch commented 2 years ago

This is what workflow managers are for, for which many use containers. This paper is scoped to just talking about containers.

sdettmer commented 2 years ago

@vsoch Thank you for your quick reply. The document is called "Ten Simple Rules for Writing Dockerfiles for Reproducible Data Science", and I think it is clear that it is simplest and most reproducible to include the data in the Dockerfile (as discussed in rule 7). If so, the result can also be included in the image, and then it need not even be run. This way, it cannot be run incorrectly, which can be an advantage in corner cases. Of course, other requirements such as maintainability may force one to separate container images from the data being processed, thus preventing results from being stored in the container, but then this rule belongs in a "Ten Simple Rules for Writing Dockerfiles for Maintainable Data Science" document :)
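If the results are baked into the image, they can indeed be retrieved without ever starting the container, for example with `docker create` and `docker cp` (image tag and paths are hypothetical):

```shell
# Create (but never start) a container from the built image,
# copy the precomputed results out, then remove the container.
docker create --name results-tmp myanalysis:1.0
docker cp results-tmp:/analysis/results ./results
docker rm results-tmp
```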

vsoch commented 2 years ago

Yes, but if your data is 7TB you aren’t going to put it in a container. That statement applies to small data only (which to be fair, is quite a lot). If there are identifiers in the data you also couldn’t easily share it publicly. So it’s not always possible or feasible to do so.

sdettmer commented 2 years ago

@vsoch Yes, I see, but doesn't the rule require that even small data (which could easily be shared) not be stored inside the container, only mounted? So I read it as: "If the data is large, it (unfortunately) cannot be stored in the container, so it can only be mounted at run time." (I do see that for maintainability it is probably better to mount smaller data as well, especially assuming it is available in some archive anyway.)
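For comparison, the mount-at-run-time pattern the rule describes would look something like this (image tag and host path are hypothetical):

```shell
# Bind-mount the dataset read-only at run time instead of
# storing it in the image.
docker run --rm -v "$(pwd)/data:/analysis/data:ro" myanalysis:1.0
```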

vsoch commented 2 years ago

The rule does not explicitly state that; it targets "large" datasets time and time again and suggests that small datasets are OK (though the point could have been made clearer).