nuest / ten-simple-rules-dockerfiles

Ten Simple Rules for Writing Dockerfiles for Reproducible Data Science
https://doi.org/10.1371/journal.pcbi.1008316
Creative Commons Attribution 4.0 International

Discussion: Rule 1 - Don't write Dockerfiles by hand #8

Closed · psychemedia closed this issue 4 years ago

psychemedia commented 5 years ago

Whilst repo2docker can be used to generate Dockerfiles, I'm pretty sure I read somewhere that the repo2docker team doesn't recommend using it that way?

Ah... here.

betatim commented 5 years ago

Maybe the point is too subtle or needs to be made in a positive way: use a tool to create the container image directly. That is, don't first write a Dockerfile by hand and then build the container image from it.


r2d currently does create a Dockerfile which it then turns into an image. However we consider that an implementation detail and hence tell people that r2d isn't a tool for generating Dockerfiles. We want to keep the door open for implementing other methods for going from repo to container image without first creating a Dockerfile. This is a hedge for when the docker CLI stops being the tool of choice for creating images. Currently all alternatives are "beta" though.

psychemedia commented 5 years ago

Something to perhaps tease out of this:

I'm not sure if holepunch generates a Dockerfile that is intended to stand on its own?

I'm also aware of stencila's dockta, though off the top of my head I don't recall whether it is intended to produce archivable Dockerfiles, or Dockerfiles that are meant to be called as part of a wider run system.

Is it worth trying to situate Dockerfiles in different workflows? For example: a standalone Dockerfile, where a user runs a container created from an image built from the archived Dockerfile, versus a Dockerfile created as part of some runtime process or system, where the Dockerfile is intended as a component of that system, which may caveat its use outside that system (making the Dockerfile essentially part of a black-box build process)?

FWIW, I think repo2docker was originally based on OpenShift's source2image / s2i tool. See here for a discussion of the repo2docker history and why it moved away from s2i.

vsoch commented 5 years ago

I have some strong disagreement here. If you use a tool to generate a Dockerfile, and you are able to commit that Dockerfile alongside your code (and rebuild regularly for security fixes, etc.), this seems okay. But directing most users to tools that abstract away the process and spit out a container (where the user has no idea how to generate the same thing on their own) seems highly dangerous. The reasons are:

Dockerfiles are fairly simple. That doesn't mean everyone should roll their own, or that they shouldn't use a tool, but to blatantly say "Don't write Dockerfiles by hand" really says "You should generate them with tools", and frankly I don't think the tools out there are good enough to say that. I would adjust this rule to something like "Rule 1: Use tools to assist with Dockerfile generation."

vsoch commented 5 years ago

I'll open a PR with my suggested changes / spelling fixes, etc. I'm just on Rule 1 so expect a little bit :)

vsoch commented 5 years ago

Just curious: have you not written the meat of most of the paper yet (with bullet points here instead)? Should I follow the bullet-point format until the ideas are fleshed out?

psychemedia commented 5 years ago

I also wonder about having templated Dockerfiles that then pull in other files in a disciplined way.

For example, should you:

a) include Python or Linux package requirements explicitly in the Dockerfile; or
b) include Python or Linux package requirements in conventionally named files such as requirements.txt and apt.txt, and then have those acted on via boilerplate code in the Dockerfile (see the sketch at the end of this comment)?

This then makes the Dockerfile more of a build tool than a dependency-specification tool?

In turn, this also means, in the short term, identifying sensible conventions, such as those proposed by The Reproducible Execution Environment Specification.
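
A minimal sketch of option (b), assuming repo2docker/REES-style file names (requirements.txt and apt.txt sitting next to the Dockerfile); the base image and exact commands are illustrative, not a recommendation:

```dockerfile
# Illustrative only: the dependency lists live outside the Dockerfile,
# which itself stays generic boilerplate.
FROM python:3.8-slim

# Copy the conventionally named dependency files into the image.
COPY apt.txt requirements.txt /tmp/

# Install the system packages listed in apt.txt ...
RUN apt-get update \
 && apt-get install -y --no-install-recommends $(cat /tmp/apt.txt) \
 && rm -rf /var/lib/apt/lists/*

# ... and the Python packages pinned in requirements.txt.
RUN pip install --no-cache-dir -r /tmp/requirements.txt
```

The Dockerfile then documents how the dependencies are applied, while the same requirements.txt remains usable for everyday work outside the container.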

vsoch commented 5 years ago

For a template recipe that is intended to build someone else's container with something like a requirements.txt added by the user, ONBUILD would be appropriate here. Otherwise, for software that is built into the container, any kind of dependency file works well, as long as it's kept in the container. The distinguishing feature between these two use cases is whether the Dockerfile is intended to be used as a build template (ONBUILD) or to fully represent a reproducible analysis.
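
For the build-template case, a minimal ONBUILD sketch (the image name and paths are made up for illustration) could look like this; the deferred instructions only run when someone later builds a child image FROM this one:

```dockerfile
# Hypothetical template image, published once, e.g. as example/py-template.
FROM python:3.8-slim

# Deferred steps: these execute during the *child* image build and pick up
# the requirements.txt that the downstream user places next to their Dockerfile.
ONBUILD COPY requirements.txt /tmp/requirements.txt
ONBUILD RUN pip install --no-cache-dir -r /tmp/requirements.txt
```

A downstream Dockerfile then shrinks to a FROM line plus whatever the analysis itself needs, which matches the "build template" rather than "full record of a reproducible analysis" role described above.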

betatim commented 5 years ago

I agree and disagree. You shouldn't rely on tools, because each one becomes a new dependency. However, writing a Dockerfile that works well is pretty tricky; reading one, on the other hand, is pretty straightforward.

The reason I think writing Dockerfiles is tricky: this is what I think is a good (not perfect) one for a set of software that isn't very complicated. If we wrote it by hand and tuned it for this particular use case, we could probably reduce it to 50-100 lines instead of 150. However, if I actually had to write it by hand, I'd end up with something even shorter by not doing several of the things this file does. Those omissions would probably come back to bite me at some later point.
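
To make the omissions point concrete, here is a rough sketch (not the linked file; all names and versions below are illustrative) of the kind of housekeeping a generated Dockerfile tends to include and a quick hand-written one often skips:

```dockerfile
# Illustrative fragment only.
FROM ubuntu:18.04

# Keep apt non-interactive and clean up its cache so the layer stays small
# and the build never hangs on a prompt.
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update \
 && apt-get install -y --no-install-recommends python3 python3-pip locales \
 && rm -rf /var/lib/apt/lists/*

# Set the locale explicitly instead of relying on the base image default.
RUN locale-gen en_US.UTF-8
ENV LANG=en_US.UTF-8

# Run as an unprivileged user rather than as root.
RUN useradd --create-home jovyan
USER jovyan
WORKDIR /home/jovyan
```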

The counter-argument to "don't use a tool, just generate a Dockerfile once" is that by using the tool I automatically benefit from all the improvements and security upgrades made to the tool and to the container images it generates.

https://repo2docker.readthedocs.io/en/latest/specification.html is what I'd really like to see pushed forward. Its final goal would be to provide a human-readable set of instructions for how to "process" the information from a repository to create the computational environment it wants to run in. If we achieve this, repo2docker is just a tool you use because it is more convenient than typing stuff yourself, not because it does stuff you don't understand.

vsoch commented 5 years ago

I think an issue that we are running into (and will continue to) is that, for both the writers and readers of this manuscript, we all bring very specific ideas and use cases for running containers. For example, @nuest I'd guess you are heavily using containers for interactive (and then reproducing) data science, and you use a lot of Rocker bases. @betatim you work on repo2docker, so you are heavily using notebooks (and likely a lot of other stuff). I'm heavily a user of containers on HPC, meaning usually not Docker but derivatives of it, which works nicely for the most part, but adds another level of "best practices" for Docker containers that are intended to be pulled as read-only containers. Aside from that, the majority of my container-ing is for cloud infrastructure and its development.

So, with this context, it would be very challenging to make a global statement about the "Top 10 Simple Rules" for Dockerfiles. The use case that any given reader or writer has in mind when reading is very likely different. For this particular point about writing by hand versus using a tool, it hugely depends upon what the reader/writer has in mind and the exact use case. If you are creating a data science notebook just for use on a local machine? Sure, a Rocker base would work, and repo2docker would be a great tool for that. But if you are intending to run the same notebook on your research cluster? A Rocker base would crash the machine. Thus, how do we give advice about these rules without being very explicit about these different situations?

Another dimension embedded here is the (somewhat of a) trade-off between reproducibility and other things we care about, like security. Sure, I can hard-code the most specific of versions into a container, and that means it's reproducible (and, for the sake of argument, let's assume that automated regular builds are happening), but if security fixes are pushed to ubuntu:18.04 while I'm using ubuntu:bionic-10102019 (and latest becomes bionic-11102019 or whatnot), my container (if rebuilt) will reproduce exactly, but I won't have those fixes.
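
A hedged illustration of that trade-off, with made-up tags in the style of the date-stamped Ubuntu ones:

```dockerfile
# Option A: frozen snapshot tag. Rebuilds reproduce the same environment,
# but security patches later applied to 18.04 never arrive.
FROM ubuntu:bionic-20191010

# Option B: moving tag. Rebuilds pick up the latest patched 18.04 image,
# at the cost of the environment potentially changing between builds.
# FROM ubuntu:18.04
```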

This is why I think you have to start with the "What am I trying to do?" question, and then ask "Is there a tool that is optimized for that goal?" If yes, use it (for example, if I'm just trying to build some notebook that matches a formula in repo2docker, I'm good!). If not, then look to regularly updated base containers that come closest to what you are looking for, and then (hopefully) build simple Dockerfiles on top of that. But this kind of workflow feels very different to me from some global advice to "Don't write Dockerfiles by hand."

psychemedia commented 5 years ago

Clashing the "what am I trying to do?" question with the "10 best rules..." idea, it seems we could ask for "10 Best Docker Rules For... X", where X is: security, (personal) reproducibility, (personal/social) sharing, HPC, long-term archiving; and from this we need to decide whether the rules are particularly opinionated in view of one of these stances?

When it comes to the reproducibility or sharing stance, I think care needs to be taken over reproducibility when/where. If we know that access to some sort of Docker Hub is guaranteed, that Docker is available to all stakeholders, and that we only need to be able to reproduce things over 1-2 months, perhaps behind enterprise auth, we may be okay with possibly insecure container images at a particular build, referenced from a "reproducible" Dockerfile.

If each stakeholder wants to build an image from scratch, then a different sort of Dockerfile is required. And even then, we have to decide on what the base image is and how reproducible that is.

Even if you have institutionally managed base containers (security-patched, maybe even supported), there are still issues of how you manage what folk add into the container, whether anything under the control of the base image could break things higher up the stack if the base image is changed in any way, or what you do about folk higher up the stack undoing things in the base image.

Are there threads anywhere about the policies taken in maintaining things like repo2docker base layers, Jupyter Docker Stacks images, or environments defined in Azure Notebooks, for example, and how they try to mitigate breaking changes for users caused by changes to base images clashing with customisations users are likely to have added on top?

vsoch commented 5 years ago

I think it might be good (for this paper) to scope it to one of those particular (X) use cases, and clearly define the use cases that we have in mind before diving into the reasons. If the goal is reproducible data science, then that would include both data science / notebooks and HPC uses. Thoughts?

nuest commented 5 years ago

Apologies for just now catching up on this discussion.

I agree with the scoping to one use case, and going through the great effort that @vsoch put in via #21 I also realised that the number of useful (!) recommendations and the level of detail is high by now, so reducing the scope should increase the understandability and usability of the guidelines.

To me, X = "reproducibility of scientific workflows for published articles". There are certainly other, secondary goals that are helped by that (sharing, archiving, collaboration, re-use). I would even go with "personal reproducibility" before trying to recommend usage by others, because, as @vsoch points out, the use cases can be very diverse. The HPC-ready image might just as well "crash" on my local machine as the Rocker image does on a cluster, right?

And regarding the global advice not to write Dockerfiles by hand: I think it makes sense to have this as the first rule, because users should think about it first. I wouldn't say it is the most important one.

Re. self-containment of Dockerfiles vs. externalising stuff into requirements.txt (at least that's my perspective on this question): I see value in both. For self-containment because, well, it's all there. For externalising because these existing configuration files are (a) more likely to be used in everyday working and development (unless we strictly recommend people to only work in a container), and (b) complemented by stable tooling and documentation, so users can effectively achieve what they need. The crucial argument for me is: even if parts of the information are off-loaded to a requirements.txt, that fact is transparent in the Dockerfile and can thus be investigated if need be.

vsoch commented 5 years ago

The HPC-ready image might just as well "crash" on my local machine as the Rocker image does on a cluster, right?

Exactly. Even saying "reproducibility of scientific workflows" means different things. It could be a container with MPI (hugely likely not to work with Docker directly, but workable via Docker -> Singularity), something intended to run on Slurm/SGE/similar, or (in your case @nuest) something that just runs on your local machine with smaller data.

I understand the spirit of the rule and agree: if there does exist a tool or base image that is maintained and uses (some) best practices, use it! But very commonly there isn't, so the real task is to figure out what you are trying to do, check whether a tool or regularly updated base image covers it, and only then write a (hopefully simple) Dockerfile yourself.

In an ideal world we would have a much larger selection of both base images and tools to choose from (and the user wouldn't need to write a Dockerfile), but we aren't quite there yet.

nuest commented 5 years ago

And even if we are "there" at one point in time, that might change again, especially in research-related workflows, because people will want to use new things, and there might be some catch-up game between new features and the generator tools.

Anyway, I'll revisit the comments here again and try to iterate on rule 1 once more with the notion of the rule being a multi-step process (no suitable tool? no suitable image? no suitable base image?).

@nokome maybe you have the time to provide your perspective here (or in any other section of the article!) ?

vsoch commented 5 years ago

Sounds good! I definitely +1 that multi-step process approach.

nuest commented 4 years ago

Is holepunch worth including, as yet another solution to get to a Binder-ready repo?

psychemedia commented 4 years ago

FWIW, I've seen holepunch mentioned in the wild a couple of times...

vsoch commented 4 years ago

I've never heard of it! But I think I might live in a different jungle :P :tanabata_tree: :lion: :monkey:

psychemedia commented 4 years ago

I think I saw it mentioned in the context of various workshops, though offhand I don't recall which. Trying to find them, I found something I hadn't spotted before: a recent workshop, A Reproducible Data Analysis Workflow with R Markdown, Git, Make, and Docker (psyarxiv can be slow to load; there is a direct alt-link to the PDF).