nuest / ten-simple-rules-dockerfiles

Ten Simple Rules for Writing Dockerfiles for Reproducible Data Science
https://doi.org/10.1371/journal.pcbi.1008316
Creative Commons Attribution 4.0 International

Discussion - Rule 3. Use formatting and favour clarity #10

psychemedia closed this issue 4 years ago

psychemedia commented 4 years ago

Would it make sense to give explicit examples of good practice and bad practice, perhaps in a contextualised way, eg show a colour highlighted git diff going from bad practice to good practice with a comment or a git commit line explaining the change in terms of the rule applied? Or maybe link to a supporting git repo where a scrappy Dockerfile has been revised into a best practice example?

When mentioning:

- put each dependency on its own line; this makes it easier to spot changes in version control
- split up an instruction (especially relevant for RUN) when you have to scroll to see all of it

a naive reader might misinterpret this instruction and put lots of things on separate lines, each with its own RUN command; would that break layering?
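A sketch of the two readings (my own example; the package names are placeholders):

```dockerfile
# Intended reading: one RUN instruction, one dependency per continuation line.
RUN apt-get update && apt-get install -y \
      gfortran \
      libcurl4-openssl-dev \
      libxml2-dev

# Possible misreading: one RUN per dependency, which creates extra layers.
RUN apt-get update
RUN apt-get install -y gfortran
RUN apt-get install -y libcurl4-openssl-dev
RUN apt-get install -y libxml2-dev
```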

So the instruction:

don't worry about image size, clarity is more important (i.e. no complex RUN instructions that remove files right away after use)

is problematic when it comes to writing Dockerfiles that build "efficient" images?
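For contrast, the kind of size-optimising pattern the rule advises against looks something like this (a sketch, assuming a Debian-based image):

```dockerfile
# Size-optimised but harder to read: the apt package cache is removed
# in the same layer that created it, so it never ends up in the image.
RUN apt-get update \
    && apt-get install -y --no-install-recommends gfortran \
    && rm -rf /var/lib/apt/lists/*
```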

have commands in order from least likely to change to most likely to change; this helps readers and takes advantage of build caching
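A sketch of that ordering (the image tag and file names are illustrative only):

```dockerfile
# Least likely to change: the base image.
FROM rocker/r-ver:4.0.2
# Changes occasionally: system libraries.
RUN apt-get update && apt-get install -y libxml2-dev
# Changes more often: the package installation script.
COPY install.R /tmp/
RUN Rscript /tmp/install.R
# Changes most often: the analysis code, so it goes last.
COPY analysis/ /home/analysis/
```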

Would it make sense to have a section at the start of the paper that describes the anatomy of a Dockerfile, and perhaps also situates it in a workflow (Dockerfile -> image -> container)?
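Something as minimal as this might do (my own sketch, not from the paper):

```dockerfile
# Base image to build on.
FROM ubuntu:20.04
# Install software; each instruction adds a layer to the image.
RUN apt-get update && apt-get install -y python3
# Copy files from the build context into the image.
COPY script.py /home/
# Default command when a container is started from the image.
CMD ["python3", "/home/script.py"]

# Workflow: Dockerfile -> image -> container
#   docker build -t myimage .    (Dockerfile -> image)
#   docker run --rm myimage      (image -> container)
```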

Only switch directoryies with WORKDIR {-}

[typo - directoryies] So in terms of best practice, is there something here about identifying not just which directory you are in and how to change it, but also how to select appropriate USERs for running certain commands?
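For instance (my own sketch, assuming a Debian-based image; the user name is hypothetical):

```dockerfile
# Privileged steps (e.g. installing software) run as root.
USER root
RUN apt-get update && apt-get install -y git
# Create an unprivileged user for running the workflow.
RUN useradd --create-home analyst
# Subsequent instructions, and the container itself, run as this user.
USER analyst
# Prefer WORKDIR over RUN cd ..., which only affects that single RUN.
WORKDIR /home/analyst
```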

nuest commented 4 years ago

I personally don't think that "efficiency" of the images matters for reproducible computing environments. Rule 10 gives some advice on how to avoid constantly re-building the image, but other than that I think a data scientist will spend far more time thinking about and running a workflow than she will spend waiting for an image to build.

Re. "anatomy of a Dockerfile": good point, tried to do the minimum only in https://github.com/nuest/ten-simple-rules-dockerfiles/commit/e482b86a1fa97d1c038df9992ffb75d54a9aba21 and https://github.com/nuest/ten-simple-rules-dockerfiles/commit/d08dc3e9cf820a5f47566d4d33a400c75df2d95a

IMO splitting a RUN command does not "break" layering; it might even improve it. Still, I tried to clarify.

Since you digressed from the "one discussion issue per rule" scheme, I'll close this one; feel free to re-open if important comments are not yet addressed.

psychemedia commented 4 years ago

Re: splitting things across RUN commands - it adds to the number of layers, doesn't it? Which means it can also affect the amount of time/bandwidth required to download an image.

If you have a build step followed by an rm tidy-up step: finishing the RUN with && rm ... means the removed files never end up in the layer, whereas with RUN ..build bits... followed by a separate RUN rm ..., you still download the files in the first layer and only then delete them in the second?
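To make that concrete, a sketch (the URL is a placeholder, and I'm assuming wget is available in the image):

```dockerfile
# Variant A: fetch and clean up in a single RUN. The archive is created
# and removed within one layer, so it never appears in the image and is
# never downloaded by anyone pulling the image.
RUN wget https://example.org/data.tar.gz \
    && tar -xzf data.tar.gz \
    && rm data.tar.gz

# Variant B: the first layer still contains data.tar.gz; the second
# layer only records its deletion. Anyone pulling the image downloads
# the archive in the first layer even though it is invisible afterwards.
RUN wget https://example.org/data.tar.gz && tar -xzf data.tar.gz
RUN rm data.tar.gz
```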

Of course, my naive understanding of how the implemented mechanics of docker containers actually work could be completely wrong!