nuest / ten-simple-rules-dockerfiles

Ten Simple Rules for Writing Dockerfiles for Reproducible Data Science
https://doi.org/10.1371/journal.pcbi.1008316
Creative Commons Attribution 4.0 International

Discussion - Rule 9. Publish a Dockerfile per project in a code repository with version control #16

Closed: psychemedia closed this issue 4 years ago

psychemedia commented 4 years ago

Typo: several proecceses

Use one Dockerfile per workflow or project and put one "thing" in; TO DISCUSS: argue against the above rule and recommend having a process manager and multiple processes in one container

I think the original best practice was to have one thing per container and then use docker-compose to build, e.g., workbenches from several inter-networked containers.

Recent examples show how to use things like supervisord to run and manage multiple services within the same container.
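
A minimal sketch of the supervisord approach, for illustration only. The base image, the two services (nginx and redis-server stand in for whatever you actually need), and the `supervisord.conf` file next to the Dockerfile are all assumptions, not something taken from the manuscript:

```dockerfile
# Sketch: several services in one container, managed by supervisord.
# Base image and service choice are assumptions for illustration.
FROM debian:bookworm-slim

# Install the process manager and the (placeholder) services in one layer
RUN apt-get update \
 && apt-get install -y --no-install-recommends supervisor nginx redis-server \
 && rm -rf /var/lib/apt/lists/*

# Hypothetical config file: it would hold one [program:...] section per service.
# Debian's default /etc/supervisor/supervisord.conf includes conf.d/*.conf.
COPY supervisord.conf /etc/supervisor/conf.d/services.conf

# -n keeps supervisord in the foreground so it acts as the container's main process
CMD ["/usr/bin/supervisord", "-n", "-c", "/etc/supervisor/supervisord.conf"]
```

The trade-off against the one-thing-per-container approach is that the container now lives and dies with supervisord rather than with any single service, so logs and restarts for each service are handled inside the container instead of by Docker.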

To try to get my head round this, I did some naive exploration here. One issue that arose: if you are enabling multiple services, how should you build the image? All the installs in a single Dockerfile? Or a set of staged Dockerfiles that build on top of each other? (Note: this is not a multistage build, where you essentially extract from prior images to build layers you then incorporate in your final image.) I'm not sure what best practice would be.

Also, this raises the issue of Dockerfile stacks, e.g. as per Jupyter docker stacks or the legacy DIT4C containers, as a good practice strategy?

(In a research group, you may want a base image that every other project builds on; in edu, I've been trying to explore how a base institution container might contain a minimum viable, branded notebook server, for example, that could be used as a base for course customised environments/ images.)
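
To make the stack idea concrete, here is a rough sketch of what a project-level Dockerfile on top of a shared base might look like. The image name `example.org/lab/base-notebook`, its tag, and the assumption that the base already ships `pip` are all hypothetical:

```dockerfile
# Sketch: a project image stacked on a shared group/institution base image.
# The base image name and tag are hypothetical placeholders.
ARG BASE_IMAGE=example.org/lab/base-notebook:2020.03
FROM ${BASE_IMAGE}

# Only project-specific additions go here; everything common lives in the base.
# Assumes the base image provides pip.
COPY requirements.txt /tmp/requirements.txt
RUN pip install --no-cache-dir -r /tmp/requirements.txt
```

Unlike a multistage build, nothing is copied out of an earlier stage with `COPY --from`; each Dockerfile in the stack is built and tagged as its own image, and the next one simply names it in `FROM`. Pinning the base to a specific tag rather than `latest` keeps the project image from changing silently when the base is rebuilt.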

This rule mentions connecting to databases; one issue there is that you may also want a recipe that builds a seeded database, not just a database. Building images that provide access to computational environment + data environment is often what is required for reproducibility.
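
One possible recipe for a seeded database, sketched with the official `postgres` image, which runs any `*.sql` or `*.sh` files found in `/docker-entrypoint-initdb.d/` when the container first starts with an empty data directory. The file name `seed.sql` and the image tag are assumptions:

```dockerfile
# Sketch: a database image that seeds itself on first start.
# seed.sql is a hypothetical dump kept in the repository next to this Dockerfile.
FROM postgres:12

# Init scripts run once, in name order, only if the data directory is empty
COPY seed.sql /docker-entrypoint-initdb.d/10-seed.sql
```

Note that the seeding happens when a container is first started, not at build time, so the image carries the recipe and the dump rather than an already populated data directory. A password still has to be supplied at run time, for example via the `POSTGRES_PASSWORD` environment variable.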

On this point, where do things like notebooks sit? They are outside the computational environment, but inside the analysis environment, along with particular datasets (are datasets inside the analysis environment, or a sibling of the computational and analysis environments?). To make something reproducible, you need a computational environment and a data environment that the analysis scripts can run against?

psychemedia commented 4 years ago

This rule mentions git. Would it be worth creating a worked-up example repo that demonstrates a best practice process through the use of a first commit, then several further commits showing how to add particular features, how to refactor against particular rules, etc.?

This could take quite a lot of work to sketch out, and then, erm, commit... It would be a bit like writing a play, and then performing it...

nuest commented 4 years ago

@psychemedia I think the process of iterating through the rules based on a given Dockerfile sounds very interesting, and it is certainly related to #4. I cannot, erm, commit to doing that right now; I'd rather get the manuscript out and see what others (experts) think.

psychemedia commented 4 years ago

Do you know if anyone has a Dockerfile of which they are particularly proud and we could perhaps try to reverse engineer it?

vsoch commented 4 years ago

What do you mean by reverse engineering a Dockerfile? Do you mean reverse engineering a container image?

psychemedia commented 4 years ago

(from my not very useable phone): no, more a commented critique of a pre-existing dockerfile, or a step by step walk through of how you might end up with a particular dockerfile from steps that follow the rules

sje30 commented 4 years ago

Yes, I was thinking the same -- a "before" and "after" dockerfile.

On Sun, Mar 15 2020, Tony Hirst wrote:

(from my not very useable phone): no, more a commented critique of a pre-existing dockerfile, or a step by step walk through of how you might end up with a particular dockerfile from steps that follow the rules

nuest commented 4 years ago

I've reopened the examples issue #4 and added a call for inputs in the READMEs.

Regarding image stacks, I think we only scratch the surface of that topic in the article now, with the multiple images for different use cases (Rule 8), but we shouldn't go deeper given the target audience.