nuest / ten-simple-rules-dockerfiles

Ten Simple Rules for Writing Dockerfiles for Reproducible Data Science
https://doi.org/10.1371/journal.pcbi.1008316
Creative Commons Attribution 4.0 International

comments about rule 7: "Mount dataset at run time" #102

Open · sdettmer opened this issue 2 years ago

sdettmer commented 2 years ago


To be reproducible, nothing can differ between runs; everything must be the same. Thus it does not matter when the data is included. Of course, for practical use on different data, or for maintenance, this might be a good habit.

vsoch commented 2 years ago

The data itself can be "the same" from an archive - that still doesn't make it feasible to store a very large dataset in the container.

sdettmer commented 2 years ago

@vsoch It might not be the most efficient or the easiest to maintain, but it is best for reproducibility. Of course there are other requirements besides reproducibility, such as, let's say, "maintainability", so the rule should be moved there. Actually, a lot of rules lead to conflicts: the more reproducible something is, the more expensive the maintenance might become (storing everything is a burden by itself).

But if you have an archive of the data anyway, why not add it as a layer in the OCI image and be done?
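
A minimal sketch of what that could look like (the base image and file names here are only placeholders, not from the thread): the archived dataset is copied into the image as its own layer and unpacked at build time.

```dockerfile
FROM rocker/r-ver:4.3.1

# Copy the archived dataset into the image as a dedicated layer and
# unpack it. The layer is cached across rebuilds, but every pull of
# the image still has to download the full archive.
COPY data.tar.gz /data/data.tar.gz
RUN tar -xzf /data/data.tar.gz -C /data && rm /data/data.tar.gz

# Analysis code (hypothetical script name).
COPY analysis.R /workspace/analysis.R
WORKDIR /workspace
CMD ["Rscript", "analysis.R"]
```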

vsoch commented 2 years ago

It doesn't sound like you've worked with the large datasets that I have in the past, where it would not be reasonable or feasible to add them to a container.

sdettmer commented 2 years ago

@vsoch Yes, I have not worked with large data; my biggest RAID is only 0.1 PB, so I couldn't even store that (I know the LHC produces 2 GB/sec and has some hundred PBs stored), but I work on reproducible systems (reproducible in the "continuum" sense from the previous item, unfortunately :)). But for small datasets, why should these be mounted at runtime? Because the rule says so :D

vsoch commented 2 years ago

No, small datasets are appropriate to add to the container, given the data is de-identified.

As an exception, you should include dummy or small test datasets in the image to ensure that a container is functional without the actual dataset, e.g., for automated tests, instructions in the user manual, or peer review (see also “functional testing logic” in [12]). For all these cases, you should provide clear instructions in the README file on how to use the actual (or dummy) data, and how to obtain and mount it if it is kept outside of the image. When publishing your workspace, e.g., on Zenodo, having datasets outside of the container also makes them more accessible to others, for example, for reuse or analysis.

The rule targets large datasets time and time again, so perhaps we didn't make this clear enough, because we added the context that it's for review/testing, which it doesn't need to be. Small datasets are fine to include, as long as they are reasonably small.
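
A minimal sketch of that setup (all file names are hypothetical, not from the paper): the image ships only a small dummy dataset so it works out of the box, and the README tells users how to bind-mount the actual data over /data at run time.

```dockerfile
FROM python:3.11-slim

# Analysis code and its dependencies.
COPY requirements.txt run_analysis.py /workspace/
RUN pip install --no-cache-dir -r /workspace/requirements.txt

# Small dummy dataset so the container is functional without the real
# data, e.g. for automated tests, the user manual, or peer review.
COPY tests/dummy_data.csv /data/input.csv

WORKDIR /workspace
CMD ["python", "run_analysis.py", "--input", "/data/input.csv"]
```

Running against the actual dataset then just overrides the dummy file, e.g. `docker run -v /path/to/real/data:/data myimage`, as the README would describe.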

sdettmer commented 2 years ago

@vsoch Yes, I see. Please remember I don't want to criticize any work; I just like to give input for possible future improvements. If the rule applies only to large data, where it is needed anyway, why have it at all?

vsoch commented 2 years ago

The rule is saying:

It's reasonably stated, and although the focus is on large datasets (to tell the user about bind mounts), small datasets inside the container (alongside other small files) are also in scope. This feels like nitpicking to me, and like losing sight of the audience the writing was intended for.

sdettmer commented 2 years ago

@vsoch Hehe, yes, it might be. I just wonder why you have a rule like "if you cannot put the data in the container, then don't put it in the container" :)

vsoch commented 2 years ago

Because it is possible to put large data in a container, and people do try (and registries vary in the size and number of layers they accept), so it needs to be explicitly stated. A container is not a storage vehicle for large data; it's for small data and/or software and analysis scripts. We have better means (data archives, object storage) that are more appropriate for that.
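
A sketch of that split (the paths and image name are made up): the dataset lives in an archive or object store, is fetched once to the host, and is only bind-mounted when the container runs, so the image itself stays small.

```sh
# The dataset stays outside the image, e.g. downloaded once from a
# data archive (such as a Zenodo record) or synced from object storage.
mkdir -p /scratch/dataset
# ... download or sync the data into /scratch/dataset ...

# Bind-mount it read-only into the container at run time.
docker run --rm \
  -v /scratch/dataset:/data:ro \
  myorg/analysis:1.0
```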