nuest / ten-simple-rules-dockerfiles

Ten Simple Rules for Writing Dockerfiles for Reproducible Data Science
https://doi.org/10.1371/journal.pcbi.1008316
Creative Commons Attribution 4.0 International

comments about rule 5: "Specify software versions" #100

Open sdettmer opened 2 years ago

sdettmer commented 2 years ago

comments about rule 5: "Specify software versions"

  1. A version number is not reliable: it usually cannot be formally guaranteed that an artifact identified by name and version is unique. There have been cases where packages were repackaged due to errors in the included license terms. This should normally have no effect on the build result, but it can. In a very different environment (i.e. not Docker), many years ago, I had a case in my team where a library returned an identifier string (a version string) that was copied into a statically sized buffer, which overflowed and led to a crash right at the start of the application linked against that library. The version string was "fixed" without changing the version number, for "legal reasons", but became a few characters longer, and in combination with other issues this led to the buffer overflow.
  2. Nowadays we can observe another bad habit called semantic versioning (https://semver.org/spec/v1.0.0.html). Where a library is available in versions 1.2.3 and 1.2.4 and 1.2.3 needs to be patched, a strongly branch-able version scheme would introduce 1.2.3.0.1 (yes, two more digits are required for the general case), but with SemVer people are simply lost. SemVer 2.0.0 tries to address some of these issues: it would allow calling the 1.2.3 patch 1.2.4-<ALPHATAG>, because 1.2.4-<alphatag> sorts between 1.2.3 and 1.2.4, but that is at least counter-intuitive and still broken by design. glibc had a famous license issue and published versions with a postfixed "a", like 1.23a (I don't recall which versions were affected). Any other project can choose any other scheme, and each one breaks reproducibility. Local copies are required.
  3. Versions of forcibly used local copies technically don't matter. In short: the system must be stable even with bad versions, which in turn means versions don't need to be good.
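
To make point 2 concrete, here is a minimal sketch of SemVer 2.0.0 precedence showing the pre-release workaround: a patch to 1.2.3 published as "1.2.4-hotfix" sorts after 1.2.3 but before 1.2.4. The version strings are made up, and the pre-release comparison is simplified (the full SemVer rules compare dot-separated identifiers):

```python
# Simplified sketch of SemVer 2.0.0 precedence; version strings
# are hypothetical examples, not from any real package.

def semver_key(version):
    """Sort key implementing the core SemVer precedence rule:
    a pre-release version ranks *below* the associated release."""
    core, _, pre = version.partition("-")
    major, minor, patch = (int(x) for x in core.split("."))
    if pre:
        # Pre-release of X.Y.Z sorts before the X.Y.Z release.
        return (major, minor, patch, 0, pre)
    return (major, minor, patch, 1, "")

versions = ["1.2.4", "1.2.3", "1.2.4-hotfix"]
print(sorted(versions, key=semver_key))
# → ['1.2.3', '1.2.4-hotfix', '1.2.4']

# Note: a branch point *under* 1.2.3 (e.g. 1.2.3.0.1) has no
# place in this scheme -- which is the objection raised above.
```

So the patched 1.2.3 can only be expressed as a pre-release of a version that may never exist as described, which is the counter-intuitive part.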
vsoch commented 2 years ago
  1. It's not perfect, but it's mostly reliable; there are always edge cases.
  2. Many would disagree with you that this is a "bad" habit, and most tools I've worked with can handle semver or these special cases.
  3. Typically we are making the assumption that using a version is more of a "pin" - a point in time, over a main branch or similar, so it's at least an effort by the author to have more consistency than not using one. And "good" or "bad" is relative is it not? The pinned version just needs to work for the author of the container in question, which could be determined by testing the resulting container.
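
For illustration, the "pin" spectrum being discussed here can be shown in a Dockerfile; the image names, tags, digest placeholder, and package version below are hypothetical, not tested values:

```dockerfile
# Illustrative only: names, tags, and versions are placeholders.

# Weak pin: a tag is mutable, so the registry can re-point it.
FROM python:3.9-slim

# Stronger pin: a digest is content-addressed (like a Git hash),
# so it cannot silently change -- but the registry must still
# serve that exact blob years from now.
# FROM python@sha256:<digest>

# Pinning package versions narrows what the resolver may fetch,
# but still trusts the remote archive to keep serving them.
RUN pip install numpy==1.21.0
```

Each step down this list trades convenience for consistency; none of them removes the dependency on the remote service itself, which is sdettmer's point 3.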
sdettmer commented 2 years ago

@vsoch thank you for your quick reply.

  1. Yes, in short, you must copy the package, not store its identifier (especially if it is not even a message digest, as in Git).
  2. Yes, I know: these people do not maintain old versions; they just deliver the newest version and that's it. Simple cases (like having a few old baselines maintained) may accidentally work with SemVer, so even then people sometimes do not notice the problems. The tools and the people using them make false assumptions which often do no harm, but the fact that it accidentally works for them is, IMHO, not a sufficient basis for a recommendation, especially given that a fully working scheme has existed for ages which is simple and superior. Environments that work fine with SemVer probably often do not need reproducibility (they ship a new everything every time anyway), but here the focus is reproducibility, and thus different environments. These, for example, need to present old results again and might need to fix mistakes in them; this quickly requires branching on that old state instead of using a new one, and branching (in general) is not possible with SemVer.
  3. Yes, the "pin" is used when the system is not fully reproducible, to "pin" external dependencies, but this works only as long as the external dependency is accidentally met (accidentally, because it is out of your control). For reproducibility, there must be no externally uncontrollable dependencies, only local copies. So a reproducible build system uses local copies, which might be pinned by their path. In practice, you cannot trust the version numbers of external dependencies: who says their maintainers read your rules :-)
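
A minimal sketch of what "pinned by its path" could look like in practice: keep the artifact in the repository and verify its content hash before use, so the build fails hard if the local copy ever changes. The file names are made up, and the file is created here so the example is self-contained:

```python
# Sketch of a "local copy pinned by path" with a recorded content
# hash; "libexample-1.2.3.tar" is a hypothetical artifact name.
import hashlib
from pathlib import Path

vendor = Path("vendor")
vendor.mkdir(exist_ok=True)
artifact = vendor / "libexample-1.2.3.tar"
artifact.write_bytes(b"pretend-library-contents\n")

# Record the content hash once, at vendoring time.
recorded = hashlib.sha256(artifact.read_bytes()).hexdigest()
(vendor / "libexample-1.2.3.tar.sha256").write_text(recorded)

def verify(path: Path, expected: str) -> None:
    """Fail the build if the vendored file no longer matches
    the hash recorded when it was vendored."""
    actual = hashlib.sha256(path.read_bytes()).hexdigest()
    if actual != expected:
        raise RuntimeError(f"{path} changed: {actual} != {expected}")

# At build time, verify the local copy before using it.
verify(artifact, (vendor / "libexample-1.2.3.tar.sha256").read_text())
print("vendored artifact verified")
```

Unlike a name-and-version pin, this checks the content itself, so even a silent repackaging under the same version number would be caught.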
vsoch commented 2 years ago

So in summary, "the real problem with software reproducibility is people" is what I'm hearing. :laughing:

We do our best @sdettmer. Again, we are not perfect, and that's OK.

sdettmer commented 2 years ago

@vsoch Yes, of course you are doing exceptionally well, and in no way do I think otherwise! I also see the many advantages this approach has, for example lower costs. It is just that for reproducibility it seems to be misleading. In my experience, your containers probably won't build in ten years because of some little details, and in ten years it can be very difficult to fix those details. I assume you already guessed that I have spent quite some time doing exactly that :-) Maybe it is not needed (and then not worth the effort) to be reproducible in ten years, sure, but if it is, better to have local copies of everything needed. And a DVD drive if needed; try to get a working DLT1 drive or a predecessor device today and you know what I mean, or an 8-inch disk drive :) Maybe the intended duration of reproducibility could be added to the rules: is it, let's say, weeks (until everybody has upgraded), or years (a bit harder), one to two decades (starts getting interesting), or even more?