Open minrk opened 5 years ago
I think this would be cool. How could we make it opt-in so that the default remains the pip install ... that a user would type? This way we stick with our principle to do what a user would do and not something slightly different.
Two options that come to mind straight away: use something like python-3.7 2019-01-01 in runtime.txt or search for a comment like # moment in time: 2019-01-01 on the first line of the requirements.txt.
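The second option could be sketched roughly as below. This is only an illustration of the idea: the marker format is copied from the suggestion above and is not an existing convention, and the function name `snapshot_date` is hypothetical.

```python
import re
from datetime import date
from typing import Optional

# Hypothetical opt-in marker, taken from the suggestion in this thread.
# It is not something pip or repo2docker recognises today.
MARKER = re.compile(r"^#\s*moment in time:\s*(\d{4})-(\d{2})-(\d{2})\s*$")


def snapshot_date(requirements_text: str) -> Optional[date]:
    """Return the opt-in date from the first line, or None when it is absent."""
    lines = requirements_text.splitlines()
    first_line = lines[0] if lines else ""
    match = MARKER.match(first_line)
    if match is None:
        # No marker on the first line: keep the default "plain pip install" behaviour.
        return None
    year, month, day = (int(part) for part in match.groups())
    return date(year, month, day)
```

Because the marker is an ordinary comment, pip ignores it, so the file still installs the same way for a user who types the command by hand.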
Since we use the same thing for R, I almost want a generic timestamp field (another idea for repo2docker.yml), where a value like 'true' or 'auto' could use the commit date rather than a specified point in time. I don't like adding new repo2docker-specific inputs, though.
There are a few ideas here:
I think our two goals of facilitating reproducible publications and more lax workshop/demo/education repos make it hard to have clear choices on things like this.
There's no way we are reproducing the author's environment if any package we are installing comes after the commit date.
You can mix and rearrange git commits, e.g. git rebase, so the commit date won't necessarily be the date the repo is known to work. It makes sense for tags though, and might also make sense if there's a way to get the timestamp for the git push associated with the commit.
Commit date seems too auto-magic and brittle to me. (What happens if I update the README?)
Having python-x.y-yyyy-mm-dd in runtime.txt seems the best option (as for R). For better reproducibility, it should also pin the python patch version (or we have that as python-x.y.z) and the installation tools, like pip, setuptools (e.g., MarkupSafe==1.0 does from setuptools import Feature which has been removed), and wheel.
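A minimal sketch of parsing such a line, assuming the proposed python-x.y-yyyy-mm-dd form with an optional patch version. The names `RuntimeSpec` and `parse_runtime` are hypothetical, and the dated form is the proposal from this thread rather than something runtime.txt is known to support for Python.

```python
import re
from datetime import date
from typing import NamedTuple, Optional


class RuntimeSpec(NamedTuple):
    python_version: str        # "3.7" or "3.7.3"
    snapshot: Optional[date]   # None when no date is given


# Proposed format: python-x.y[.z][-yyyy-mm-dd]
RUNTIME = re.compile(r"^python-(\d+\.\d+(?:\.\d+)?)(?:-(\d{4}-\d{2}-\d{2}))?$")


def parse_runtime(text: str) -> RuntimeSpec:
    """Split a runtime.txt line into a Python version and an optional date."""
    match = RUNTIME.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognised runtime.txt content: {text!r}")
    version, day = match.groups()
    return RuntimeSpec(version, date.fromisoformat(day) if day else None)
```

Keeping the date optional means an existing python-x.y file would parse unchanged, so the dated form stays opt-in.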
True reproducibility is hard and I think it's fair to ask users wanting to go the extra mile to do a little more work (learning about the issues along the way).
Having done a study of repos this summer with @vildeeide, mostly repos that existed a couple of years ago, I increasingly think we should at least be picking the default Python version based on the commit date, assuming the repo has not specified anything else. Tons of repos have unpinned Python (because there is still no standard, widely used way to pin Python), but pin e.g. numpy or pandas to a version that doesn't work on Python 3.8. A very large fraction of these would have built just fine if we had picked the fairly obvious Python 3.6 as a starting point.
@mdeff to your point, I think if a repo updates the README and doesn't pin things like Python, it is absolutely reasonable for that to result in building against a new Python. That's already the case today, as any change in a repo results in a rebuild, which will always get the latest Python if unspecified. Note that we are only talking about the default behavior when folks leave Python unspecified. As it is now, the time a build is requested decides which version of Python we use, completely regardless of repo modification times. This change should make it strictly more likely that stale repos continue to build for longer.
@manics I don't think we should be overly concerned with diverging commit dates. While it's technically possible to fake dates, this is rare and the consequences are negligible. Things like rebase allow GIT_AUTHOR_DATE and GIT_COMMITTER_DATE to diverge, and we should think carefully about picking which to consider (COMMITTER_DATE should always be later, and is probably more appropriate when they do diverge. We could also pick the later of the two), but these rarely diverge by more than days and are unlikely to cross a transition point, which currently happens once per year.
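As a rough sketch of that rule: take the later of the author and committer dates, then look up the newest Python minor version available at that point. The cutoffs below are the upstream CPython release dates and are used purely for illustration. The real transition points for the default would be a policy decision, and the function names are hypothetical.

```python
from bisect import bisect_right
from datetime import date, datetime

# Illustrative cutoffs only (upstream CPython x.y.0 release dates).
# An actual default would likely switch some time after each release.
CUTOFFS = [
    (date(2015, 9, 13), "3.5"),
    (date(2016, 12, 23), "3.6"),
    (date(2018, 6, 27), "3.7"),
    (date(2019, 10, 14), "3.8"),
]


def effective_date(author: datetime, committer: datetime) -> date:
    """Pick the later of the two timestamps (both must be timezone-aware)."""
    return max(author, committer).date()


def default_python(when: date) -> str:
    """Return the newest minor version whose cutoff is on or before `when`."""
    index = bisect_right([cutoff for cutoff, _ in CUTOFFS], when)
    # Commits older than the first cutoff fall back to the oldest entry.
    return CUTOFFS[max(index - 1, 0)][1]
```

With roughly yearly cutoffs, a few days of divergence between the two dates only matters for commits made right at a transition.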
I think there are two separate issues here:
At this point, I think we should begin to use commit date to pick the default Python minor version, and leave pypi-timemachine as an opt-in experiment for folks interested in stricter reproducibility, though if it is opt-in, I'm not sure it has any benefit over a standard pinned environment.
Should we make a new issue for "Pick Python minor version based on commit date"? I like the idea of picking Python versions and also splitting this and pypi-timemachine up.
Issues like #729 are considered "unsolvable" because the state of PyPI can change over time. However, pypi-timemachine is exactly the sort of tool that's meant to solve this problem by only considering packages available at a given date.
repo2docker could combine pypi-timemachine with git commit date for more reproducible builds. I'm not sure if this is a good idea or not, but we effectively do this same thing with R by using Microsoft's snapshots.
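The core idea, "only consider releases uploaded on or before the commit date", can be sketched without any particular tool. This is not how pypi-timemachine itself is implemented. The function name and the shape of the input (a mapping from version string to upload time) are assumptions made for the example.

```python
from datetime import datetime
from typing import Dict, List


def available_at(upload_times: Dict[str, datetime], cutoff: datetime) -> List[str]:
    """Return the versions uploaded on or before `cutoff`, oldest first.

    `upload_times` maps a version string to its upload timestamp; all
    datetimes must be consistently timezone-aware (or all naive).
    """
    visible = [(when, version) for version, when in upload_times.items() if when <= cutoff]
    return [version for _, version in sorted(visible)]
```

Using the commit date as `cutoff` would hide every release published after the commit, which is the same effect the dated snapshots give for R.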