jupyterhub / repo2docker

Turn repositories into Jupyter-enabled Docker images
https://repo2docker.readthedocs.io
BSD 3-Clause "New" or "Revised" License

pypi time machine? #740

Open minrk opened 5 years ago

minrk commented 5 years ago

Issues like #729 are considered "unsolvable" because the state of PyPI can change over time. However, pypi-timemachine is exactly the sort of tool meant to solve this problem: it only considers packages that were available at a given date.

repo2docker could combine pypi-timemachine with the git commit date for more reproducible builds. I'm not sure if this is a good idea or not, but we effectively do the same thing with R by using Microsoft's snapshots.
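A minimal sketch of what combining the two could look like, not anything repo2docker does today. It assumes pypi-timemachine is started with a cutoff timestamp as its only argument and then serves a filtered index over HTTP; the helper names and the `index_url` parameter are hypothetical, and whether the tool interprets the timestamp as UTC is also an assumption.

```python
from datetime import datetime, timezone


def cutoff_string(commit_date: datetime) -> str:
    """Format a timezone-aware commit date as a UTC timestamp without offset."""
    return commit_date.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S")


def build_commands(commit_date: datetime, index_url: str):
    """Return (server_argv, install_argv) for a date-limited install.

    Assumption: `pypi-timemachine <timestamp>` starts the proxy index.
    `index_url` is whatever URL that server reports; it is not derived here.
    """
    server = ["pypi-timemachine", cutoff_string(commit_date)]
    install = ["pip", "install", "--index-url", index_url, "-r", "requirements.txt"]
    return server, install
```

The commit date itself would have to come from git; this sketch only covers turning a known date into the two invocations.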

betatim commented 5 years ago

I think this would be cool. How could we make it opt-in so that the default remains the pip install ... that a user would type? This way we stick with our principle to do what a user would do and not something slightly different.

Two options that come to mind straight away: use something like python-3.7 2019-01-01 in runtime.txt or search for a comment like # moment in time: 2019-01-01 on the first line of the requirements.txt.
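To make the two opt-in ideas concrete, here is a rough sketch of how each marker could be detected. Neither syntax is implemented in repo2docker; the regular expressions and function names are hypothetical and only follow the examples given above.

```python
import re
from datetime import date

# Hypothetical syntax: "python-3.7 2019-01-01" (the date part is optional).
RUNTIME_RE = re.compile(r"^python-(\d+\.\d+)(?:\s+(\d{4}-\d{2}-\d{2}))?$")
# Hypothetical syntax: "# moment in time: 2019-01-01" on the first line.
COMMENT_RE = re.compile(r"^#\s*moment in time:\s*(\d{4}-\d{2}-\d{2})\s*$")


def snapshot_from_runtime(text: str):
    """Return (python_version, snapshot_date or None) from runtime.txt contents."""
    match = RUNTIME_RE.match(text.strip())
    if match is None:
        return None, None
    version, day = match.groups()
    snapshot = date.fromisoformat(day) if day else None
    return version, snapshot


def snapshot_from_requirements(text: str):
    """Return the snapshot date from the first line of requirements.txt, if present."""
    lines = text.splitlines()
    first = lines[0] if lines else ""
    match = COMMENT_RE.match(first)
    return date.fromisoformat(match.group(1)) if match else None
```

In both cases a missing date returns None, so the default stays the plain pip install a user would type.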

minrk commented 5 years ago

Since we use the same thing for R, I almost want a generic timestamp field (another idea for repo2docker.yml), where a value like 'true' or 'auto' could use the commit date rather than a specified point in time. I don't like adding new repo2docker-specific inputs, though.

There are a few ideas here:

  1. if our goal is reproducibility, should this be the default behavior based on commit date? There's no way we are reproducing the author's environment if any package we are installing comes after the commit date. This might even apply to the repo2docker version as well, since there are two sources of packages: the base env and the user's packages.
  2. if our goal is simple automation of installs, then this should definitely be opt-in.

I think our two goals of facilitating reproducible publications and more lax workshop/demo/education repos make it hard to have clear choices on things like this.

manics commented 5 years ago

> There's no way we are reproducing the author's environment if any package we are installing comes after the commit date.

You can mix and rearrange git commits, e.g. git rebase, so the commit date won't necessarily be the date the repo is known to work. It makes sense for tags though, and might also make sense if there's a way to get the timestamp for the git push associated with the commit.

mdeff commented 4 years ago

Commit date seems too auto-magic and brittle to me. (What happens if I update the README?)

Having python-x.y-yyyy-mm-dd in runtime.txt seems the best option (as for R). For better reproducibility, it should also pin the Python patch version (or we could accept that as python-x.y.z) and the installation tools such as pip, setuptools, and wheel (e.g., MarkupSafe==1.0 does from setuptools import Feature, which has since been removed).

True reproducibility is hard and I think it's fair to ask users wanting to go the extra mile to do a little more work (learning about the issues along the way).

minrk commented 4 years ago

Having done a study of repos this summer with @vildeeide, mostly repos that existed a couple of years ago, I increasingly think we should at least be picking the default Python version based on the commit date, assuming the repo has not specified anything else. Tons of repos have unpinned Python (because there is still no standard, widely used way to pin Python), but pin e.g. numpy or pandas to a version that doesn't work on Python 3.8. A very large fraction of these would have built just fine if we picked the fairly obvious Python 3.6 as a starting point.

@mdeff to your point, I think if a repo updates the README and doesn't pin things like Python, it is absolutely reasonable for that to result in building against a new Python. That's already the case today, as any change in a repo results in a rebuild, which will always get the latest Python if unspecified. Note that we are only talking about the default behavior when folks leave Python unspecified. As it is now, the time a build is requested determines which version of Python we use, completely regardless of repo modification times. This change should make it strictly more likely that stale repos continue to build for longer.

@manics I don't think we should be overly concerned with diverging commit dates. While it's technically possible to fake dates, this is rare and the consequences are negligible. Things like rebase allow GIT_AUTHOR_DATE and GIT_COMMITTER_DATE to diverge, and we should think carefully about which one to consider. COMMITTER_DATE should always be later, and is probably more appropriate when they do diverge; we could also pick the later of the two. But these rarely diverge by more than days and are unlikely to cross a transition point, which currently happens once per year.
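As a sketch of the "pick the later of the two" option: the `%aI` and `%cI` placeholders of `git log --format` print the author and committer dates in strict ISO 8601. The function names below are hypothetical, and this is one possible policy rather than a decided one.

```python
import subprocess
from datetime import datetime


def _parse_iso(raw: str) -> datetime:
    # Some git versions print "Z" for UTC, which older Pythons cannot parse.
    if raw.endswith("Z"):
        raw = raw[:-1] + "+00:00"
    return datetime.fromisoformat(raw)


def parse_git_dates(output: str):
    """Parse '<author-date> <committer-date>' as printed by --format='%aI %cI'."""
    author_raw, committer_raw = output.split()
    return _parse_iso(author_raw), _parse_iso(committer_raw)


def effective_commit_date(author: datetime, committer: datetime) -> datetime:
    """Policy sketch: use whichever of the two dates is later."""
    return max(author, committer)


def read_head_dates(repo_dir: str = "."):
    """Read both dates for HEAD from a local checkout (requires git on PATH)."""
    result = subprocess.run(
        ["git", "log", "-1", "--format=%aI %cI"],
        cwd=repo_dir, check=True, capture_output=True, text=True,
    )
    return parse_git_dates(result.stdout)
```

Because both values carry a UTC offset, the comparison is done on absolute time, not on local wall-clock time.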

I think there are two separate issues here:

  1. how do we pick the default Python version, if unspecified, and
  2. when/how do we pin package repositories with something like pypi-timemachine

At this point, I think we should begin to use the commit date to pick the default Python minor version, and leave pypi-timemachine as an opt-in experiment for folks interested in stricter reproducibility, though if it is opt-in, I'm not sure it has any benefit over a standard pinned environment.
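A rough sketch of picking the default Python minor version from a commit date. The transition table is an assumption: it uses the upstream CPython release dates as stand-ins, since the dates on which repo2docker actually changed its default are not given in this thread. The function name and the `oldest` fallback are hypothetical.

```python
from bisect import bisect_right
from datetime import date

# Illustrative transition points only (upstream CPython release dates).
# Real transition dates for the repo2docker default would replace these.
TRANSITIONS = [
    (date(2016, 12, 23), "3.6"),
    (date(2018, 6, 27), "3.7"),
    (date(2019, 10, 14), "3.8"),
]


def default_python(commit_date: date, oldest: str = "3.6") -> str:
    """Return the newest version whose transition date is on or before commit_date."""
    dates = [transition for transition, _ in TRANSITIONS]
    index = bisect_right(dates, commit_date)
    if index == 0:
        # Commit predates every known transition: fall back to the oldest default.
        return oldest
    return TRANSITIONS[index - 1][1]
```

This only applies when the repo leaves Python unspecified; an explicit pin in runtime.txt or environment.yml would take precedence.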

betatim commented 4 years ago

Should we make a new issue for "Pick Python minor version based on commit date"? I like the idea of picking Python versions and also splitting this and pypi-time-machine up.