pypa / pipenv

Python Development Workflow for Humans.
https://pipenv.pypa.io
MIT License

Make Pipfile.lock sufficient for build reproducibility #5575

Open rittneje opened 1 year ago

rittneje commented 1 year ago


Is your feature request related to a problem? Please describe.

Unfortunately, the Pipfile.lock alone is insufficient for true reproducibility over time. For example, running pipenv install might select the manylinux_2_24 wheel of some dependency today, because my system only has glibc 2.24. But then if I upgrade to glibc 2.28 later, running pipenv install with the exact same Pipfile.lock will now select the manylinux_2_28 wheel. This is because the lockfile contains the hash of every known distribution, which is itself necessary in order for developers on different operating systems to share the Pipfile.lock.

Similarly, if I run pipenv install on a system with glibc 2.28, and another developer runs it on a system with glibc 2.24, we will install two different things.
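To make the scenario concrete, here is a deliberately simplified model of how an installer might choose between manylinux wheels. It is only a sketch: `pick_manylinux_tag` is a hypothetical helper, not pip's or pipenv's actual code, and it assumes the installer prefers the newest PEP 600 tag that the local glibc can satisfy.

```python
import re

# Matches PEP 600 platform tags such as "manylinux_2_28_x86_64".
_MANYLINUX = re.compile(r"^manylinux_(\d+)_(\d+)_(.+)$")


def pick_manylinux_tag(available_tags, glibc_version, arch="x86_64"):
    """Return the newest manylinux tag the given glibc can run, or None.

    Hypothetical helper: a simplified stand-in for real tag matching.
    """
    best = None
    for tag in available_tags:
        match = _MANYLINUX.match(tag)
        if not match:
            continue
        required = (int(match.group(1)), int(match.group(2)))
        if match.group(3) != arch or required > glibc_version:
            continue  # wrong architecture, or the wheel needs a newer glibc
        if best is None or required > best[0]:
            best = (required, tag)
    return best[1] if best else None


# Same set of published wheels (and so the same lockfile hashes) in both calls.
published = ["manylinux_2_24_x86_64", "manylinux_2_28_x86_64"]
print(pick_manylinux_tag(published, (2, 24)))  # manylinux_2_24_x86_64
print(pick_manylinux_tag(published, (2, 28)))  # manylinux_2_28_x86_64
```

The lockfile is identical in both calls; only the glibc version of the machine differs, and that alone changes which file gets installed.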

Describe the solution you'd like

Unfortunately, I don't really know what to do about this given the current Python module ecosystem.

Maybe there can be a flag that, if set, causes the Pipfile.lock to only include hashes for the "any" distribution by default, and for any other distribution you have to include the matching identifier in a new section of the Pipfile?

Describe alternatives you've considered

We can manually edit the Pipfile.lock to remove the hashes for the distributions it should not select, but this is cumbersome and error-prone, especially because the Pipfile.lock itself does not say which is which.
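One way to see "which is which" before editing by hand might be to label each lockfile hash with the filename it belongs to. The sketch below assumes the lock entries use the `sha256:<hex>` form and that file metadata is available as a list of dicts with `filename` and `digests` keys (the shape PyPI's JSON API returns). `label_lock_hashes` is a hypothetical helper, not part of pipenv.

```python
def label_lock_hashes(lock_hashes, pypi_files):
    """Map each 'sha256:<hex>' lock entry to a filename, or None if unknown.

    Hypothetical helper; pypi_files is a list of dicts with "filename"
    and "digests" keys.
    """
    by_digest = {f["digests"]["sha256"]: f["filename"] for f in pypi_files}
    labels = {}
    for entry in lock_hashes:
        algorithm, _, digest = entry.partition(":")
        labels[entry] = by_digest.get(digest) if algorithm == "sha256" else None
    return labels


# One real file record for scikit-learn 1.2.1, trimmed to the keys used here.
files = [
    {
        "filename": "scikit_learn-1.2.1-cp38-cp38-"
        "manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
        "digests": {
            "sha256": "5b2c5d9930ced2b7821ad936b9940706ccb5471d89b8a516bb641cec87257d1c"
        },
    }
]
known = "sha256:5b2c5d9930ced2b7821ad936b9940706ccb5471d89b8a516bb641cec87257d1c"
unknown = "sha256:" + "0" * 64  # made-up placeholder digest, matches no file

for entry, filename in label_lock_hashes([known, unknown], files).items():
    print(entry[:19], "->", filename)
```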

matteius commented 1 year ago

I have read this issue a couple of times, and I think the build is reproducible. The problem seems to be that the system changes; the build on the changed system is also reproducible, but the alternate wheel is somehow problematic for that system? I don't quite understand this part. That sounds like a bug in the specific package being requested, but the issue doesn't give enough detail about what goes wrong when the manylinux_2_28 wheel is selected.

rittneje commented 1 year ago

There are two pieces of the puzzle.

  1. One of the motivations for using pipenv is to make sure that all developers install the exact same thing. With a regular old requirements.txt file, this is difficult to do, particularly with transitive dependencies. The desire is that as long as Pipfile.lock has not changed, pipenv will install the same thing on my machine and your machine today and tomorrow. However, because wheel selection exists outside of what the Pipfile.lock restricts (sort of), this is not true, unless we specifically use only pure Python modules.

  2. In one particular use case, we want to download the dependencies on behalf of another machine. We know that machine only has glibc 2.24. However, the machine that is downloading the dependencies happens to have glibc 2.28. So pipenv ends up selecting the manylinux_2_28 variant, which is not what we want.

To me, both of these are due to the same underlying problem - pipenv does not currently allow you to restrict which wheels it will record in the Pipfile.lock, and by extension which wheels are eligible for installation, without manually editing the Pipfile.lock. Fixing #5571 would help to some extent, but it would still be preferable for all the selection logic to come from the Pipfile.

matteius commented 1 year ago

@rittneje :

One of the motivations for using pipenv is to make sure that all developers install the exact same thing.

Kind of -- the motivation is to install the exact same version, not necessarily the same thing, or there would never be more than one hash in the Pipfile.lock per package.

In your particular case you are trying to build the binary on one machine and run it on a different machine type -- this has never been supported. I would suggest building a docker container and running the built container on whatever system you need, so that the container has the same dependencies/OS as what it was built with.

pipenv does not currently allow you to restrict which wheels it will record in the Pipfile.lock, and by extension which wheels are eligible for installation, without manually editing the Pipfile.lock

This seems true, but I wonder how we can really support that with the pip resolver under the hood. We aren't modifying the code within pip unless it involves a security fix (such as the package index restrictions), so the question is whether there is a way to support this with additional arguments to PackageFinder. You could try modifying the code here to pass a relevant platform to the calls to get_package_finder: https://github.com/pypa/pipenv/blob/main/pipenv/utils/resolver.py#L564-L582 There is one other spot that calls this function too: https://github.com/pypa/pipenv/blob/main/pipenv/environment.py#L611-L613

The goal would be to determine whether there is a way to restrict the package finder or the resolver so that it finds only what you are looking for, or at least installs in your preferred order from what is found, possibly by restricting the finder platform. I am not familiar with manylinux, so my testing capabilities may be limited, but I would be curious to hear whatever you find out.

rittneje commented 1 year ago

Kind of -- the motivation is to install the exact same version, not necessarily the same thing, or there would never be more than one hash in the Pipfile.lock per package.

Right, I mean the motivation for us is to have it always install the same thing, to avoid the whole "works on my machine" problem. I know that sometimes it is inevitable that it might have to install different things depending on what kind of modules are actually published to PyPI, but the ask is to be able to control that. This is also helpful from a security/auditing/testing standpoint. If I've verified that, say, the Linux wheel of some library is good, that doesn't mean that the macos wheel is also good.

I would suggest building a docker container and running the built container on whatever system such that the container is the same dependencies/OS as what it was built with.

We are using docker. The problem is that arranging for it to not have glibc 2.28 is non-trivial. It would be a lot simpler (for us) if pipenv could be told not to consider the manylinux_2_28 wheels.

we aren't modifying the code within pip unless it involves a security fix (such as the package index restrictions)

Can you elaborate on how exactly pipenv finds all the hashes so it can write them into the Pipfile.lock? That does not seem to be a normal function of pip, since it would only select a distribution for the current platform.

matteius commented 1 year ago

Can you elaborate on how exactly pipenv finds all the hashes so it can write them into the Pipfile.lock? That does not seem to be a normal function of pip, since it would only select a distribution for the current platform.

@rittneje I can try to, sure. Basically we do something we have been asked not to do, which is use pip's internal resolver under the hood. We instantiate internal pip classes throughout, and the resolver code path in the pipenv implementation is still a bit messy. We use the internal pip resolver to find the specific version of the package; from there it looks like it flows through collect_hashes, which for pypi.org packages calls _get_hashes_from_pypi, which gets the hashes from PyPI's JSON API at https://pypi.org/pypi/{ireq.name}/json -- so for example, for scikit-learn, https://pypi.org/pypi/scikit-learn/json

That basically scrapes the whole set of URLs available, with entries like this:

{
  "comment_text": "",
  "digests": {
    "blake2b_256": "f0950ea0a2412e33080a47ec02802210c008a7a540471581c95145f030d304b4",
    "md5": "d2c9f4ae53bce092f74e0798f9ff842d",
    "sha256": "5b2c5d9930ced2b7821ad936b9940706ccb5471d89b8a516bb641cec87257d1c"
  },
  "downloads": -1,
  "filename": "scikit_learn-1.2.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
  "has_sig": false,
  "md5_digest": "d2c9f4ae53bce092f74e0798f9ff842d",
  "packagetype": "bdist_wheel",
  "python_version": "cp38",
  "requires_python": ">=3.8",
  "size": 9763865,
  "upload_time": "2023-01-24T16:42:57",
  "upload_time_iso_8601": "2023-01-24T16:42:57.643504Z",
  "url": "https://files.pythonhosted.org/packages/f0/95/0ea0a2412e33080a47ec02802210c008a7a540471581c95145f030d304b4/scikit_learn-1.2.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
  "yanked": false,
  "yanked_reason": null
},
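As a rough illustration of the hash collection described above (a sketch, not pipenv's actual `_get_hashes_from_pypi`), the following uses only the standard library. `fetch_release_files` and `hashes_for_release` are hypothetical names, and the code assumes the project-level JSON response exposes a `releases` mapping from version to file records like the one shown.

```python
import json
import urllib.request


def fetch_release_files(name, version):
    """Download the file records for one release from the PyPI JSON API.

    Hypothetical helper; assumes the response has a "releases" mapping.
    """
    url = f"https://pypi.org/pypi/{name}/json"
    with urllib.request.urlopen(url) as response:
        return json.load(response)["releases"][version]


def hashes_for_release(files):
    """Return sorted 'sha256:<hex>' strings, one per published file."""
    return sorted(f"sha256:{f['digests']['sha256']}" for f in files)


# Offline example using the single record quoted in this thread.
sample = [
    {
        "filename": "scikit_learn-1.2.1-cp38-cp38-"
        "manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
        "digests": {
            "sha256": "5b2c5d9930ced2b7821ad936b9940706ccb5471d89b8a516bb641cec87257d1c"
        },
    }
]
print(hashes_for_release(sample))
```

With network access, `hashes_for_release(fetch_release_files("scikit-learn", "1.2.1"))` would return a hash for every file of that release, regardless of platform, which is how hashes for wheels the current machine could never install end up in the lockfile.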

The only thing I can see in there is that the filename might have enough information in it. If there were already a parser for wheel filenames to get the platform, we could extend the PyPI method to restrict to specific criteria, which I think is what you are asking for. I definitely don't have time to take this on, but I could help coach someone on it. The other thing to consider is that this won't work for private PyPI indexes, which currently fetch the hashes in an even more primitive way by downloading each link on the HTML page. Maybe there is a way to apply the same logic there, but I haven't really thought about it yet.
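A minimal sketch of that filename-based filtering, assuming the standard wheel naming convention (`name-version[-build]-python-abi-platform.whl`, with "." separating compressed tag sets). `split_wheel_filename` and `keep_file` are hypothetical helpers; the `packaging` library's `parse_wheel_filename` would likely be the more robust choice in real code. Keeping sdists and "any" wheels unconditionally is a design assumption here, not something pipenv does today.

```python
def split_wheel_filename(filename):
    """Split a wheel filename into name, version and compatibility tags."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel: {filename}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) not in (5, 6):  # 6 when the optional build tag is present
        raise ValueError(f"unexpected wheel filename: {filename}")
    return {
        "name": parts[0],
        "version": parts[1],
        "python": parts[-3].split("."),
        "abi": parts[-2].split("."),
        "platforms": parts[-1].split("."),  # compressed tag sets use "."
    }


def keep_file(filename, allowed_platforms):
    """Decide whether a file's hash should be recorded.

    Assumption: sdists and pure-Python ("any") wheels are always kept;
    other wheels need at least one allowed platform tag.
    """
    if not filename.endswith(".whl"):
        return True
    platforms = set(split_wheel_filename(filename)["platforms"])
    return "any" in platforms or bool(platforms & set(allowed_platforms))


wheel = "scikit_learn-1.2.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl"
print(split_wheel_filename(wheel)["platforms"])
print(keep_file(wheel, {"manylinux2014_x86_64"}))
print(keep_file(wheel, {"manylinux_2_28_x86_64"}))
```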

kalebmckale commented 1 year ago

@matteius Do you think this is something that can be specified by PEP 508 specifiers? I've not ventured into using them yet, but if so, and the specifier is used in the Pipfile, does the Pipfile.lock then restrict / filter the versions it includes?
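For context on what PEP 508 markers can express, the sketch below computes most of the environment marker values from the standard library, following the definitions in PEP 508. `environment_markers` is a hypothetical helper. As far as I can tell, none of these markers carries the glibc version, which `platform.libc_ver()` reports separately, so markers alone may not be able to tell manylinux_2_24 from manylinux_2_28.

```python
import os
import platform
import sys


def environment_markers():
    """Compute a subset of the PEP 508 environment markers (hypothetical helper)."""
    return {
        "os_name": os.name,
        "sys_platform": sys.platform,
        "platform_machine": platform.machine(),
        "platform_system": platform.system(),
        "platform_release": platform.release(),
        "platform_version": platform.version(),
        "python_version": ".".join(platform.python_version_tuple()[:2]),
        "python_full_version": platform.python_version(),
        "platform_python_implementation": platform.python_implementation(),
        "implementation_name": sys.implementation.name,
    }


for marker, value in environment_markers().items():
    print(f"{marker} = {value!r}")

# Not a PEP 508 marker: the C library and version the interpreter sees.
print("libc:", platform.libc_ver())
```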

matteius commented 1 year ago

@rittneje It is possible that, with the improvements to hash collection in 2023.9.1, you will have better results with this issue, but I cannot be certain without a re-check.