pip should record the original index url or find links url of the package

sambhav commented 2 years ago

What's the problem this feature will solve?

Currently users may download packages from various pip repositories (apart from pypi). These packages may contain alternative versions of packages from pypi. In order to better capture origin information, pip should record which package repository it downloaded a specific package from.

Describe the solution you'd like

pip should record this information in the dist-info or egg-info folders it generates

Alternative Solutions

Do nothing

Additional context

This would be useful for capturing origin and provenance data about pip packages. Specifically, package-url (a well-defined package ID spec) relies on a parameter called repository_url to carry this information. If pip records this information, downstream SBOM generation tools can use it. Related issue: https://github.com/anchore/syft/issues/680
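To illustrate what such a tool might emit, here is a minimal sketch that builds a package-url string for a PyPI-style package with an optional repository_url. The helper name `pypi_purl` is hypothetical, and the encoding is a simplified reading of the package-url spec (lowercase the name, replace underscores with dashes, percent-encode values), not a complete implementation.

```python
from typing import Optional
from urllib.parse import quote


def pypi_purl(name: str, version: str, repository_url: Optional[str] = None) -> str:
    """Build a simplified package-url for a PyPI-style package.

    Hypothetical helper: a sketch of the purl 'pypi' type rules,
    not a full implementation of the package-url spec.
    """
    # The pypi purl type lowercases the name and turns '_' into '-'.
    normalized = name.lower().replace("_", "-")
    # Percent-encode the version so a local version's '+' survives as %2B.
    purl = f"pkg:pypi/{quote(normalized, safe='')}@{quote(version, safe='')}"
    if repository_url:
        # repository_url records which index or find-links page served the file.
        purl += "?repository_url=" + quote(repository_url, safe="")
    return purl


if __name__ == "__main__":
    print(pypi_purl(
        "torch",
        "1.10.1+cu113",
        "https://download.pytorch.org/whl/cu113/torch_stable.html",
    ))
```

Without the repository_url part, two installs of the same name and version from different indexes produce identical identifiers, which is the gap this issue describes.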

Code of Conduct

[X] I agree to follow the PSF Code of Conduct.

pfmoore commented 2 years ago

These packages may contain alternative versions of packages from pypi.

Please clarify this statement. What do you mean by "alternative" here? It's a packaging/distribution error if two wheels or sdists, claiming to be for the same project/version, are different. So I don't understand how it can be important to know where a package came from, for any legitimate usage.

sambhav commented 2 years ago

These packages may contain alternative versions of packages from pypi.

Please clarify this statement. What do you mean by "alternative" here? It's a packaging/distribution error if two wheels or sdists, claiming to be for the same project/version, are different. So I don't understand how it can be important to know where a package came from, for any legitimate usage.

There are various places where this might come into play -

- For example, there are alternative wheels for various scientific packages that are compiled against specific libraries (e.g. https://www.lfd.uci.edu/~gohlke/pythonlibs/#numpy ), and torch has wheels with the same name for various CUDA versions: https://download.pytorch.org/whl/cu101/torch_stable.html vs https://download.pytorch.org/whl/cu102/torch_stable.html vs https://download.pytorch.org/whl/cu110/torch_stable.html
- Enterprises often host internal mirrors/caches of PyPI, which may contain wheels or patched versions of various packages that override the version published upstream on PyPI

In cases like these and others it is vital to know where pip downloaded the package from since you cannot directly trace a package/version (+local version in cases like above) to a package repository it originated from if it's not the default repo set in the pip config.

It is also not possible to determine it from the pip config, since pip could have been given extra options like --extra-index-url or --find-links during that specific invocation. (For example, torch recommends installing alternative packages like so:

pip3 install torch==1.10.1+cu113 torchvision==0.11.2+cu113 torchaudio==0.10.1+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html

from https://pytorch.org/)

pfmoore commented 2 years ago

There are various places where this might come into play

Those are all the sort of "not legitimate" usages I tried to point out in my reply. If the builds differ, they should be distinguished using different local version identifiers added to the version, or something similar. Or the install should be from a direct URL rather than a name/version specifier, if the file has to come from a particular location.

Maybe there are issues with doing things like this. But these are the principles on which the current packaging standards are based, and if they aren't sufficient, this should be fixed by changing the standards that all tools follow, not by changing a single tool (pip) to support a non-standard usage.

Note that the torch example you give uses a local identifier, so they are following the standards. But you shouldn't be assuming that the package must come from https://download.pytorch.org/whl/cu113/torch_stable.html - you could set up a local mirror, or get it from a redistributor, or whatever. What matters is that you get torch==1.10.1+cu113, not where you get it from.

So maybe the issue here isn't with how packages like torch distribute their binaries, but how your workflow is set up to manage package sources.

sambhav commented 2 years ago

Again, the goal of this ticket is to capture provenance data so that we can construct an accurate bill of materials for a given set of packages. This is so that they can later be scanned and identified.

Even if you have a package with a version and local version identifier, it's still impossible to map that back to where it's downloaded from if you want to reproduce your environment.

Or the install should be from a direct URL rather than a name/version specifier, if the file has to come from a particular location.

Yes, and it is exactly for cases like these that I wish to record where pip downloaded the original package from.

For example, say you are given a package and would like to verify its integrity after the fact, or reproduce the original environment. You cannot do that unless you know the original source the package was downloaded from.

and if they aren't sufficient, this should be fixed by changing the standards that all tools follow, not by changing a single tool (pip) to support a non-standard usage.

I am not asking pip to support anything rather than record some metadata as to where it fetched packages from. It will not change the install behavior, nor will it add any new ways of installing packages. This is just to have a record or audit log of the operations that pip performed. It could also be stored in some pip-specific file instead of dist-info, in case that would violate certain specs.

pfmoore commented 2 years ago

Pip doesn't provide that level of audit trail for where installed packages come from. You need to manage that sort of provenance outside of pip, IMO.

Longer term, you may be interested in https://discuss.python.org/t/pep-665-take-2-a-file-format-to-list-python-dependencies-for-reproducibility-of-an-application/11736 which is proposing a reproducible lockfile format that might suit your needs.

pradyunsg commented 2 years ago

As I understand this, this request is born out of wanting to generate a SBOM (Software bill of materials) which is increasingly being used in corporate spaces[^1] to track what pieces of software are being used. https://github.com/anchore/syft/issues/680 is probably where this came out of, and that project tries to generate the SBOM post-facto IIUC.

I tend to agree that this isn't something that pip can solve on its own -- this needs someone to look into how this metadata should be captured for Python packages as a whole (i.e. design a general standard, similar to https://www.python.org/dev/peps/pep-0610/; expanding something like that to non-direct-url installations).
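For context on what PEP 610 already covers, here is a minimal sketch of reading the `direct_url.json` file that installers write for direct-URL installs. The helper names `read_direct_url` and `direct_url_for` are hypothetical. For packages installed by name from an index the file is absent and the helpers return None.

```python
import json
from importlib.metadata import Distribution, PackageNotFoundError, distribution
from typing import Optional


def read_direct_url(dist: Distribution) -> Optional[dict]:
    """Return the parsed PEP 610 direct_url.json of a distribution, or None.

    Hypothetical helper. The file only exists for direct-URL installs,
    so index installs yield None.
    """
    raw = dist.read_text("direct_url.json")
    return json.loads(raw) if raw else None


def direct_url_for(name: str) -> Optional[dict]:
    """Look up an installed distribution by name and read its direct_url.json."""
    try:
        return read_direct_url(distribution(name))
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Prints None unless pip itself was installed from a direct URL.
    print(direct_url_for("pip"))
```

The None result for ordinary index installs is exactly the case a broader standard would have to address.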

[^1]: Correspondingly, I'm... bleh... on having volunteers spend time designing and developing a solution for this.

pradyunsg commented 2 years ago

/cc @woodruffw who worked on https://github.com/trailofbits/pip-audit/, and likely has a better idea of what generating an SBOM for Python environments looks like / needs.

pfmoore commented 2 years ago

As I understand this, this request is born out of wanting to generate a SBOM (Software bill of materials)

Ah, that's interesting as background. And that does tend to push me even further towards thinking that this should be handled outside of pip. Organisations which want this sort of tracking can mandate that a specific tool[^1] is used for all installs, and that tool can layer the necessary audit trail management on top of the basic install.

I also agree with @pradyunsg in terms of being uncomfortable with the idea of volunteers designing and developing, and even more so maintaining, such a mechanism, which is directed squarely at commercial users of pip. Even if someone came up with funding for the development of such a feature within pip, it would still be volunteers handling the support, and dealing with frustrated users when their auditors are pushing for data they don't have.

I'd much rather see this sort of requirement satisfied by a commercial, paid for service[^2] that uses pip internally, than have it become a pip feature.

I am not asking pip to support anything rather than record some metadata as to where it fetched packages from.

That metadata would need to be standardised, and only then would pip implement that standard. If someone wants to propose such a metadata standard, then that's perfectly fine. In that case we'd do what the standard says, and it wouldn't be on us as volunteers to debate or decide whether the metadata we write addresses the "bill of materials" requirement.

[^1]: Which can wrap pip for the actual install machinery.

[^2]: Or a free one - I don't want to dictate what other volunteers are willing to spend their time on 🙂

pradyunsg commented 2 years ago

this request is born out of wanting to generate a SBOM (Software bill of materials)

With #53 and the new --report flag, users who care about generating SBOMs and knowing the exact artifact used by pip can get that information as part of the installation report, and can munge that information into whatever SBOM format they like. If you have feedback on the format, please feel free to comment here or file a new issue for it.

Note that the format of that file is experimental for now (to allow us to make changes based on initial feedback), and that this will need to live in whatever "build" tooling you have: you'll need to wrap pip with something that ensures the installation report is generated, and that transforms the information in the report into whatever SBOM format you like/use. I'll flag that https://github.com/trailofbits/pip-audit is a thing that exists as well.

The relevant documentation on the installation report format is available at https://pip.pypa.io/en/stable/reference/installation-report/
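As a rough illustration of the wrapping step, here is a minimal sketch that pulls name, version and download URL out of an installation report. It assumes the report layout described in the documentation linked above (an `install` list whose items carry `download_info` and `metadata`). The sample report, its URL and its hash are made up for the example, and `summarize_report` is a hypothetical helper. Since the format is experimental, treat the field names as subject to change.

```python
import json
from typing import Dict, List

# Made-up sample in the shape of a pip installation report.
# The URL and hash are placeholders, not real artifacts.
SAMPLE_REPORT = """
{
  "version": "1",
  "install": [
    {
      "download_info": {
        "url": "https://packages.example.invalid/whl/demo_pkg-1.0-py3-none-any.whl",
        "archive_info": {"hashes": {"sha256": "placeholder-digest"}}
      },
      "is_direct": false,
      "requested": true,
      "metadata": {"name": "demo-pkg", "version": "1.0"}
    }
  ]
}
"""


def summarize_report(report: Dict) -> List[Dict]:
    """Collect one provenance row per installed item (hypothetical helper)."""
    rows = []
    for item in report.get("install", []):
        info = item.get("download_info", {})
        archive = info.get("archive_info", {})
        meta = item.get("metadata", {})
        rows.append({
            "name": meta.get("name"),
            "version": meta.get("version"),
            "url": info.get("url"),  # where pip fetched the artifact from
            # Prefer the 'hashes' mapping; fall back to a single 'hash' string.
            "hashes": archive.get("hashes") or archive.get("hash"),
            "is_direct": item.get("is_direct"),
        })
    return rows


if __name__ == "__main__":
    for row in summarize_report(json.loads(SAMPLE_REPORT)):
        print(row)
```

In practice the input would be a file written by pip's --report flag rather than an inline string, and each row would then be mapped to fields in the chosen SBOM format.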

sbidoul commented 2 years ago

I labeled this issue with needs standard, since to record this information (presumably in .dist-info) a new interoperability standard has to be designed.

sethmlarson commented 1 year ago

For folks who are following this issue thread, there exists a PEP for such a standard: https://discuss.python.org/t/pep-710-recording-the-provenance-of-installed-packages/25428
