Improve resolution of Python / PIP dependencies

sschuberth commented 2 years ago

ORT's analyzer has various problems with resolving Python / PIP dependencies:

[x] Dependencies on native packages require native system tools to be installed, see #4578.
[x] The Python-3-compatibility check might fail, see #4289.
[x] Any specifically requested Python version is not adhered to, see #3671.
[x] We use some rather obscure helper scripts based on abandoned projects, see #2816.
[x] We might have some general problems with retrieving metadata:
- [x] #812
- [x] #509
- [x] #485
- [x] #5159
- [x] #5740

sschuberth commented 2 years ago

Possible solutions to the above include @pombredanne's proposal for an ACT-funded "Project-Multi Python-version dependencies resolver", or leveraging / extending existing tools like https://github.com/ddelange/pipgrip.

sschuberth commented 2 years ago

or leveraging / extending existing tools like https://github.com/ddelange/pipgrip.

See in particular https://github.com/ddelange/pipgrip/issues/40.

sschuberth commented 2 years ago

Also maybe worth a look as a helper tool is https://github.com/trailofbits/it-depends which claims to

Finds native dependencies for high level languages like Python

pombredanne commented 2 years ago

@sschuberth re:

Also maybe worth a look as a helper tool is https://github.com/trailofbits/it-depends which claims to

Finds native dependencies for high level languages like Python

From a quick look they seem to:

create a Docker image in https://github.com/trailofbits/it-depends/blob/8f8988330239c6d3eb39f05988fdbe6802f4bbbe/it_depends/pip.py#L35
run pip either directly in https://github.com/trailofbits/it-depends/blob/8f8988330239c6d3eb39f05988fdbe6802f4bbbe/it_depends/pip.py#L176 or through johnnydep in https://github.com/wimglenn/johnnydep/blob/master/johnnydep/pipper.py

sschuberth commented 2 years ago

Also see the difficulties in finding Python 2 example projects.

sschuberth commented 2 years ago

We could also take a deeper look at component-detection's approach for PIP.

sschuberth commented 2 years ago

Some interesting insights on the general topic from a Python maintainer, and a possible solution.

sschuberth commented 2 years ago

And yet another interesting discussion with links to:

pombredanne commented 2 years ago

@sschuberth FWIW, ScanCode does parse requirements files, setup.py, setup.cfg, pyproject.toml, Pipfile and Pipfile.lock and a few more, and has what is likely the best requirements parser around, https://github.com/nexB/pip-requirements-parser, which is also used in CycloneDX. You can see the code in action in https://github.com/nexB/scancode-toolkit/blob/syspacfiles/src/packagedcode/pypi.py. We also parse various Python metadata files and detect packages in various installed, archive and extracted layouts. We maintain https://github.com/nexB/dparse2 and https://github.com/nexB/pkginfo2 for additional manifest formats, and https://github.com/nexB/univers to parse all versions, including all Python package versions. We also built utilities to resolve, collect and download actual package archives based on these. And we are continuously adding support for new formats as they come.

sschuberth commented 2 years ago

ScanCode does parse requirements files, setup.py, setup.cfg, pyproject.toml, Pipfile and Pipfile.lock and a few more

Can you clarify on what "parse" means here exactly? I assume in the context of ScanCode only declared license data is parsed, but not declared direct and implied transitive dependencies, incl. resolution of version ranges to concrete versions. Correct?

pombredanne commented 2 years ago

Can you clarify on what "parse" means here exactly? I assume in the context of ScanCode only declared license data is parsed, but not declared direct and implied transitive dependencies, incl. resolution of version ranges to concrete versions. Correct?

By "parse" I mean collecting the data as it is found locally, without making any network call. This means:

parsing and normalizing actual package manifests (and of course all the declared data there such as licenses)
extracting direct dependencies constraints from manifests,
extracting resolved dependency versions from lockfiles,
collecting any extra data available from lockfiles (some formats have more data in their lockfiles; for example, newer npm lockfiles or PHP Composer lockfiles may contain declared license info).
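To make the "no network call" distinction concrete, here is a minimal sketch of the two extraction steps for Python, using only the standard library. It is not ScanCode's or pip-requirements-parser's implementation: the function names `parse_requirements` and `parse_pipfile_lock` are hypothetical, and the requirements handling is deliberately simplified (it ignores options, URLs and environment markers that real requirements files allow).

```python
import json
import re

# Simplified pattern: a project name, optional extras, then the version specifiers.
# Real requirements files support much more (URLs, options, markers, ...).
_REQ_LINE = re.compile(r"^\s*([A-Za-z0-9][A-Za-z0-9._-]*)\s*(\[[^\]]*\])?\s*([^;#]*)")


def parse_requirements(text):
    """Extract direct dependency constraints, e.g. {'requests': '>=2.0,<3'}."""
    constraints = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line or line.startswith("-"):  # skip blanks and options like "-r"
            continue
        line = line.split(";", 1)[0].strip()  # drop environment markers
        match = _REQ_LINE.match(line)
        if match:
            name = match.group(1).lower()
            constraints[name] = match.group(3).replace(" ", "")
    return constraints


def parse_pipfile_lock(text):
    """Extract already-resolved versions from Pipfile.lock JSON content."""
    data = json.loads(text)
    pinned = {}
    for section in ("default", "develop"):
        for name, entry in data.get(section, {}).items():
            version = entry.get("version", "")
            if version.startswith("=="):  # entries without a pin are skipped
                pinned[name.lower()] = version[2:]
    return pinned
```

The first function only yields constraints that still need resolving, while the second yields concrete versions that a lockfile already recorded; neither contacts a package index.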

This does not mean resolving dependencies and getting extra data for these dependencies yet: for Python and PyPI proper that's been the essence of the proposal I had put forward to the ACT project.

Now this will eventually happen as all parts are mostly in place now:

ScanCode collects all the explicit dependencies
Univers knows how to parse and make sense of most package versions, version constraints and version ranges, and how to resolve and evaluate version constraints to concrete versions given ranges.
VulnerableCode and FetchCode both know how to get the list of versions for a package by querying upstream registries' APIs.
FetchCode knows how to fetch actual package metadata from these APIs and also fetch the code.

The last step will be to bring these together: as it is, this could already be used to resolve transitive dependencies using a simple strategy such as getting the latest version. It would later benefit from adding extra version resolvers to emulate the behaviour of package managers, such as the pip solver (this was the ACT proposal), the pubgrub solver, the Maven solver, etc.
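The "simple strategy such as getting the latest version" can be sketched as a greedy, non-backtracking walk over the dependency graph. This is only an illustration under stated assumptions: the in-memory `index` mapping stands in for registry queries, the names `satisfies` and `resolve_latest` are hypothetical, versions are assumed to be plain dotted numbers, and none of this is Univers, FetchCode or pip code.

```python
import re

_OPS = {
    "==": lambda v, w: v == w,
    "!=": lambda v, w: v != w,
    ">=": lambda v, w: v >= w,
    "<=": lambda v, w: v <= w,
    ">": lambda v, w: v > w,
    "<": lambda v, w: v < w,
}
_CLAUSE = re.compile(r"^(==|!=|>=|<=|>|<)(.+)$")


def parse_version(text):
    """Turn '1.10.0' into (1, 10); only plain dotted numeric versions are handled."""
    parts = [int(part) for part in text.strip().split(".")]
    while len(parts) > 1 and parts[-1] == 0:  # so that '1.0' equals '1.0.0'
        parts.pop()
    return tuple(parts)


def satisfies(version, constraint):
    """Check a version against comma-separated clauses like '>=1.0,<2.0'."""
    for clause in filter(None, (c.strip() for c in constraint.split(","))):
        match = _CLAUSE.match(clause)
        if match is None:
            raise ValueError(f"unsupported constraint clause: {clause!r}")
        op, bound = match.groups()
        if not _OPS[op](parse_version(version), parse_version(bound)):
            return False
    return True


def resolve_latest(roots, index):
    """Greedily pin each package to its latest matching version.

    roots: {name: constraint}
    index: {name: {version: {dependency_name: constraint}}} (hypothetical stand-in
           for what a registry API would return)
    """
    resolved = {}
    queue = list(roots.items())
    while queue:
        name, constraint = queue.pop(0)
        if name in resolved:
            # No backtracking: a later, incompatible constraint is an error.
            if not satisfies(resolved[name], constraint):
                raise RuntimeError(f"conflict on {name}: {resolved[name]} vs {constraint}")
            continue
        candidates = [v for v in index.get(name, {}) if satisfies(v, constraint)]
        if not candidates:
            raise LookupError(f"no version of {name} matches {constraint!r}")
        best = max(candidates, key=parse_version)
        resolved[name] = best
        queue.extend(index[name][best].items())
    return resolved
```

Because the first pick for a package is never revisited, this can fail on inputs that a backtracking solver would resolve, which is the gap the dedicated resolvers mentioned above are meant to close.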

pombredanne commented 2 years ago

Some updates that are likely relevant here: https://github.com/nexB/python-inspector is now out. It has been designed specifically to be integrated into ORT and to resolve pip dependencies without the constraints of running pip. See https://github.com/nexB/ort/pull/1 for the working ORT integration, which we are refining there first before submitting it to ORT proper.

python-inspector does resolve transitive dependencies.

oss-review-toolkit / ort

Improve resolution of Python / PIP dependencies #4637